Add missing comma in cutlass/arch/mma_sm90.h

expose StoreT parameter for potential speed (#838)

* expose StoreT parameter for potential speed
* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-11 09:43:30 -08:00 · 2023-03-10 12:58:17 -05:00 · 2023-03-09 23:24:47 -05:00 · 2023-03-09 23:22:56 -05:00 · 2023-03-09 15:27:40 -05:00 · 2023-03-07 11:53:01 -05:00
6973 changed files with 1123956 additions and 299614 deletions
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@ -0,0 +1,23 @@
+---
+name: Bug report
+about: Create a bug report to help us improve CUTLASS
+title: "[BUG]"
+labels: "? - Needs Triage, bug"
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**Steps/Code to reproduce bug**
+Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Environment details (please complete the following information):**
+ - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)] 
+
+**Additional context**
+Add any other context about the problem here.
--- a/.github/ISSUE_TEMPLATE/documentation_request.md
+++ b/.github/ISSUE_TEMPLATE/documentation_request.md
@ -0,0 +1,35 @@
+---
+name: Documentation request
+about: Report incorrect or needed documentation to improve CUTLASS
+title: "[DOC]"
+labels: "? - Needs Triage, documentation"
+assignees: ''
+
+---
+
+## Report incorrect documentation
+
+**Location of incorrect documentation**
+Provide links and line numbers if applicable.
+
+**Describe the problems or issues found in the documentation**
+A clear and concise description of what you found to be incorrect.
+
+**Steps taken to verify documentation is incorrect**
+List any steps you have taken:
+
+**Suggested fix for documentation**
+Detail proposed changes to fix the documentation if you have any.
+
+---
+
+## Report needed documentation
+
+**Report needed documentation**
+A clear and concise description of what documentation you believe is needed and why.
+
+**Describe the documentation you'd like**
+A clear and concise description of what you want to happen.
+
+**Steps taken to search for needed documentation**
+List any steps you have taken:
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for CUTLASS
+title: "[FEA]"
+labels: "? - Needs Triage, feature request"
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I wish I could use CUTLASS to do [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context, code examples, or references to existing implementations about the feature request here.
--- a/.github/ISSUE_TEMPLATE/submit_question.md
+++ b/.github/ISSUE_TEMPLATE/submit_question.md
@ -0,0 +1,10 @@
+---
+name: Submit question
+about: Ask a general question about CUTLASS
+title: "[QST]"
+labels: "? - Needs Triage, question"
+assignees: ''
+
+---
+
+**What is your question?**
--- a/.github/workflows/labeler.yml
+++ b/.github/workflows/labeler.yml
@ -0,0 +1,11 @@
+name: "Pull Request Labeler"
+on:
+- pull_request_target
+
+jobs:
+  triage:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/labeler@main
+      with:
+        repo-token: "${{ secrets.GITHUB_TOKEN }}"
--- a/.github/workflows/new-issues-to-triage-projects.yml
+++ b/.github/workflows/new-issues-to-triage-projects.yml
@ -0,0 +1,35 @@
+name: Auto Assign New Issues to Triage Project
+
+on:
+  issues:
+    types: [opened]
+
+env:
+  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+jobs:
+  assign_one_project:
+    runs-on: ubuntu-latest
+    name: Assign New Issues to Triage Project
+    steps:
+    - name: Process bug issues
+      uses: docker://takanabe/github-actions-automate-projects:v0.0.1
+      if: contains(github.event.issue.labels.*.name, 'bug') && contains(github.event.issue.labels.*.name, '? - Needs Triage')
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
+        GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
+    - name: Process feature issues
+      uses: docker://takanabe/github-actions-automate-projects:v0.0.1
+      if: contains(github.event.issue.labels.*.name, 'feature request') && contains(github.event.issue.labels.*.name, '? - Needs Triage')
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
+        GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
+    - name: Process other issues
+      uses: docker://takanabe/github-actions-automate-projects:v0.0.1
+      if: contains(github.event.issue.labels.*.name, '? - Needs Triage') && (!contains(github.event.issue.labels.*.name, 'bug') && !contains(github.event.issue.labels.*.name, 'feature request'))
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
+        GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
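
The three `if:` conditions in this workflow can be read as a small routing table over an issue's labels. The following is a minimal Python sketch of that logic, not part of the workflow itself; the function name `matching_steps` is hypothetical, and the case-insensitive comparison is an assumption based on how GitHub documents its `contains` expression function.

```python
NEEDS_TRIAGE = "? - Needs Triage"


def matching_steps(labels):
    """Return the names of the workflow steps whose `if:` condition would hold.

    Assumption: label comparison is case-insensitive, mirroring the documented
    behaviour of the `contains` expression function.
    """
    names = {label.casefold() for label in labels}
    triage = NEEDS_TRIAGE.casefold() in names
    is_bug = "bug" in names
    is_feature = "feature request" in names

    steps = []
    if is_bug and triage:
        steps.append("Process bug issues")
    if is_feature and triage:
        steps.append("Process feature issues")
    if triage and not is_bug and not is_feature:
        steps.append("Process other issues")
    return steps
```

Note that the first two conditions are not mutually exclusive: an issue carrying both `bug` and `feature request` together with `? - Needs Triage` would match two steps, while an issue without the triage label matches none.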
--- a/.github/workflows/stale.yml
+++ b/.github/workflows/stale.yml
@ -0,0 +1,57 @@
+name: Mark inactive issues and pull requests
+
+on:
+  schedule:
+    - cron: "0 * * * *"
+
+jobs:
+  mark-inactive-30d:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Mark 30 day inactive issues and pull requests
+        uses: actions/stale@v3
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          stale-issue-message: >
+            This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days.
+            Please close this issue if no further response or action is needed.
+            Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
+            This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
+          stale-issue-label: "inactive-30d"
+          exempt-issue-labels: "0 - Blocked,0 - Backlog,good first issue"
+          days-before-issue-stale: 30
+          days-before-issue-close: -1
+          stale-pr-message: >
+            This PR has been labeled `inactive-30d` due to no recent activity in the past 30 days.
+            Please close this PR if it is no longer required.
+            Otherwise, please respond with a comment indicating any updates.
+            This PR will be labeled `inactive-90d` if there is no activity in the next 60 days.
+          stale-pr-label: "inactive-30d"
+          exempt-pr-labels: "0 - Blocked,0 - Backlog,good first issue"
+          days-before-pr-stale: 30
+          days-before-pr-close: -1
+          operations-per-run: 50
+  mark-inactive-90d:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Mark 90 day inactive issues and pull requests
+        uses: actions/stale@v3
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          stale-issue-message: >
+            This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days.
+            Please close this issue if no further response or action is needed.
+            Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
+          stale-issue-label: "inactive-90d"
+          exempt-issue-labels: "0 - Blocked,0 - Backlog,good first issue"
+          days-before-issue-stale: 90
+          days-before-issue-close: -1
+          stale-pr-message: >
+            This PR has been labeled `inactive-90d` due to no recent activity in the past 90 days.
+            Please close this PR if it is no longer required.
+            Otherwise, please respond with a comment indicating any updates.
+          stale-pr-label: "inactive-90d"
+          exempt-pr-labels: "0 - Blocked,0 - Backlog,good first issue"
+          days-before-pr-stale: 90
+          days-before-pr-close: -1
+          operations-per-run: 50
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,2 @@
+# PyCache files
+__pycache__/
--- a/tools/external/googletest/googlemock/build-aux/.keep
+++ b/tools/external/googletest/googlemock/build-aux/.keep
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -0,0 +1,330 @@
+# NVIDIA CUTLASS Changelog
+
+
+## [3.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.0.0) (2023-01-23)
+* [CuTe](/media/docs/cute/00_quickstart.md), a [new core library and backend](/include/cute) for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
+* [A new conceptual operation hierarchy](media/docs/cutlass_3x_design.md) that replaces the architecture-centric hierarchy of CUTLASS 2.x and [documentation for CUTLASS 3.0's GEMM API changes](/media/docs/gemm_api_3x.md).
+* Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same [`device::GemmUniversalAdapter`](include/cutlass/gemm/device/gemm_universal_adapter.h) and [`kernel::GemmUniversal`](include/cutlass/gemm/kernel/gemm_universal.hpp) types, allowing users to include both APIs in the same translation units. More information can be found in the [3.x backwards compatibility section](media/docs/cutlass_3x_backwards_compatibility.md).
+* Updates to [Functionality](media/docs/functionality.md), which tells users which kernels are supported via CUTLASS-2 and CUTLASS-3.
+* Updates to [Compatibility](/README.md#compatibility) Section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures and [Target Architecture](/README.md#Target-Architecture).
+* New warp-specialized GEMM [kernel schedules](include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [mainloops](include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
+* Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
+* [CUTLASS library integration](/tools/library/src/gemm_operation_3x.hpp) for 3.x API kernels built through the new `CollectiveBuilder` API, enabling CUTLASS profiler.
+* Support for [Hopper GEMMs](examples/48_hopper_warp_specialized_gemm) through the new 3.0 API with CuTe-based exposure of the Hopper [Tensor Memory Accelerator](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor) and [WGMMA Tensor Core](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) features.
+* Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples [48](examples/48_hopper_warp_specialized_gemm), [49](examples/49_hopper_gemm_schedules_with_collective_builder), and [50](examples/50_hopper_gemm_with_epilogue_swizzle).
+
+## [2.11.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.11.0) (2022-11-19)
+* [Stream-K](/examples/47_ampere_gemm_universal_streamk), which is a new general way to do split-K.  It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
+* [Fused multi-head attention Kernel](/examples/41_fused_multi_head_attention).  It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length.  Both versions just need one kernel.
+* [Dual GEMM](/examples/45_dual_gemm), which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
+* Hopper improves [double precision matrix multiplication](/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) by 2x compared to Ampere at iso-clocks. It has been supported since CUDA 11.8.
+* [BLAS3](/test/unit/gemm/device/hemm_cf64_cf64_cf64_tensor_op_f64_sm90.cu) functions with Hopper's new double precision matrix multiplication instructions.
+* [ELL Block Sparse GEMM](/examples/43_ell_block_sparse_gemm), which uses an [ELL matrix](https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/) to describe the sparsity of the A matrix.  B and output matrices are still dense. The block size can be arbitrary.
+* Optimized [Group Conv](/examples/42_ampere_tensorop_group_conv) for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.
+* [Optimized DepthWise Conv](/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu).  Two new modes are added:
+  * [kOptimized](/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - uses direct conv to compute instead of implicit GEMM.
+    * The restrictions are: 1) input channel, output channel, and group number should each be a multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration.
+  * [kFixedStrideDilation](/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_fixed_stride_dilation_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter, stride, or dilation values are not recommended.
+    * The restrictions are: 1) input channel, output channel, and group number should each be a multiple of (128 / sizeof(input element)). 2) The input filter size, stride, and dilation should be the same as the template parameter configuration.
+* [Scripts](/examples/44_multi_gemm_ir_and_codegen) to fuse multiple back-to-back GEMMs.  Their implementation was discussed in a GTC'22 Spring [talk](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41606/).
+* [FP8 data type definition](/include/cutlass/float8.h) and [conversion routines](/include/cutlass/numeric_conversion.h#L1274-2115).
+* Updates and bugfixes from the community (thanks!).  Big shout out to Meta's [xFormers](https://github.com/facebookresearch/xformers).
+
+* **Deprecation announcement:** CUTLASS plans to deprecate the following:
+  * Maxwell and Pascal GPU architectures
+  * Ubuntu 16.04
+  * CUDA 10.2
+
+## [2.10.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.10.0) (2022-08-23)
+* [CUTLASS Python](/examples/40_cutlass_py) now supports GEMM, CONV, Group GEMM for different data types as well as different epilogue flavours.  
+* Optimizations for CUTLASS's [Grouped GEMM](examples/24_gemm_grouped/gemm_grouped.cu) kernel.  Threadblock scheduling part is improved.  Some computation can be moved to the host side if applicable.  [Grouped Syr2k](examples/38_syr2k_grouped/syr2k_grouped.cu) kernels are added, too.
+* Optimizations for [GEMM+Softmax](examples/35_gemm_softmax).  All the reduction computation is fused into the previous GEMM.  More template arguments are provided to fine tune the performance.
+* [Grouped GEMM for Multihead Attention](examples/41_multi_head_attention).  This general, grouped-GEMM-based MHA does not require all GEMMs to have the same sequence length, which makes it most useful for natural language processing.
+* [GEMM + Layer norm fusion for Ampere](examples/37_gemm_layernorm_gemm_fusion/) splits the layernorm into two parts, which are fused into the preceding and the following GEMM respectively.  In addition to using the square sum to compute the variance of the layernorm, [Shift-K](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data) is provided in case the square sum raises numerical issues.
+* [GEMM Epilogue Permutation Fusion](examples/39_gemm_permute) can apply user provided permutation layout mapping in the GEMM epilogue.
+* [Grouped convolution targeting implicit GEMM](test/unit/conv/device/group_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) introduces the first group convolution implementation to CUTLASS.  It is an Analytical implementation, not an Optimized one.  The restrictions are: 1) the input and output channel numbers should be multiples of the group number. 2) split-K is not supported.  The implementation has 2 modes:
+  * kSingleGroup: output channel per group is a multiple of Threadblock tile N.
+  * kMultipleGroup: Threadblock tile N is a multiple of output channel per group.
+* [Depthwise separable convolution](test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) introduces the first depthwise convolution, which is also Analytical for now.  The restrictions are: 1) SIMT only. 2) No split-K. 3) The input channel count, output channel count, and group number must all be equal.
+* Standalone [Layernorm](/tools/util/include/cutlass/util/device_layernorm.h) and [Pooling](/tools/util/include/cutlass/util/device_nhwc_pooling.h) kernels.
+* [Back-to-back GEMM/CONV](examples/13_two_tensor_op_fusion) relaxes the requirement that the first GEMM K dimension needs to be a multiple of the Threadblock Tile K dimension.
+* Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
+* Updates and bugfixes from the community (thanks!)
+
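
The Shift-K option mentioned in the 2.10.0 notes refers to the shifted-data variance formula on the linked Wikipedia page. The sketch below compares it with the plain square-sum formula in single precision. It is an illustration under stated assumptions, not the CUTLASS kernel: the function names are hypothetical, the shift `k` defaults to the first element, and the population variance (divide by `n`) is used.

```python
import numpy as np


def variance_square_sum(x):
    """Variance as E[x^2] - E[x]^2, computed entirely in float32."""
    x = np.asarray(x, dtype=np.float32)
    n = x.size
    mean = np.sum(x) / n
    return np.sum(x * x) / n - mean * mean


def variance_shift_k(x, k=None):
    """Same formula applied to the shifted data d = x - k (Shift-K)."""
    x = np.asarray(x, dtype=np.float32)
    if k is None:
        k = x[0]  # assumption: any value close to the mean works as the shift
    d = x - k
    n = x.size
    mean_d = np.sum(d) / n
    return np.sum(d * d) / n - mean_d * mean_d


# Data with a large mean and a small spread, where the square sum loses precision.
rng = np.random.default_rng(0)
data = (1.0e4 + rng.random(1024)).astype(np.float32)
reference = np.var(data.astype(np.float64))
print("square sum error:", abs(variance_square_sum(data) - reference))
print("shift-K error:   ", abs(variance_shift_k(data) - reference))
```

Shifting leaves the variance unchanged mathematically, but the shifted values are small, so their squares no longer cancel catastrophically in float32.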
+## [2.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.9.0) (2022-04-21)
+
+* [First layer Convolution kernels](/test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) specialized for small channel counts and reduced alignment
+  * [Few channels](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h) specialization for reduced alignment capabilities
+  * [Fixed channels](/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h) further specialized when channel count perfectly matches the access vector size
+  * [Unit tests](/test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu)
+  * [Python-based instance emitter](/tools/library/scripts/generator.py) in the CUTLASS Library and support in the Profiler
+* [BLAS3](https://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference) operators accelerated by Tensor Cores
+  * Supported types: f32, cf32, f64, cf64, tf32x3, complex tf32x3
+  * [HERK](/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py)
+  * [SYRK](/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py)
+  * [SYMM](/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/tools/library/scripts/symm_operation.py)
+  * [TRMM](/test/unit/gemm/device/trmm_f32n_f32t_f32t_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/tools/library/scripts/trmm_operation.py)
+  * [Unit tests](/test/unit/gemm/device/testbed_rank_k_universal.h)
+* [CUTLASS Python](/examples/40_cutlass_py) demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using [CUDA Python](https://developer.nvidia.com/cuda-python)
+  * [Python-based runtime](/tools/library/scripts/rt.py) interoperable with existing emitters
+* [GEMM + Softmax example](/examples/35_gemm_softmax)
+* [Gather and Scatter Fusion with GEMM](/examples/36_gather_scatter_fusion) can gather inputs and scatter outputs based on index vectors in the same GEMM kernel.
+  * It can select random rows in a row major matrix.
+  * It can select random columns in a column major matrix.
+* [Back-to-back GEMM/CONV](examples/13_two_tensor_op_fusion) fully supports buffering the first GEMM/CONV results in the shared memory for the latter one to use.  It can eliminate register spill when the tile size is big.  Additionally, bias vector add is supported in the first GEMM/CONV.
+  * Supported kernels: GEMM and CONV.
+  * Supported types: fp16 and int8.
+  * Supported architectures: Turing and Ampere.
+* [Transposed Convolution](/examples/34_transposed_conv2d) (a.k.a. Deconvolution) support which reuses the Dgrad implementation.
+* [Utility functions](/tools/util/include/cutlass/util) that can pad NHWC and convert between NCHW and NHWC.
+* [Small alignment implicit gemm](https://github.com/NVIDIA/cutlass/issues/242) support for Fprop/Dgrad/Wgrad so that padding is no longer required to use Tensor Cores in these kernels.
+* Epilogue enhancement:
+  * Eliminate bank conflicts in int8 tensor core kernels.
+  * Half2 usage if epilogue compute type is fp16.
+  * More activation functions: Silu, Hardswish, Leaky Relu.
+  * New elementwise fusion pattern for [residual block](/include/cutlass/epilogue/thread/linear_combination_residual_block.h).
+* [Group GEMM](/examples/24_gemm_grouped) thread block number calculation fix which helps to launch the intended number of threadblocks to fully occupy the GPUs.
+* [Parallel GEMM splitk](https://github.com/NVIDIA/cutlass/pull/277) support in the CUTLASS profiler.
+* Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
+* Updates and bugfixes from the community (thanks!)
+
+
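
The gather and scatter fusion listed in the 2.9.0 notes can be described by a host-side reference: pick rows of a row-major input by index, multiply, and write the result rows to indexed positions of the output. The NumPy sketch below shows only these semantics. The function and parameter names are hypothetical, and which operands the actual example gathers or scatters is not stated here, so treat the choice of `A` rows and output rows as an assumption.

```python
import numpy as np


def gather_gemm_scatter(a, b, gather_idx, scatter_idx, out_rows):
    """Reference semantics: D[scatter_idx[i], :] = A[gather_idx[i], :] @ B.

    Rows of D that are not named in scatter_idx stay zero.
    """
    d = np.zeros((out_rows, b.shape[1]), dtype=a.dtype)
    d[scatter_idx] = a[gather_idx] @ b
    return d


a = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 x 3, row-major
b = np.eye(3)                                      # identity keeps rows readable
d = gather_gemm_scatter(a, b, gather_idx=[2, 0], scatter_idx=[1, 3], out_rows=4)
print(d)
```

A fused kernel would perform the same indexing while loading and storing tiles, which avoids materialising the gathered copy of `A`.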
+## [2.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.8.0) (2021-11-19)
+
+* **TF32x3:** emulated single-precision using Tensor Cores
+  * 45+ TFLOPs on NVIDIA A100
+  * [GEMM SDK example](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu) (real)
+  * [COMPLEX GEMM SDK example](/examples/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm.cu) (complex)
+  * [Implicit GEMM Convolution SDK example](/examples/28_ampere_3xtf32_fast_accurate_tensorop_fprop/ampere_3xtf32_fast_accurate_tensorop_fprop.cu)
+* **Mainloop fusion for Convolution:** convolution with fused per-channel scale-bias-relu
+  * [Conv Fprop SDK example](/examples/25_ampere_fprop_mainloop_fusion/ampere_fprop_mainloop_fusion.cu)
+  * [Conv WGrad SDK example](/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu) 
+  * [cutlass::conv::device::ImplicitGemmConvolutionFusion](/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h)
+* **Grouped GEMM:** similar to batched GEMM with distinct problem size per group
+  * [SDK example](/examples/24_gemm_grouped) with performance comparison with Batched Strided GEMM
+  * [cutlass::gemm::device::GemmGrouped](/include/cutlass/gemm/device/gemm_grouped.h)
+* [Implicit GEMM Convolution fusion](/examples/13_two_tensor_op_fusion/) supports staging the 1st convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure.
+* Optimal performance using [**CUDA 11.5**](https://developer.nvidia.com/cuda-downloads)
+* Updates from the community (thanks!)
+
+* **Deprecation announcement:** CUTLASS plans to deprecate the following:
+  * Maxwell and Pascal GPU architectures
+  * Ubuntu 16.04
+  * CUDA 10.2
+
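
The TF32x3 entry in the 2.8.0 notes describes emulating single precision with Tensor Cores. A commonly described way to do this is to split each FP32 operand into a "big" TF32 part and a "small" TF32 remainder and sum three products. The NumPy sketch below illustrates that idea only. It is an assumption about the scheme rather than the CUTLASS kernel, the helper names are hypothetical, and TF32 conversion is modelled by truncating the low 13 mantissa bits, whereas real hardware may round.

```python
import numpy as np


def to_tf32(x):
    """Model TF32 by keeping sign, exponent and the top 10 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)


def matmul_tf32(a, b):
    """Single TF32 product with float32 accumulation."""
    return to_tf32(a) @ to_tf32(b)


def matmul_3xtf32(a, b):
    """Three TF32 products: small*big + big*small + big*big."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    a_big, b_big = to_tf32(a), to_tf32(b)
    a_small, b_small = to_tf32(a - a_big), to_tf32(b - b_big)
    # Add the small correction terms first so they are not absorbed too early.
    return a_small @ b_big + a_big @ b_small + a_big @ b_big


rng = np.random.default_rng(0)
a = rng.random((64, 64)).astype(np.float32)
b = rng.random((64, 64)).astype(np.float32)
ref = a.astype(np.float64) @ b.astype(np.float64)
print("1xTF32 max error:", np.abs(matmul_tf32(a, b) - ref).max())
print("3xTF32 max error:", np.abs(matmul_3xtf32(a, b) - ref).max())
```

The dropped `small*small` term is below float32 resolution for typical inputs, which is why three products are enough to approach FP32 accuracy.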
+## [2.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.7.0) (2021-09-24)
+  * Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
+  * [Strided DGRAD (optimized iterators)](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
+  * [Half-precision GELU_taylor activation functions](/include/cutlass/epilogue/thread/activation.h#L196)
+    * Use these when accumulation and epilogue compute types are all `cutlass::half_t`
+  * Tuning and bug fixes to [fused GEMM + GEMM example](/examples/13_two_tensor_op_fusion/)
+  * Support for smaller than 128b aligned Convolutions: [see examples](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
+  * Caching of results to accelerate Convolution [unit tests](test/unit/conv/device/cache_testbed_output.h)
+    * Can be enabled or disabled by running `cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF`
+  * Corrections and bug fixes reported by the CUTLASS community
+    * Thank you for filing these issues!
+
+## [2.6.1](https://github.com/NVIDIA/cutlass/releases/tag/v2.6.1) (2021-09-03)
+  * Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
+  * Tuning for GEMMs fused with partial reductions
+  * Corrections and bug fixes reported by the CUTLASS community
+    * Thank you for filing these issues!
+
+## [2.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.6.0) (2021-07-22)
+  * Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
+    * Adopt the new L2 prefetch feature in [cp.async](/include/cutlass/arch/memory.h) and [global load](/include/cutlass/arch/memory_sm80.h)
+  * Fused operators with GEMM and Convolution
+    * [Fused broadcast in epilogue](test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu)
+    * [Fused partial reduction in epilogue](/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu)
+  * 64b tensor strides and leading dimensions support for GEMMs
+  * Affine rank=2 matrix layouts 
+    * Row stride and column stride for matrices using [cutlass::layout::AffineRank2](/include/cutlass/layout/matrix.h)
+    * Support [FP64 tensor core](/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) and SIMT GEMM.
+  * [Batched GEMV](/test/unit/gemm/device/gemv.cu) preview implementation
+  * [New strided Dgrad](test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation
+    * Accelerates over previous implementation by cutting down redundant math by 4x
+    * Support using new `Dy` and `w` analytic iterators and existing `cutlass::conv::device::ImplicitGemmConvolution` interface
+  * Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
+    * Updates to [quaternion.h](/include/cutlass/quaternion.h) and [functional.h](/include/cutlass/functional.h)
+    * SDK Example for [GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_gemm/quaternion_conv.cu)
+    * [Unit tests for GEMM](/test/unit/gemm/device/simt_qgemm_nn_sm50.cu) and [Convolution](/test/unit/conv/device/conv2d_fprop_implicit_gemm_qf32nhwc_qf32nhwc_qf32nhwc_simt_f32_sm50.cu)
+  * Many improvements to the epilogue.
+    * Provide an [option](/include/cutlass/epilogue/threadblock/epilogue.h) to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
+    * Performance improvement for FP16 tensor core kernels
+    * Bug fixes
+  * Enhanced Clang support: the combination of Clang 13 and CUDA 11.4 can build and run kernels for Pascal and Ampere.
+  * Updated minimum CUDA Toolkit requirement to 10.2
+    * [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit) recommended
+  * Corrections and bug fixes reported by the CUTLASS community
+    * Thank you for filing these issues!
+
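
The affine rank-2 layout in the 2.6.0 notes gives a matrix an independent row stride and column stride. As a rough illustration, the element at `(row, col)` then lives at a linear offset of `row * row_stride + col * col_stride`. The Python sketch below uses hypothetical helper names and does not reproduce the `cutlass::layout::AffineRank2` interface.

```python
import numpy as np


def affine2_offset(row, col, row_stride, col_stride):
    """Linear offset of element (row, col) under an affine rank-2 layout."""
    return row * row_stride + col * col_stride


def affine2_matrix(storage, rows, cols, row_stride, col_stride):
    """Materialise the rows x cols matrix that the strides describe."""
    return np.array([[storage[affine2_offset(r, c, row_stride, col_stride)]
                      for c in range(cols)]
                     for r in range(rows)])


storage = np.arange(24)
row_major = affine2_matrix(storage, 4, 6, row_stride=6, col_stride=1)
transposed = affine2_matrix(storage, 6, 4, row_stride=1, col_stride=6)
print(row_major)
print(transposed)
```

Row-major and column-major layouts are the special cases where one of the two strides equals 1; with both strides free, the same storage can also be viewed as a transposed or strided sub-matrix.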
+## [2.5.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.5.0) (2021-02-26)
+  * Tensor reductions
+    * _m_-to-_n_ reductions of tensors with affine layout
+    * [Specializations](/test/unit/reduction/device/tensor_reduce_contiguous.cu) for reductions including contiguous dimension
+    * [Specializations](/test/unit/reduction/device/tensor_reduce_strided.cu) for reductions excluding contiguous dimension
+    * Custom reduction functors such as `cutlass::logical_and`
+    * Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
+  * Optimizations for 3-D convolution
+    * [Optimized tile iterators](include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h) using precomputed delta table for 3-D convolution
+    * Full coverage of [forward](test/unit/conv/device/conv3d_fprop_implicit_gemm_f16ndhwc_f16ndhwc_f32ndhwc_tensor_op_f32_sm80.cu) and [backwards](test/unit/conv/device/conv3d_dgrad_implicit_gemm_f16ndhwc_f16ndhwc_f32ndhwc_tensor_op_f32_sm80.cu) passes for 3D convolution
+  * [Fused Convolution+Convolution example](/examples/13_two_tensor_op_fusion/README.md)
+  * Corrections and bug fixes reported by the CUTLASS community
+    * Thank you for filing these issues!
+  
+
+## [2.4.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.4.0) (2020-11-19)
+  * Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
+    * Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
+    * Data type: FP32, complex<FP32>, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
+    * Spatial dimensions: 1-D, 2-D, and 3-D
+    * Layout: NHWC, NCxHWx
+  * Implicit GEMM convolution components: 
+    * Global memory iterators supporting Fprop, Dgrad, and Wgrad
+    * `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture
+    * `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures
+    * [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
+
+## [2.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.3.0) (2020-09-23)
+ * [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
+   * [Sparse Tensor Core GEMM kernels](test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu):
+     * Direct access to Sparse Tensor Cores and maximum performance via [`mma.sp.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends)
+   * Fast SGEMM targeting GeForce RTX 30-series CUDA Cores
+ * Minor Features:
+   * [Activation functions](/include/cutlass/epilogue/thread/activation.h) such as [GeLU](/include/cutlass/epilogue/thread/linear_combination_gelu.h) and [Sigmoid](/include/cutlass/epilogue/thread/linear_combination_sigmoid.h)
+   * Small [matrix](/include/cutlass/matrix.h) and [quaternion](/include/cutlass/quaternion.h) template classes in device code
+   * [Floating-point constants](/include/cutlass/constants.h)
+ * NVIDIA Ampere GPU Architecture examples and documentation:
+   * [Tensor Float 32](/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu)
+   * [Sparse Tensor Cores](/examples/15_ampere_sparse_tensorop_gemm/ampere_sparse_tensorop_gemm.cu)
+   * Documentation added on CUTLASS [efficient row-major epilogue](/media/docs/gemm_api.md#efficient-epilogue)
+
+## [2.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.2.0) (2020-06-08)
+ * [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
+   * Fast Tensor Core operations: 
+    * Maximum performance via [`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends)
+    * Tensor Float 32, BFloat16, and double-precision data types
+    * Mixed integer data types (int8, int4, bin1)
+   * Asynchronous copy for deep software pipelines via [`cp.async`](https://docs.nvidia.com/cuda/parallel-thread-execution)   
+   * Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745) (free registration required)
+ * Features:
+   * SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
+   * Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
+   * Gaussian complex GEMMs using 3m complex multiply algorithm
+   * Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
+ * Policy updates:
+   * [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) needed to enable NVIDIA Ampere Architecture features
+   * Disabled F16C by default for compatibility - enable on cmake command line with `-DCUTLASS_ENABLE_F16C=ON`
+
+## [2.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.1.0) (2020-04-06)
+ * BLAS-style host-side API added to [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
+    * API to launch compiled kernel instances for GEMM and planar complex GEMM
+ * Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
+    * Computes complex matrix products on matrices stored as disjoint real and imaginary parts
+    * [SDK Examples of Planar Complex GEMMs](/examples/10_planar_complex/planar_complex.cu)
+ * Minor enhancements and bug fixes
+
+## [2.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.0.0) (2019-11-19)
+ * Substantially refactored for
+    * Better performance, particularly for native Turing Tensor Cores
+    * Robust and durable templates spanning the design space
+    * Encapsulated functionality embodying modern C++11 programming techniques
+    * Optimized containers and data types for efficient, generic, portable device code
+  * Updates to:
+    * [Quick start guide](/media/docs/quickstart.md)
+    * [Documentation](/README.md#documentation)
+    * [Utilities](/media/docs/utilities.md)
+    * [CUTLASS Profiler](/media/docs/profiler.md)
+ * Native Turing Tensor Cores
+    * Efficient GEMM kernels targeting Turing Tensor Cores
+    * Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
+ * Coverage of existing CUTLASS functionality
+    * GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
+    * Volta Tensor Cores through native mma.sync and through WMMA API
+    * Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
+    * Batched GEMM operations
+    * Complex-valued GEMMs
+ * **Note: a host compiler supporting C++11 or greater is required.**
+
+# CUTLASS 1.x
+
+## [1.3.2](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.2) (2019-07-09)
+ * Performance improvement for Volta Tensor Cores TN and TT layouts.
+
+## [1.3.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.1) (2019-04-09)
+ * Corrected NVRTC unit tests.
+
+## [1.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.0) (2019-03-20)
+ * Efficient GEMM kernel targeting Volta Tensor Cores via `mma.sync` instruction added in CUDA 10.1.
+
+## [1.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.2.0) (2018-10-26)
+ * Parallelized reductions across threadblocks ("Split-K")
+   * Improved IGEMM performance
+ * Batched strided WMMA GEMMs
+
+## [1.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.1.0) (2018-09-19)
+  * Turing Features
+    * WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit
+  * Batched Strided GEMM
+  * Threadblock rasterization strategies
+    * Improved performance for adverse problem sizes and data layouts
+  * Extended CUTLASS Core components
+    * Tensor views support arbitrary matrix and tensor layouts
+    * Zip iterators for structuring multiple data streams
+  * Enhanced CUTLASS utilities
+    * Reference code for tensor operations in host and device code
+    * Added HostMatrix<> for simplified matrix creation
+  * Examples
+    * Basic GEMM, tensor views, CUTLASS utilities, batched GEMM, WMMA GEMM
+
+## [1.0.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.0.1) (2018-06-11)
+
+  * Intra-threadblock reduction added for small threadblock tile sizes
+    * sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
+    * igemm_32x32x128
+  * GEMM _K_ residue handled during prologue prior to mainloop
+  * Replaced Google Test copy with submodule. Use `git submodule update --init --recursive`
+
+## [1.0.0](https://github.com/NVIDIA/cutlass/commit/2028ebe120aab22bfd0b2baf8902d4c9627eb33f) (2018-05-16)
+
+  * Substantial rewrite to accommodate new architecture
+  * Kernels: SGEMM, DGEMM, IGEMM, HGEMM, WMMA GEMM
+  * Unit and performance tests
+
+## [0.0.1](https://github.com/NVIDIA/cutlass/commit/d08ba8ac46e2fa3f745e070c390182edb56b2e91) (2017-12-04)
+
+  * Initial release
+
+
+## Copyright
+
+Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: BSD-3-Clause
+
+```
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions are met:
+
+  1. Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+  3. Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+```
+
--- a/CITATION.cff
+++ b/CITATION.cff
@ -0,0 +1,112 @@
+cff-version: 1.2.0
+title: CUTLASS
+message: >-
+  If you use this software, please cite using the
+  following metadata.
+type: software
+authors:
+  - given-names: Vijay
+    family-names: Thakkar
+    email: vithakkar@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Pradeep
+    family-names: Ramani
+    email: prramani@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Cris
+    family-names: Cecka
+    email: ccecka@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Aniket
+    family-names: Shivam
+    email: ashivam@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Honghao
+    family-names: Lu
+    email: honghaol@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Ethan
+    family-names: Yan
+    email: etyan@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Jack
+    family-names: Kosaian
+    email: jkosaian@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Mark
+    family-names: Hoemmen
+    email: mhoemmen@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Haicheng
+    family-names: Wu
+    email: haichengw@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Andrew
+    family-names: Kerr
+    email: akerr@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Matt
+    family-names: Nicely
+    email: mnicely@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Duane
+    family-names: Merrill
+    email: dumerrill@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Dustyn
+    family-names: Blasig
+    email: dblasig@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Fengqi
+    family-names: Qiao
+    email: fqiao@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Piotr
+    family-names: Majcher
+    email: pmajcher@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Paul
+    family-names: Springer
+    email: pspringer@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Markus
+    family-names: Hohnerbach
+    affiliation: NVIDIA
+    email: mhohnerbach@nvidia.com
+  - given-names: Jin
+    family-names: Wang
+    email: jinw@nvidia.com
+    affiliation: NVIDIA
+  - given-names: Manish
+    family-names: Gupta
+    affiliation: Google
+    email: manigupta@google.com
+
+
+repository-code: 'https://github.com/NVIDIA/cutlass'
+abstract: >-
+  CUTLASS is a collection of CUDA C++ template
+  abstractions for implementing high-performance
+  matrix-multiplication (GEMM) and related
+  computations at all levels and scales within CUDA.
+  It incorporates strategies for hierarchical
+  decomposition and data movement similar to those
+  used to implement cuBLAS and cuDNN. CUTLASS
+  decomposes these "moving parts" into reusable,
+  modular software components abstracted by C++
+  template classes. These thread-wide, warp-wide,
+  block-wide, and device-wide primitives can be
+  specialized and tuned via custom tiling sizes, data
+  types, and other algorithmic policy. The resulting
+  flexibility simplifies their use as building blocks
+  within custom kernels and applications.
+keywords:
+  - 'cutlass, tensor cores, cuda, cute, nvidia, gpu, linear algebra, matrix computations'
+license: BSD-3-Clause
+license-url: https://github.com/NVIDIA/cutlass/blob/v3.0.0/LICENSE.txt
+version: '3.0.0'
+date-released: '2023-01-23'
+identifiers:
+  - type: url
+    value: "https://github.com/NVIDIA/cutlass/tree/v3.0.0"
+    description: The GitHub release URL of tag 3.0.0
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -1,66 +1,159 @@
-# Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
 #
-# Redistribution and use in source and binary forms, with or without modification, are permitted
-# provided that the following conditions are met:
-#     * Redistributions of source code must retain the above copyright notice, this list of
-#       conditions and the following disclaimer.
-#     * Redistributions in binary form must reproduce the above copyright notice, this list of
-#       conditions and the following disclaimer in the documentation and/or other materials
-#       provided with the distribution.
-#     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
-#       to endorse or promote products derived from this software without specific prior written
-#       permission.
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
 #
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
-# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
-# FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
-# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
-# BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
-# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
-# STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

-cmake_minimum_required(VERSION 3.3.0)
+cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

-set(CUTLASS_LANGUAGES CXX)
-
-# CMake 3.9.0 has native support for CUDA without the need of the CUDA package. Use it!
-if(WIN32 AND NOT ${CMAKE_VERSION} VERSION_LESS "3.9.0")
-  list(APPEND CUTLASS_LANGUAGES CUDA)
-  set(CUTLASS_NATIVE_CUDA TRUE)
-
-  macro(cutlass_add_executable)
-    add_executable(${ARGN})
-  endmacro()
+if(cutlass_LOADED)
+  # If CUTLASS has been previously fetched and loaded, don't do it again.
+  return()
 else()
-  # FindCUDA fails to detect VS 2017 due to a changed directory format of the toolkits.
-  # For this configuration we need CMake >= 3.9.0 to use the native CUDA support.
-  if (WIN32 AND MSVC_VERSION GREATER 1800)
-    message(FATAL_ERROR "Please upgrade CMake to version >= 3.9.0 to support Visual Studio 2017 or higher")
-  endif()
-
-  # Fall back to the FindCUDA version to create an executable with CUDA files
-  macro(cutlass_add_executable)
-    cuda_add_executable(${ARGN})
-  endmacro()
+  set(cutlass_LOADED ON)
+  set(CUTLASS_DIR ${CMAKE_CURRENT_SOURCE_DIR} CACHE PATH "CUTLASS Repository Directory")
 endif()

-project(CUTLASS ${CUTLASS_LANGUAGES})
+message(STATUS "CMake Version: ${CMAKE_VERSION}")
+set(IMPLICIT_CMAKE_CXX_STANDARD OFF CACHE BOOL "Do not explicitly specify -std=c++11 if set")

-# check if the configuration is supported
-if( NOT CMAKE_SIZEOF_VOID_P EQUAL 8 )
-    message(FATAL_ERROR "CUTLASS requires a 64-bit compiler!")
+project(CUTLASS VERSION 3.0.0 LANGUAGES CXX)
+include(${CMAKE_CURRENT_SOURCE_DIR}/CUDA.cmake)
+
+if (CUDA_VERSION VERSION_LESS 11.3)
+  message(WARNING "CUTLASS ${CUTLASS_VERSION} requires CUDA 11.4 or higher, and strongly recommends CUDA 11.8 or higher.")
+elseif (CUDA_VERSION VERSION_LESS 11.4)
+  message(WARNING "CUTLASS ${CUTLASS_VERSION} support for CUDA ${CUDA_VERSION} is deprecated, please use CUDA 11.8 or higher.")
+endif()
+
+if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 7.5)
+  message(FATAL_ERROR "GCC version must be at least 7.5!")
+endif()
+
+if (CUDA_COMPILER MATCHES "[Cc]lang" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 7.0)
+  message(FATAL_ERROR "Clang 7.0+ required for GPU compilation")
 endif()

-find_package(CUDA)
 find_package(Doxygen QUIET)

-# By default we want to build in Release mode to ensure that we're getting best performance
-if (NOT (CMAKE_BUILD_TYPE OR CONFIGURATION_TYPES))
-  set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose build level" FORCE)
-  # We do support Debug or Release builds
-  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release")
+#
+# CUTLASS 3.x requires C++17
+#
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CXX_EXTENSIONS OFF)
+
+if(CUTLASS_NATIVE_CUDA)
+  set(CMAKE_CUDA_STANDARD 17)
+  set(CMAKE_CUDA_STANDARD_REQUIRED ON)
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS --expt-relaxed-constexpr)
+else()
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS --std=c++17)
 endif()
+  
+if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
+  set(CMAKE_INSTALL_PREFIX install CACHE PATH "Default installation location." FORCE)
+endif()
+
+message(STATUS "Default Install Location: ${CMAKE_INSTALL_PREFIX}")
+
+set(CUTLASS_ENABLE_HEADERS_ONLY OFF CACHE BOOL "Enable only the header library")
+
+if(CUTLASS_ENABLE_HEADERS_ONLY)
+  set(CUTLASS_ENABLE_EXAMPLES_INIT OFF)
+  set(CUTLASS_ENABLE_TOOLS_INIT ON)
+  set(CUTLASS_ENABLE_LIBRARY_INIT OFF)
+else()
+  set(CUTLASS_ENABLE_EXAMPLES_INIT ON)
+  set(CUTLASS_ENABLE_TOOLS_INIT ON)
+  set(CUTLASS_ENABLE_LIBRARY_INIT ON)
+endif()
+
+set(CUTLASS_TEST_UNIT_ENABLE_WARNINGS OFF CACHE BOOL "Enable warnings on waived unit tests.")
+
+set(CUTLASS_ENABLE_EXAMPLES ${CUTLASS_ENABLE_EXAMPLES_INIT} CACHE BOOL "Enable CUTLASS Examples")
+set(CUTLASS_ENABLE_TOOLS ${CUTLASS_ENABLE_TOOLS_INIT} CACHE BOOL "Enable CUTLASS Tools")
+set(CUTLASS_ENABLE_LIBRARY ${CUTLASS_ENABLE_LIBRARY_INIT} CACHE BOOL "Enable CUTLASS Library")
+set(CUTLASS_ENABLE_PROFILER ${CUTLASS_ENABLE_LIBRARY} CACHE BOOL "Enable CUTLASS Profiler")
+set(CUTLASS_ENABLE_PERFORMANCE ${CUTLASS_ENABLE_PROFILER} CACHE BOOL "Enable CUTLASS Performance")
+
+if(${CMAKE_PROJECT_NAME} STREQUAL ${PROJECT_NAME})
+  set(CUTLASS_ENABLE_TESTS_INIT ${CUTLASS_ENABLE_LIBRARY})
+else()
+  set(CUTLASS_ENABLE_TESTS_INIT OFF)
+endif()
+
+set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests")
+
+if (CUTLASS_ENABLE_TESTS)
+  include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/googletest.cmake)
+endif()
+
+set(CUTLASS_NVCC_ARCHS_SUPPORTED "")
+if (CUDA_VERSION VERSION_GREATER_EQUAL 11.4 AND NOT CUDA_COMPILER MATCHES "[Cc]lang")
+  list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 70 72 75 80 86 87)
+endif()
+if (CUDA_VERSION VERSION_GREATER_EQUAL 11.8 AND NOT CUDA_COMPILER MATCHES "[Cc]lang")
+  list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 89 90)
+endif()
+if (CUDA_VERSION VERSION_GREATER_EQUAL 12.0 AND NOT CUDA_COMPILER MATCHES "[Cc]lang")
+  list(APPEND CUTLASS_NVCC_ARCHS_SUPPORTED 90a)
+endif()
+set(CUTLASS_NVCC_ARCHS ${CUTLASS_NVCC_ARCHS_SUPPORTED} CACHE STRING "The SM architectures requested.")
+set(CUTLASS_NVCC_ARCHS_ENABLED ${CUTLASS_NVCC_ARCHS} CACHE STRING "The SM architectures to build code for.")
+
+# Special policy introduced in CMake 3.13
+if (POLICY CMP0076)
+  cmake_policy(SET CMP0076 NEW)
+endif()
+
+include(GNUInstallDirs)
+
+link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64/stubs)
+
+###################################################################################################
+#
+# Configure CMake variables
+#
+###################################################################################################
+
+message(STATUS "CUDA Compilation Architectures: ${CUTLASS_NVCC_ARCHS_ENABLED}")
+
+if (NOT (CMAKE_BUILD_TYPE OR CONFIGURATION_TYPES))
+  # By default we want to build in Release mode to ensure that we're getting best performance.
+  set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose build level" FORCE)
+  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "RelWithDebInfo" "Release")
+endif()
+
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+if (DEFINED CMAKE_DEBUG_POSTFIX)
+  set(CUTLASS_LIBRARY_DEBUG_POSTFIX_INIT ${CMAKE_DEBUG_POSTFIX})
+else()
+  set(CUTLASS_LIBRARY_DEBUG_POSTFIX_INIT .debug)
+endif()
+set(CUTLASS_LIBRARY_DEBUG_POSTFIX ${CUTLASS_LIBRARY_DEBUG_POSTFIX_INIT} CACHE STRING "Default postfix value for debug libraries")

 if(WIN32)
  # On Windows we link against the shared (DLL) runtime. Change gtest settings to match this.
@ -69,96 +162,397 @@ endif()

 if (WIN32)
  # Enable more warnings and treat as errors
-  string(APPEND NVCC_FLAGS " -Xcompiler /W3 -Xcompiler /WX")
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/W3 -Xcompiler=/WX)
+
+  # Disable warning on Unicode characters
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/wd4819)

  # Disable excess x86 floating point precision that can lead to results being labeled incorrectly
-  string(APPEND NVCC_FLAGS " -Xcompiler /fp:strict")
-
-  # Verbose option
-  if (${CUTLASS_NVCC_VERBOSE})
-      string(APPEND NVCC_FLAGS " -v")
-  endif()
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/fp:strict)
 endif(WIN32)

-# Configure CUDA options
-set(CUTLASS_NVCC_ARCHS             "50;60;61;70"  CACHE STRING "The SM architectures to build code for.")
-set(CUTLASS_NVCC_KEEP              OFF            CACHE BOOL "Keep intermediate files generated by NVCC.")
-
-foreach(ARCH ${CUTLASS_NVCC_ARCHS})
-  string(APPEND NVCC_FLAGS " -gencode arch=compute_${ARCH},code=sm_${ARCH}")
-endforeach()
-
-
-if (CUTLASS_NVCC_KEEP)
-    string(APPEND NVCC_FLAGS " -keep")
+if (${CUTLASS_NVCC_VERBOSE})
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -v)
 endif()

-if (WIN32 AND CUTLASS_NATIVE_CUDA)
-  string(APPEND NVCC_FLAGS_RELEASE " -lineinfo")
+#
+# CUTLASS NAMESPACE 
+#
+set(CUTLASS_NAMESPACE "cutlass" CACHE STRING "Top level namespace of CUTLASS")
+
+set(CUTLASS_NVCC_EMBED_CUBIN ON CACHE BOOL "Embed compiled CUDA kernel binaries into executables.")
+set(CUTLASS_NVCC_EMBED_PTX ON CACHE BOOL "Embed compiled PTX into executables.")
+set(CUTLASS_NVCC_KEEP OFF CACHE BOOL "Keep intermediate files generated by NVCC.")
+set(CUTLASS_ENABLE_F16C OFF CACHE BOOL "Enable F16C x86 extensions in host code.")
+
+#
+# CUTLASS generator cmake configuration
+#
+set(CUTLASS_LIBRARY_OPERATIONS "all" CACHE STRING "Comma delimited list of operation name filters. Default 'all' means all operations are enabled.")
+set(CUTLASS_LIBRARY_KERNELS "" CACHE STRING "Comma delimited list of kernel name filters. If unspecified, only the largest tile size is enabled. If 'all' is specified, all kernels are enabled.")
+set(CUTLASS_LIBRARY_IGNORE_KERNELS "" CACHE STRING "Comma delimited list of kernel names to exclude from build.")
+
+# Test Levels L0, L1, L2
+set(CUTLASS_TEST_LEVEL "0" CACHE STRING "Level of tests to compile.")
+
+set(CUTLASS_TEST_ENABLE_CACHED_RESULTS ON CACHE BOOL "Enable caching and reuse of test results in unit tests")
+
+set_property(CACHE CUTLASS_TEST_LEVEL PROPERTY STRINGS 0 1 2)
+list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_TEST_LEVEL=${CUTLASS_TEST_LEVEL})
+list(APPEND CUTLASS_CUDA_CLANG_FLAGS -DCUTLASS_TEST_LEVEL=${CUTLASS_TEST_LEVEL})
+
+if (CUTLASS_TEST_ENABLE_CACHED_RESULTS)
+  message(STATUS "Enable caching of reference results in conv unit tests")
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1)
+endif()
+
+set(CUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED ON CACHE BOOL "Enable/Disable rigorous conv problem sizes in conv unit tests")
+
+if (CUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED)
+  message(STATUS "Enable rigorous conv problem sizes in conv unit tests")
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1)
+endif()
+
+#
+# CUDA 10.1 introduces "mma" in PTX performing collective matrix multiply operations.
+#
+
+if (CUDA_VERSION VERSION_LESS 10.1)
+  set(CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT OFF)
 else()
-  string(APPEND NVCC_FLAGS " -lineinfo")
+  set(CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT ON)
 endif()

-if (UNIX)
-  string(APPEND NVCC_FLAGS " -Xcompiler -Wconversion")
+# Trace levels for debugging
+set(CUTLASS_DEBUG_TRACE_LEVEL "0" CACHE STRING "Level of debug tracing to perform.")
+list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_DEBUG_TRACE_LEVEL=${CUTLASS_DEBUG_TRACE_LEVEL})
+
+set(CUTLASS_ENABLE_TENSOR_CORE_MMA ${CUTLASS_ENABLE_TENSOR_CORE_MMA_DEFAULT} CACHE BOOL
+  "Enable PTX mma instruction for collective matrix multiply operations.")
+
+#
+# NOTE: running with asan and CUDA requires the following environment variable:
+#
+#  ASAN_OPTIONS=protect_shadow_gap=0:replace_intrin=0:detect_leaks=0
+#
+# without the above environment setting, an error like the following may be generated:
+#
+#  *** Error: Could not detect active GPU device ID [out of memory]
+#  ...
+#  ==9149==ERROR: LeakSanitizer: detected memory leaks
+#  ...
+#
+if(ENABLE_ASAN)  # https://github.com/google/sanitizers/wiki/AddressSanitizer
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS --compiler-options=-fsanitize=address --compiler-options=-fno-omit-frame-pointer)
+  string(APPEND CMAKE_EXE_LINKER_FLAGS " -fsanitize=address")
 endif()

-string(APPEND NVCC_FLAGS_DEBUG " -g")
-string(APPEND NVCC_FLAGS_RELEASE " -O3")
+###################################################################################################
+#
+# Configure CUDA build options
+#
+###################################################################################################

-# define NDEBUG for release mode to disable assertions
-string(APPEND NVCC_FLAGS_RELEASE " -DNDEBUG")
+if(CUTLASS_NVCC_EMBED_PTX)
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-include-ptx=all)
+endif()

-if (CUTLASS_NATIVE_CUDA)
-  set(CMAKE_CUDA_FLAGS "${NVCC_FLAGS}")
-  set(CMAKE_CUDA_FLAGS_DEBUG "${NVCC_FLAGS_DEBUG}")
-  set(CMAKE_CUDA_FLAGS_RELEASE "${NVCC_FLAGS_RELEASE}")
+if (CUTLASS_ENABLE_TENSOR_CORE_MMA)
+  list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1)
+endif()
+
+
+if (NOT MSVC AND CUTLASS_NVCC_KEEP)
+  # MSVC flow handles caching already, but for other generators we handle it here.
+  set(CUTLASS_NVCC_KEEP_DIR ${CMAKE_CURRENT_BINARY_DIR}/tmp CACHE PATH "Location to store NVCC scratch files")
+  file(MAKE_DIRECTORY ${CUTLASS_NVCC_KEEP_DIR})
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS --keep -v) # --keep-dir may not work with nvcc for some directories.
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -save-temps=${CUTLASS_NVCC_KEEP_DIR})
+endif()
+
+if (CUTLASS_ENABLE_F16C AND NOT CMAKE_CROSSCOMPILING)
+  list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_F16C=1)
+  if ((CMAKE_CXX_COMPILER_ID MATCHES "GNU") OR (CMAKE_CXX_COMPILER_ID MATCHES "Clang"))
+    list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=-mf16c)
+  elseif((CMAKE_CXX_COMPILER_ID MATCHES "MSVC"))
+    list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=/arch:AVX2)
+  endif()
+endif()
+
+if (CUTLASS_ENABLE_OPENMP_TESTS)
+  find_package(OpenMP)
+  if(OpenMP_CXX_FOUND)
+    list(APPEND CUTLASS_CUDA_NVCC_FLAGS -Xcompiler=${OpenMP_CXX_FLAGS})
+  else()
+    message(WARNING "CUTLASS_ENABLE_OPENMP_TESTS set but OpenMP not found.")
+  endif()
+endif()
+
+list(APPEND CUTLASS_CUDA_NVCC_FLAGS $<$<BOOL:${UNIX}>:-Xcompiler=-Wconversion>)
+list(APPEND CUTLASS_CUDA_NVCC_FLAGS $<$<BOOL:${UNIX}>:-Xcompiler=-fno-strict-aliasing>)
+
+# Don't leak lineinfo in release builds
+if (NOT CMAKE_BUILD_TYPE MATCHES "Release")
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -gmlt)
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -lineinfo)
+endif()
+
+# Report CUDA build flags
+if (CUDA_COMPILER MATCHES "[Cc]lang")
+  if(CUTLASS_CUDA_CLANG_FLAGS)
+    message(STATUS "Using CLANG flags: ${CUTLASS_CUDA_CLANG_FLAGS}")
+  endif()
 else()
-  set(CUDA_NVCC_FLAGS ${NVCC_FLAGS})
-  set(CUDA_NVCC_FLAGS_DEBUG ${NVCC_FLAGS_DEBUG})
-  set(CUDA_NVCC_FLAGS_RELEASE ${NVCC_FLAGS_RELEASE})
+  if(CUTLASS_CUDA_NVCC_FLAGS)
+    message(STATUS "Using NVCC flags: ${CUTLASS_CUDA_NVCC_FLAGS}")
+  endif()
 endif()

+if(CUDA_COMPILER MATCHES "[Cc]lang")
+  if( NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang" )
+    message(FATAL_ERROR "Clang CUDA compilation requires Clang CXX compilation. Currently CMAKE_CXX_COMPILER is ${CMAKE_CXX_COMPILER_ID}" )
+  endif()
+
+  # There are numerous Clang versions that can work with each CUDA toolkit, and
+  # the version checks are not very useful, so we turn them off and rely on testing to
+  # ensure the various combinations work properly.
+
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-path=${CUDA_TOOLKIT_ROOT_DIR})
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -D__NV_NO_HOST_COMPILER_CHECK=1)
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -Wno-unknown-cuda-version)
+
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -mllvm -pragma-unroll-threshold=100000)
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -mllvm -unroll-threshold=5000)
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -Wno-unused-command-line-argument)
+
+  string(REPLACE "." ";" CUDA_VERSION_PARTS ${CMAKE_CUDA_COMPILER_VERSION})
+  list(GET CUDA_VERSION_PARTS 0 CUDA_VERSION_MAJOR)
+  list(GET CUDA_VERSION_PARTS 1 CUDA_VERSION_MINOR)
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -D__CUDACC_VER_MAJOR__=${CUDA_VERSION_MAJOR} -D__CUDACC_VER_MINOR__=${CUDA_VERSION_MINOR})
+
+
+  # Needed for libcublasLt.so in case it's installed in the same location as libcudart.so:
+  # the dynamic linker can find it if the linker sets RPATH (forced by --disable-new-dtags).
+  # Otherwise the linker uses RUNPATH, which does not propagate to loaded libs.
+  list(APPEND CUTLASS_CUDA_CLANG_FLAGS -Wl,--disable-new-dtags)
+
+  link_libraries(nvidia::cudart)
+  link_libraries(nvidia::cuda_driver)
+endif()
+
+# Support for 128-bit integers if using NVIDIA C++ compiler 
+if (${CMAKE_CXX_COMPILER_ID} MATCHES "PGI" OR ${CMAKE_CXX_COMPILER_ID} MATCHES "NVHPC")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Mint128 ")
+endif()
+
+if (CMAKE_VERSION VERSION_GREATER_EQUAL 3.18)
+  # CMake 3.18 added support for CUDA_ARCHITECTURES target property. We will use this
+  # property for CMake 3.18+, so we request the NEW behavior for correct compatibility.
+  # https://cmake.org/cmake/help/v3.18/policy/CMP0104.html#policy:CMP0104 
+  cmake_policy(SET CMP0104 NEW)
+endif()
+
+function(cutlass_apply_cuda_gencode_flags TARGET)
+  set(options)
+  set(oneValueArgs)
+  set(multiValueArgs SM_ARCHS)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if (__SM_ARCHS)
+    set(ARCHS_ENABLED ${__SM_ARCHS})
+  else()
+    set(ARCHS_ENABLED ${CUTLASS_NVCC_ARCHS_ENABLED})
+  endif()
+
+  set(NVCC_FLAGS)
+  set(CLANG_FLAGS)
+  set(__CMAKE_CUDA_ARCHS)
+  foreach(ARCH ${ARCHS_ENABLED})
+    list(APPEND CLANG_FLAGS --cuda-gpu-arch=sm_${ARCH})
+    set(CODES)
+    if(CUTLASS_NVCC_EMBED_CUBIN)
+      list(APPEND CODES sm_${ARCH})
+      list(APPEND __CMAKE_CUDA_ARCHS ${ARCH}-real)
+    endif()
+    if(CUTLASS_NVCC_EMBED_PTX)
+      list(APPEND CODES compute_${ARCH})
+      list(APPEND __CMAKE_CUDA_ARCHS ${ARCH}-virtual)
+    endif()
+    list(JOIN CODES "," CODES_STR)
+    list(APPEND NVCC_FLAGS -gencode=arch=compute_${ARCH},code=[${CODES_STR}])
+  endforeach()
+ 
+  if (NOT __SM_ARCHS)
+    if (CUDA_COMPILER MATCHES "[Cc]lang")
+      target_compile_options(
+        ${TARGET}
+        PRIVATE
+        $<$<COMPILE_LANGUAGE:CXX>:${CLANG_FLAGS}>
+        )
+    elseif(CMAKE_VERSION VERSION_GREATER_EQUAL 3.18)
+      set_property(TARGET ${TARGET} PROPERTY CUDA_ARCHITECTURES ${__CMAKE_CUDA_ARCHS})
+    else()
+      target_compile_options(
+        ${TARGET}
+        PRIVATE
+        $<$<COMPILE_LANGUAGE:CUDA>:${NVCC_FLAGS}>
+        )
+    endif()
+  else()
+    list(JOIN CLANG_FLAGS " " CLANG_FLAGS_STR)
+    list(JOIN NVCC_FLAGS " " STR_NVCC_FLAGS)
+    if (CUDA_COMPILER MATCHES "[Cc]lang")
+      if(${TARGET} MATCHES ".*\.cpp")
+        set_source_files_properties(${TARGET} PROPERTIES COMPILE_FLAGS ${CLANG_FLAGS_STR})
+      endif()
+    elseif(CMAKE_VERSION VERSION_GREATER_EQUAL 3.18)
+      set_source_files_properties(${TARGET} PROPERTIES CUDA_ARCHITECTURES ${STR_NVCC_FLAGS})
+    else()
+      if(${TARGET} MATCHES ".*\.cu")
+        set_source_files_properties(${TARGET} PROPERTIES COMPILE_FLAGS ${STR_NVCC_FLAGS})
+      endif()
+    endif()
+  endif()
+
+endfunction()
+
+# Cache the flags so they are available when the function below is called anywhere globally. 
+
+set(__CUTLASS_CUDA_FLAGS ${CUTLASS_CUDA_FLAGS} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_FLAGS_RELEASE ${CUTLASS_CUDA_FLAGS_RELEASE} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_FLAGS_RELWITHDEBINFO ${CUTLASS_CUDA_FLAGS_RELWITHDEBINFO} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_FLAGS_DEBUG ${CUTLASS_CUDA_FLAGS_DEBUG} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_CLANG_FLAGS ${CUTLASS_CUDA_CLANG_FLAGS} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_CLANG_FLAGS_RELEASE ${CUTLASS_CUDA_CLANG_FLAGS_RELEASE} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_CLANG_FLAGS_RELWITHDEBINFO ${CUTLASS_CUDA_CLANG_FLAGS_RELWITHDEBINFO} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_CLANG_FLAGS_DEBUG ${CUTLASS_CUDA_CLANG_FLAGS_DEBUG} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_NVCC_FLAGS ${CUTLASS_CUDA_NVCC_FLAGS} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_NVCC_FLAGS_RELEASE ${CUTLASS_CUDA_NVCC_FLAGS_RELEASE} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_NVCC_FLAGS_RELWITHDEBINFO ${CUTLASS_CUDA_NVCC_FLAGS_RELWITHDEBINFO} CACHE INTERNAL "")
+set(__CUTLASS_CUDA_NVCC_FLAGS_DEBUG ${CUTLASS_CUDA_NVCC_FLAGS_DEBUG} CACHE INTERNAL "")
+
+function(cutlass_apply_standard_compile_options TARGET)
+
+  if(CUDA_COMPILER MATCHES "[Cc]lang")
+    set(CUDA_COMPILE_LANGUAGE CXX)
+    set(_FLAGS ${__CUTLASS_CUDA_FLAGS} ${__CUTLASS_CUDA_CLANG_FLAGS})
+    set(_FLAGS_RELEASE ${__CUTLASS_CUDA_FLAGS_RELEASE} ${__CUTLASS_CUDA_CLANG_FLAGS_RELEASE})
+    set(_FLAGS_RELWITHDEBINFO ${__CUTLASS_CUDA_FLAGS_RELWITHDEBINFO} ${__CUTLASS_CUDA_CLANG_FLAGS_RELWITHDEBINFO})
+    set(_FLAGS_DEBUG ${__CUTLASS_CUDA_FLAGS_DEBUG} ${__CUTLASS_CUDA_CLANG_FLAGS_DEBUG})
+  else()
+    set(CUDA_COMPILE_LANGUAGE CUDA)
+    set(_FLAGS ${__CUTLASS_CUDA_FLAGS} ${__CUTLASS_CUDA_NVCC_FLAGS})
+    set(_FLAGS_RELEASE ${__CUTLASS_CUDA_FLAGS_RELEASE} ${__CUTLASS_CUDA_NVCC_FLAGS_RELEASE})
+    set(_FLAGS_RELWITHDEBINFO ${__CUTLASS_CUDA_FLAGS_RELWITHDEBINFO} ${__CUTLASS_CUDA_NVCC_FLAGS_RELWITHDEBINFO})
+    set(_FLAGS_DEBUG ${__CUTLASS_CUDA_FLAGS_DEBUG} ${__CUTLASS_CUDA_NVCC_FLAGS_DEBUG})
+  endif()
+
+  target_link_libraries(${TARGET} PRIVATE CUTLASS)
+
+  target_compile_options(
+    ${TARGET}
+    PRIVATE
+    $<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:${_FLAGS}>
+    $<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:RELEASE>:${_FLAGS_RELEASE}>>
+    $<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:RELWITHDEBINFO>:${_FLAGS_RELWITHDEBINFO}>>
+    $<$<COMPILE_LANGUAGE:${CUDA_COMPILE_LANGUAGE}>:$<$<CONFIG:DEBUG>:${_FLAGS_DEBUG}>>
+    )
+
+endfunction()
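+
+# Usage sketch (the target and source names below are hypothetical):
+#
+#   add_library(my_kernels STATIC my_kernels.cu)
+#   cutlass_apply_standard_compile_options(my_kernels)
+#
+# The call links my_kernels privately against the CUTLASS interface target and
+# attaches the cached flag lists, including the per-configuration variants, as
+# private compile options for CUDA sources (or CXX sources when clang is the
+# CUDA compiler).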
+
 #
 # The following items should eventually be pushed into cutlass/CMakeLists.txt
 #

 # GLOB for CUTLASS header files. Should we use a static list instead?
-file(GLOB CUTLASS_GEMM RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} cutlass/gemm/*.h)
-file(GLOB CUTLASS_UTIL RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} cutlass/util/*.h)
-file(GLOB CUTLASS_DEVICE RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} cutlass/device/*.h)
-file(GLOB CUTLASS_CORE RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} cutlass/*.h)
+file(GLOB_RECURSE CUTLASS_INCLUDE RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} include/cutlass/*.h)
+file(GLOB_RECURSE CUTLASS_CUTLASS RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}/include include/cutlass/*.h)
+file(GLOB_RECURSE CUTLASS_NVRTC RELATIVE ${CMAKE_CURRENT_SOURCE_DIR}/test test/unit/nvrtc/kernel/*.h)

-source_group("cutlass\\gemm" FILES ${CUTLASS_GEMM})
-source_group("cutlass\\util" FILES ${CUTLASS_UTIL})
-source_group("cutlass\\device" FILES ${CUTLASS_DEVICE})
-source_group("cutlass" FILES ${CUTLASS_CORE})
+###################################################################################################
+#
+# Define build targets
+#
+###################################################################################################
+
+source_group(TREE ${CMAKE_CURRENT_SOURCE_DIR}/include REGULAR_EXPRESSION ".*\.h")

 add_library(CUTLASS INTERFACE)
-include_directories("${CMAKE_CURRENT_SOURCE_DIR}")
-target_sources(CUTLASS INTERFACE
-  ${CUTLASS_GEMM}
-  ${CUTLASS_UTIL}
-  ${CUTLASS_DEVICE}
-  ${CUTLASS_CORE}
-)
+add_library(nvidia::cutlass::cutlass ALIAS CUTLASS)
+set_target_properties(CUTLASS PROPERTIES EXPORT_NAME cutlass)

-target_include_directories(CUTLASS INTERFACE ${CMAKE_CURRENT_SOURCE_DIR})
+set(CUTLASS_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/include CACHE PATH "CUTLASS Header Library")
+
+set(CUTLASS_GENERATOR_DIR ${CMAKE_CURRENT_SOURCE_DIR}/tools/library CACHE INTERNAL "Location of generator scripts")
+
+# The following utility directory is needed even if the tools build is disabled, so it exists here.
+set(CUTLASS_TOOLS_UTIL_INCLUDE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/tools/util/include CACHE INTERNAL "")
+
+include_directories(${CUTLASS_INCLUDE_DIR})
+
+target_compile_features(CUTLASS INTERFACE cxx_std_11)
+
+if (NOT CUTLASS_NAMESPACE STREQUAL "cutlass")
+  target_compile_definitions(CUTLASS INTERFACE CUTLASS_NAMESPACE=${CUTLASS_NAMESPACE})
+endif()
+
+if (NOT DEFINED CUTLASS_REVISION)
+
+  find_package(Git QUIET)
+
+  execute_process(
+    COMMAND ${GIT_EXECUTABLE} rev-parse --short HEAD
+    RESULT_VARIABLE CUTLASS_REVISION_RESULT
+    OUTPUT_VARIABLE CUTLASS_REVISION
+    OUTPUT_STRIP_TRAILING_WHITESPACE
+  )
+
+  if (CUTLASS_REVISION_RESULT)
+    message(STATUS "CUTLASS Revision: Unable to detect, Git returned code ${CUTLASS_REVISION_RESULT}.")
+  else()
+    message(STATUS "CUTLASS Revision: ${CUTLASS_REVISION}")
+  endif()
+
+endif()
+
+configure_file(
+  ${CMAKE_CURRENT_SOURCE_DIR}/cmake/version.h.in
+  ${CMAKE_CURRENT_BINARY_DIR}/include/cutlass/version.h
+  @ONLY)
+
+target_include_directories(
+  CUTLASS
+  INTERFACE
+  $<INSTALL_INTERFACE:include>
+  $<BUILD_INTERFACE:${CUTLASS_INCLUDE_DIR}>
+  $<BUILD_INTERFACE:${CMAKE_CURRENT_BINARY_DIR}/include>
+  $<BUILD_INTERFACE:${CUDA_TOOLKIT_ROOT_DIR}/include>
+  $<BUILD_INTERFACE:${cute_SOURCE_DIR}/include>
+  $<BUILD_INTERFACE:${cute_SOURCE_DIR}/examples>
+  )
+
+install(
+  DIRECTORY
+  ${CUTLASS_INCLUDE_DIR}/
+  ${CMAKE_CURRENT_BINARY_DIR}/include/
+  DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+  )
+
+install(
+  TARGETS CUTLASS
+  EXPORT NvidiaCutlass
+  PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+  )
+
+################################################################################

-# Create a custom target to ensure that the CUTLASS sources are visible in an IDE
-add_custom_target(cutlass_ide SOURCES
-  ${CUTLASS_GEMM}
-  ${CUTLASS_UTIL}
-  ${CUTLASS_DEVICE}
-  ${CUTLASS_CORE}
-)
 # Doxygen is available. Generate documentation
 if (DOXYGEN_FOUND)
    # DOT is available. Enable graph generation in the documentation
    if (DOXYGEN_DOT_EXECUTABLE)
-        set(CUTLASS_ENABLE_DOXYGEN_DOT              ON            CACHE BOOL "Use dot to generate graphs in the doxygen documentation.")
+        set(CUTLASS_ENABLE_DOXYGEN_DOT ON CACHE BOOL "Use dot to generate graphs in the doxygen documentation.")
    else()
-        set(CUTLASS_ENABLE_DOXYGEN_DOT              OFF            CACHE BOOL "Use dot to generate graphs in the doxygen documentation." FORCE)
+        set(CUTLASS_ENABLE_DOXYGEN_DOT OFF CACHE BOOL "Use dot to generate graphs in the doxygen documentation." FORCE)
    endif()

    if (CUTLASS_ENABLE_DOXYGEN_DOT)
@ -177,6 +571,270 @@ if (DOXYGEN_FOUND)
    )
 endif()

+if(NOT WIN32)
+  # Add common library search paths so executables and libraries can load and run
+  # without LD_LIBRARY_PATH being set.
+  link_libraries(
+    "-Wl,-rpath,'$ORIGIN'"
+    "-Wl,-rpath,'$ORIGIN/../lib64'"
+    "-Wl,-rpath,'$ORIGIN/../lib'"
+    "-Wl,-rpath,'${CUDA_TOOLKIT_ROOT_DIR}/lib64'"
+    "-Wl,-rpath,'${CUDA_TOOLKIT_ROOT_DIR}/lib'"
+    )
+endif()

-#add_subdirectory(examples/gemm)
-add_subdirectory(tools)
+################################################################################
+
+include(CTest)
+enable_testing()
+if (NOT TARGET test_all)
+  add_custom_target(test_all)
+endif()
+
+set(CUTLASS_INSTALL_TESTS ON CACHE BOOL "Install test executables")
+set(CUTLASS_TEST_EXECUTION_ENVIRONMENT "" CACHE STRING "Environment in which to invoke unit test executables")
+
+set(CMAKE_TEST_INSTALL_PREFIX test CACHE STRING "Test root install location, relative to CMAKE_INSTALL_PREFIX.")
+set(CUTLASS_TEST_INSTALL_PREFIX ${CMAKE_TEST_INSTALL_PREFIX}/cutlass CACHE STRING "Test root install location, relative to CMAKE_INSTALL_PREFIX.")
+set(CUTLASS_TEST_INSTALL_BINDIR ${CUTLASS_TEST_INSTALL_PREFIX}/${CMAKE_INSTALL_BINDIR} CACHE STRING "Test root install location, relative to CMAKE_INSTALL_PREFIX.")
+set(CUTLASS_TEST_INSTALL_LIBDIR ${CUTLASS_TEST_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR} CACHE STRING "Test root install location, relative to CMAKE_INSTALL_PREFIX.")
+
+install(DIRECTORY DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX})
+install(DIRECTORY DESTINATION ${CUTLASS_TEST_INSTALL_BINDIR})
+install(DIRECTORY DESTINATION ${CUTLASS_TEST_INSTALL_LIBDIR})
+install(DIRECTORY DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX}/ctest)
+
+################################################################################
+
+set(CUTLASS_ENABLE_CUBLAS OFF CACHE BOOL "cuBLAS usage for tests")
+set(CUTLASS_ENABLE_CUDNN OFF CACHE BOOL "cuDNN usage for tests")
+
+include(${CMAKE_CURRENT_SOURCE_DIR}/cuBLAS.cmake)
+
+if (CUTLASS_ENABLE_CUBLAS)
+  target_compile_definitions(CUTLASS INTERFACE CUTLASS_ENABLE_CUBLAS=1)
+endif()
+
+include(${CMAKE_CURRENT_SOURCE_DIR}/cuDNN.cmake)
+
+if (CUTLASS_ENABLE_CUDNN)
+  target_compile_definitions(CUTLASS INTERFACE CUTLASS_ENABLE_CUDNN=1)
+endif()
+
+################################################################################
+
+set(CUTLASS_CTEST_TEMPLATE_FILE ${CMAKE_CURRENT_LIST_DIR}/cmake/CTestTestfile.config.cmake)
+set(CUTLASS_CTEST_GENERATED_FILES "" CACHE INTERNAL "")
+
+function(cutlass_add_executable_tests NAME TARGET)
+# 
+# Generates test rules for `make test`, `make test_all`, and `ctest` invoked from either the 
+# <CMAKE_BINARY_DIR> or the <CMAKE_INSTALL_PREFIX>/<CUTLASS_TEST_INSTALL_PREFIX> after installation.
+# 
+# NAME: The base name for the test. Can be run with `make <NAME>` or `ctest -R 'c<NAME>'`.
+# TARGET: The target corresponding to the executable under test.
+# DISABLE_EXECUTABLE_INSTALL_RULE: An option, if given, that disables creating an install rule for TARGET.
+# DISABLE_TESTS: A boolean value that, when true, marks the generated tests as DISABLED in CTest.
+# DEPENDS: A list of targets or files on which this test is dependent.
+# DEPENDEES: A list of targets which should depend on this test.
+# TEST_COMMAND_OPTIONS: A list of variables (i.e. by reference params) which contain command line arguments
+#   to pass to the test executable. A unique test with suffix _0, _1, ... is generated for each set of 
+#   options given. If this option is not used, a single test with no arguments is generated.
+# RESULT_CACHE_FILE: A file to be installed alongside the test executable with pre-computed
+#   test results to speed up test runtime.
+# 
+
+  set(options DISABLE_EXECUTABLE_INSTALL_RULE)
+  set(oneValueArgs DISABLE_TESTS RESULT_CACHE_FILE)
+  set(multiValueArgs DEPENDS DEPENDEES TEST_COMMAND_OPTIONS)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if (NOT DEFINED __DISABLE_TESTS)
+    set(__DISABLE_TESTS OFF)
+  endif()
+
+  if (__RESULT_CACHE_FILE)
+
+    add_custom_command(
+      TARGET ${TARGET}
+      POST_BUILD
+      COMMAND ${CMAKE_COMMAND}
+      ARGS -E copy ${__RESULT_CACHE_FILE} "$<TARGET_FILE_DIR:${TARGET}>"
+      )
+
+  endif()
+
+  if (NOT __DISABLE_EXECUTABLE_INSTALL_RULE AND CUTLASS_INSTALL_TESTS)
+  
+    # file(RELATIVE_PATH CMAKE_CURRENT_BINARY_RELATIVE_DIR ${CMAKE_BINARY_DIR} ${CMAKE_CURRENT_BINARY_DIR})
+  
+    install(
+      TARGETS ${TARGET}
+      RUNTIME DESTINATION ${CUTLASS_TEST_INSTALL_BINDIR}
+      )
+
+    if (__RESULT_CACHE_FILE)
+
+     install(
+       FILES ${__RESULT_CACHE_FILE}
+       DESTINATION ${CUTLASS_TEST_INSTALL_BINDIR}/
+       )
+
+    endif()
+  
+  endif()
+
+  if (NOT __TEST_COMMAND_OPTIONS)
+    set(__TEST_COMMAND_OPTIONS " ")
+  endif()
+
+  list(LENGTH __TEST_COMMAND_OPTIONS CMD_COUNT)
+  set(CMD_IDX 0)
+
+  if (CMD_COUNT GREATER 1)
+    add_custom_target(${NAME} DEPENDS ${TARGET} ${__DEPENDS})
+    foreach(DEPENDEE ${__DEPENDEES})
+      add_dependencies(${DEPENDEE} ${NAME})
+    endforeach()
+  endif()
+
+  foreach(CMD_OPTIONS ${__TEST_COMMAND_OPTIONS})
+
+    if (CMD_COUNT GREATER 1)
+      set(TEST_NAME ${NAME}_${CMD_IDX})
+    else()
+      set(TEST_NAME ${NAME})
+    endif()
+
+    # The following rigmarole is needed to deal with spaces and possible quotes in 
+    # command line arguments. The options are passed "by reference" as the actual
+    # variable names holding the real options. We then expand these in a way that
+    # preserves any quotes. Note, they have to be in this order for it to work for 
+    # all the use cases below.
+
+    set(CMD_OPTIONS ${${CMD_OPTIONS}})
+    list(JOIN CMD_OPTIONS " " TEST_COMMAND_OPTIONS)
+    separate_arguments(CMD_OPTIONS)
+   
+    add_custom_target(
+      ${TEST_NAME}
+      COMMAND
+      ${CUTLASS_TEST_EXECUTION_ENVIRONMENT} $<TARGET_FILE:${TARGET}> ${CMD_OPTIONS}
+      DEPENDS
+      ${TARGET}
+      )
+
+    if (CMD_COUNT GREATER 1)
+      add_dependencies(${NAME} ${TEST_NAME})
+    endif()
+
+    foreach(DEPENDEE ${__DEPENDEES})
+      add_dependencies(${DEPENDEE} ${TEST_NAME})
+    endforeach()
+
+    add_test(
+      NAME c${TEST_NAME}
+      COMMAND ${CUTLASS_TEST_EXECUTION_ENVIRONMENT} $<TARGET_FILE:${TARGET}> ${CMD_OPTIONS}
+      )
+
+    set_tests_properties(c${TEST_NAME} PROPERTIES DISABLED ${__DISABLE_TESTS})
+
+    if (CUTLASS_INSTALL_TESTS)
+
+      # To run the tests from an install package with tests enabled, we need to generate test files
+      # that don't rely on the current directory structure in build.  
+
+      set(TEST_NAME c${TEST_NAME})
+      set(TEST_EXE $<TARGET_FILE_NAME:${TARGET}>)
+      set(TEST_EXE_WORKING_DIRECTORY ./${CMAKE_INSTALL_BINDIR})
+      configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${CMAKE_PROJECT_DIR}${CMAKE_CURRENT_BINARY_DIR}/CTestTestfile.${TEST_NAME}.config.cmake" @ONLY)
+
+      file(GENERATE 
+        OUTPUT "${CMAKE_PROJECT_DIR}${CMAKE_CURRENT_BINARY_DIR}/CTestTestfile.${TEST_NAME}.cmake" 
+        INPUT "${CMAKE_PROJECT_DIR}${CMAKE_CURRENT_BINARY_DIR}/CTestTestfile.${TEST_NAME}.config.cmake"
+        )
+  
+      install(
+        FILES "${CMAKE_PROJECT_DIR}${CMAKE_CURRENT_BINARY_DIR}/CTestTestfile.${TEST_NAME}.cmake"
+        DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX}/ctest/
+        )
+  
+      set(CUTLASS_CTEST_GENERATED_FILES ${CUTLASS_CTEST_GENERATED_FILES};ctest/CTestTestfile.${TEST_NAME}.cmake CACHE INTERNAL "")
+  
+    endif()
+
+    math(EXPR CMD_IDX "${CMD_IDX} + 1")
+
+  endforeach()
+
+endfunction()
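+
+# Usage sketch (the target, test, and variable names below are hypothetical).
+# Each TEST_COMMAND_OPTIONS entry is the name of a variable that holds the
+# arguments, not the arguments themselves:
+#
+#   set(MY_TEST_OPTIONS_SMALL --m=128 --n=128)
+#   set(MY_TEST_OPTIONS_LARGE --m=4096 --n=4096)
+#
+#   cutlass_add_executable_tests(
+#     test_my_feature my_test_exe
+#     DEPENDEES test_all
+#     TEST_COMMAND_OPTIONS MY_TEST_OPTIONS_SMALL MY_TEST_OPTIONS_LARGE
+#     )
+#
+# With two option sets, this should produce the custom targets test_my_feature_0
+# and test_my_feature_1, an umbrella target test_my_feature that depends on both,
+# and the CTest entries ctest_my_feature_0 and ctest_my_feature_1.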
+
+if (CUTLASS_ENABLE_TOOLS)
+  add_subdirectory(tools)
+  if (CUTLASS_ENABLE_PROFILER)
+    add_dependencies(test_all test_profiler)
+  endif()
+endif()
+if (CUTLASS_ENABLE_EXAMPLES)
+  add_subdirectory(examples)
+  add_dependencies(test_all test_examples)
+endif()
+
+if (CUTLASS_ENABLE_TESTS)
+  add_subdirectory(test)
+  add_dependencies(test_all test_unit)
+endif()
+
+if (CUTLASS_INSTALL_TESTS)
+
+  file(MAKE_DIRECTORY "${CMAKE_BINARY_DIR}/cmake")
+
+  file(WRITE "${CMAKE_BINARY_DIR}/cmake/CTestTestfile.cmake" "# Generated File\n")
+  foreach(GENERATED_FILE ${CUTLASS_CTEST_GENERATED_FILES})
+    file(APPEND "${CMAKE_BINARY_DIR}/cmake/CTestTestfile.cmake" "include(${GENERATED_FILE})\n")
+  endforeach()
+
+  install(
+    FILES "${CMAKE_BINARY_DIR}/cmake/CTestTestfile.cmake"
+    DESTINATION "${CUTLASS_TEST_INSTALL_PREFIX}/"
+    )
+
+endif()
+
+#? install(
+#?   FILES ${CMAKE_BINARY_DIR}/CTestTestfile.cmake
+#?   DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX}/
+#?   )
+#? 
+#? install(
+#?   DIRECTORY 
+#?     ${CMAKE_BINARY_DIR}/tools
+#?     ${CMAKE_BINARY_DIR}/test
+#?   DESTINATION ${CUTLASS_TEST_INSTALL_PREFIX}/
+#?   FILES_MATCHING PATTERN "CTestTestfile.cmake"
+#?   )
+
+################################################################################
+
+include(CMakePackageConfigHelpers)
+
+write_basic_package_version_file(
+  ${CMAKE_CURRENT_BINARY_DIR}/NvidiaCutlassConfigVersion.cmake 
+  COMPATIBILITY AnyNewerVersion)
+
+install(
+  FILES 
+    ${CMAKE_CURRENT_SOURCE_DIR}/cmake/NvidiaCutlassConfig.cmake
+    ${CMAKE_CURRENT_BINARY_DIR}/NvidiaCutlassConfigVersion.cmake 
+  DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/NvidiaCutlass/
+  )
+
+install(
+  EXPORT NvidiaCutlass
+  NAMESPACE nvidia::cutlass::
+  DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/NvidiaCutlass/
+  FILE NvidiaCutlassTargets.cmake
+  )
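+
+# Consumer-side sketch (my_app and main.cu are hypothetical; this assumes CUTLASS
+# was installed to a prefix listed in CMAKE_PREFIX_PATH and that the installed
+# NvidiaCutlassConfig.cmake loads NvidiaCutlassTargets.cmake):
+#
+#   find_package(NvidiaCutlass REQUIRED)
+#   add_executable(my_app main.cu)
+#   target_link_libraries(my_app PRIVATE nvidia::cutlass::cutlass)
+#
+# The imported target name follows from the NAMESPACE above combined with the
+# EXPORT_NAME (cutlass) set on the CUTLASS interface target.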
+
+################################################################################
+
+include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/NvidiaCutlassPackageConfig.cmake)
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@ -0,0 +1,83 @@
+![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS")
+
+[README](/README.md#documentation) > **Contributors**
+
+# CUTLASS Developers and Contributors
+
+This is the official list of CUTLASS developers and contributors.
+
+## DEVELOPERS
+Vijay Thakkar<br />
+Pradeep Ramani<br />
+Cris Cecka<br />
+Aniket Shivam<br />
+Jack Kosaian<br />
+Mark Hoemmen<br />
+Honghao Lu<br />
+Ethan Yan<br />
+Haicheng Wu<br />
+Andrew Kerr<br />
+Dustyn Blasig<br />
+Fengqi Qiao<br />
+Duane Merrill<br />
+Yujia Zhai<br />
+Shang Zhang<br />
+Piotr Majcher<br />
+Paul Springer<br />
+Markus Hohnerbach<br />
+Jin Wang<br />
+Aditya Atluri<br />
+
+## CuTe
+Cris Cecka<br />
+Vijay Thakkar<br />
+
+## CUTLASS Product Manager
+Matthew Nicely<br />
+
+## Former CUTLASS Developers
+Manish Gupta<br />
+Naila Farooqui<br />
+David Tanner<br />
+Manikandan Ananth<br />
+Zhaodong Chen<br />
+Chinmay Talegaonkar<br />
+
+## CONTRIBUTORS
+Timothy Costa<br />
+Julien Demouth<br />
+Brian Fahs<br />
+Michael Garland<br />
+Michael Goldfarb<br />
+Mostafa Hagog<br />
+Fei Hu<br />
+Alan Kaatz<br />
+Tina Li<br />
+Timmy Liu<br />
+Wei Liu<br />
+Duane Merrill<br />
+Kevin Siu<br />
+Markus Tavenrath<br />
+John Tran<br />
+Vicki Wang<br />
+Junkai Wu<br />
+Fung Xie<br />
+Albert Xu<br />
+Yang Xu<br />
+Jack Yang<br />
+Scott Yokim<br />
+Xiuxia Zhang<br />
+Nick Zhao<br />
+
+## ACKNOWLEDGEMENTS
+
+Girish Bharambe<br />
+Luke Durant<br />
+Carter Edwards<br />
+Olivier Giroux<br />
+Stephen Jones<br />
+Rishkul Kulkarni<br />
+Bryce Lelbach<br />
+Joel McCormack<br />
+Kyrylo Perelygin<br />
+Sean Treichler<br />
--- a/CUDA.cmake
+++ b/CUDA.cmake
@ -0,0 +1,362 @@
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+if(CUDA_COMPILER MATCHES "[Cc]lang")
+  set(CUTLASS_NATIVE_CUDA_INIT ON)
+elseif(CMAKE_VERSION VERSION_LESS 3.12.4)
+  set(CUTLASS_NATIVE_CUDA_INIT OFF)
+else()
+  set(CUTLASS_NATIVE_CUDA_INIT ON)
+endif()
+
+set(CUTLASS_NATIVE_CUDA ${CUTLASS_NATIVE_CUDA_INIT} CACHE BOOL "Utilize the CMake native CUDA flow")
+
+if(NOT DEFINED ENV{CUDACXX} AND NOT DEFINED ENV{CUDA_BIN_PATH} AND DEFINED ENV{CUDA_PATH})
+  # For backward compatibility, allow use of CUDA_PATH.
+  set(ENV{CUDACXX} $ENV{CUDA_PATH}/bin/nvcc)
+endif()
+
+if(CUTLASS_NATIVE_CUDA)
+
+  enable_language(CUDA)
+
+  if(NOT CUDA_VERSION)
+    set(CUDA_VERSION ${CMAKE_CUDA_COMPILER_VERSION})
+  endif()
+  if(NOT CUDA_TOOLKIT_ROOT_DIR)
+    get_filename_component(CUDA_TOOLKIT_ROOT_DIR "${CMAKE_CUDA_COMPILER}/../.." ABSOLUTE)
+  endif()
+
+else()
+
+  find_package(CUDA REQUIRED)
+  # We work around missing variables with the native flow by also finding the CUDA toolkit the old way.
+
+  if(NOT CMAKE_CUDA_COMPILER_VERSION)
+    set(CMAKE_CUDA_COMPILER_VERSION ${CUDA_VERSION})
+  endif()
+
+endif()
+
+if (CUDA_VERSION VERSION_LESS 9.2)
+  message(FATAL_ERROR "CUDA 9.2+ Required, Found ${CUDA_VERSION}.")
+endif()
+if(NOT CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "[Cc]lang")
+  set(CMAKE_CUDA_COMPILER ${CUDA_TOOLKIT_ROOT_DIR}/bin/nvcc)
+  message(STATUS "CUDA Compiler: ${CMAKE_CUDA_COMPILER}")
+endif()
+
+find_library(
+  CUDART_LIBRARY cudart
+  PATHS
+  ${CUDA_TOOLKIT_ROOT_DIR}
+  PATH_SUFFIXES
+  lib/x64
+  lib64
+  lib
+  NO_DEFAULT_PATH
+  # We aren't going to search any system paths. We want to find the runtime 
+  # in the CUDA toolkit we're building against.
+  )
+
+if(NOT TARGET cudart AND CUDART_LIBRARY)
+
+  message(STATUS "CUDART: ${CUDART_LIBRARY}")
+
+  if(WIN32)
+    add_library(cudart STATIC IMPORTED GLOBAL)
+    # Even though we're linking against a .dll, in Windows you statically link against
+    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
+    # from the PATH search.
+  else()
+    add_library(cudart SHARED IMPORTED GLOBAL)
+  endif()  
+
+  add_library(nvidia::cudart ALIAS cudart)
+  
+  set_property(
+    TARGET cudart
+    PROPERTY IMPORTED_LOCATION
+    ${CUDART_LIBRARY}
+    )
+
+elseif(TARGET cudart)
+
+  message(STATUS "CUDART: Already Found")
+
+else()
+
+  message(STATUS "CUDART: Not Found")
+
+endif()
+
+find_library(
+  CUDA_DRIVER_LIBRARY cuda
+  PATHS
+  ${CUDA_TOOLKIT_ROOT_DIR}
+  PATH_SUFFIXES
+  lib/x64
+  lib64
+  lib
+  lib64/stubs
+  lib/stubs
+  NO_DEFAULT_PATH
+  # We aren't going to search any system paths. We want to find the driver library
+  # (or its stub) in the CUDA toolkit we're building against.
+  )
+
+if(NOT TARGET cuda_driver AND CUDA_DRIVER_LIBRARY)
+
+  message(STATUS "CUDA Driver: ${CUDA_DRIVER_LIBRARY}")
+
+  if(WIN32)
+    add_library(cuda_driver STATIC IMPORTED GLOBAL)
+    # Even though we're linking against a .dll, in Windows you statically link against
+    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
+    # from the PATH search.
+  else()
+    add_library(cuda_driver SHARED IMPORTED GLOBAL)
+  endif()  
+
+  add_library(nvidia::cuda_driver ALIAS cuda_driver)
+  
+  set_property(
+    TARGET cuda_driver
+    PROPERTY IMPORTED_LOCATION
+    ${CUDA_DRIVER_LIBRARY}
+    )
+
+elseif(TARGET cuda_driver)
+
+  message(STATUS "CUDA Driver: Already Found")
+
+else()
+
+  message(STATUS "CUDA Driver: Not Found")
+
+endif()
+
+find_library(
+  NVRTC_LIBRARY nvrtc
+  PATHS
+  ${CUDA_TOOLKIT_ROOT_DIR}
+  PATH_SUFFIXES
+  lib/x64
+  lib64
+  lib
+  NO_DEFAULT_PATH
+  # We aren't going to search any system paths. We want to find the NVRTC library
+  # in the CUDA toolkit we're building against.
+  )
+
+if(NOT TARGET nvrtc AND NVRTC_LIBRARY)
+
+  message(STATUS "NVRTC: ${NVRTC_LIBRARY}")
+
+  if(WIN32)
+    add_library(nvrtc STATIC IMPORTED GLOBAL)
+    # Even though we're linking against a .dll, in Windows you statically link against
+    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
+    # from the PATH search.
+  else()
+    add_library(nvrtc SHARED IMPORTED GLOBAL)
+  endif()  
+  
+  add_library(nvidia::nvrtc ALIAS nvrtc)
+  
+  set_property(
+    TARGET nvrtc
+    PROPERTY IMPORTED_LOCATION
+    ${NVRTC_LIBRARY}
+    )
+
+elseif(TARGET nvrtc)
+
+  message(STATUS "NVRTC: Already Found")
+
+else()
+
+  message(STATUS "NVRTC: Not Found")
+
+endif()
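+
+# Usage sketch (my_app is a hypothetical target). The aliases only exist when
+# the corresponding library was found, so guarding the link step is a
+# reasonable precaution:
+#
+#   if(TARGET nvidia::cudart)
+#     target_link_libraries(my_app PRIVATE nvidia::cudart)
+#   endif()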
+
+include_directories(SYSTEM ${CUDA_INCLUDE_DIRS})
+# Some platforms (e.g. Visual Studio) don't add the CUDA include directories to the system include
+# paths by default, so we add it explicitly here.
+
+function(cutlass_correct_source_file_language_property)
+  if(CUDA_COMPILER MATCHES "[Cc]lang")
+    foreach(File ${ARGN})
+      if(File MATCHES ".*\.cu$")
+        set_source_files_properties(${File} PROPERTIES LANGUAGE CXX)
+      endif()
+    endforeach()
+  endif()
+endfunction()
+
+if (MSVC OR CUTLASS_LIBRARY_KERNELS MATCHES "all")
+  set(CUTLASS_UNITY_BUILD_ENABLED_INIT ON)
+else()
+  set(CUTLASS_UNITY_BUILD_ENABLED_INIT OFF)
+endif()
+
+set(CUTLASS_UNITY_BUILD_ENABLED ${CUTLASS_UNITY_BUILD_ENABLED_INIT} CACHE BOOL "Enable combined source compilation")
+set(CUTLASS_UNITY_BUILD_BATCH_SIZE 16 CACHE STRING "Batch size for unified source files")
+
+function(cutlass_unify_source_files TARGET_ARGS_VAR)
+
+  set(options)
+  set(oneValueArgs BATCH_SOURCES BATCH_SIZE)
+  set(multiValueArgs)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if (NOT DEFINED TARGET_ARGS_VAR)
+    message(FATAL_ERROR "TARGET_ARGS_VAR parameter is required")
+  endif()
+
+  if (__BATCH_SOURCES AND NOT DEFINED __BATCH_SIZE)
+    set(__BATCH_SIZE ${CUTLASS_UNITY_BUILD_BATCH_SIZE})
+  endif()
+
+  if (CUTLASS_UNITY_BUILD_ENABLED AND DEFINED __BATCH_SIZE AND __BATCH_SIZE GREATER 1)
+
+    set(CUDA_FILE_ARGS)
+    set(TARGET_SOURCE_ARGS)
+    
+    foreach(ARG ${__UNPARSED_ARGUMENTS})
+      if(${ARG} MATCHES ".*\.cu$")
+        list(APPEND CUDA_FILE_ARGS ${ARG})
+      else()
+        list(APPEND TARGET_SOURCE_ARGS ${ARG})
+      endif()
+    endforeach()
+    
+    list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
+    while(NUM_CUDA_FILE_ARGS GREATER 0)
+      list(SUBLIST CUDA_FILE_ARGS 0 ${__BATCH_SIZE} CUDA_FILE_BATCH)
+      string(SHA256 CUDA_FILE_BATCH_HASH "${CUDA_FILE_BATCH}")
+      string(SUBSTRING ${CUDA_FILE_BATCH_HASH} 0 12 CUDA_FILE_BATCH_HASH)
+      set(BATCH_FILE ${CMAKE_CURRENT_BINARY_DIR}/${NAME}.unity.${CUDA_FILE_BATCH_HASH}.cu)
+      message(STATUS "Generating ${BATCH_FILE}")
+      file(WRITE ${BATCH_FILE} "// Unity File - Auto Generated!\n")
+      foreach(CUDA_FILE ${CUDA_FILE_BATCH})
+        get_filename_component(CUDA_FILE_ABS_PATH ${CUDA_FILE} ABSOLUTE)
+        file(APPEND ${BATCH_FILE} "#include \"${CUDA_FILE_ABS_PATH}\"\n")
+      endforeach()
+      list(APPEND TARGET_SOURCE_ARGS ${BATCH_FILE})
+      if (NUM_CUDA_FILE_ARGS LESS_EQUAL __BATCH_SIZE)
+        break()
+      endif()
+      list(SUBLIST CUDA_FILE_ARGS ${__BATCH_SIZE} -1 CUDA_FILE_ARGS)
+      list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
+    endwhile()
+
+  else()
+
+    set(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
+
+  endif()
+
+  set(${TARGET_ARGS_VAR} ${TARGET_SOURCE_ARGS} PARENT_SCOPE)
+
+endfunction()
+function(cutlass_add_library NAME)
+
+  set(options SKIP_GENCODE_FLAGS)
+  set(oneValueArgs EXPORT_NAME)
+  set(multiValueArgs)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
+
+  if(CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "[Cc]lang")
+    cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
+    add_library(${NAME} ${TARGET_SOURCE_ARGS})
+  else()
+    set(CUDA_LINK_LIBRARIES_KEYWORD PRIVATE)
+    cuda_add_library(${NAME} ${TARGET_SOURCE_ARGS})
+  endif()
+
+  cutlass_apply_standard_compile_options(${NAME})
+  if (NOT __SKIP_GENCODE_FLAGS)
+    cutlass_apply_cuda_gencode_flags(${NAME})
+  endif()
+
+  target_compile_features(
+   ${NAME}
+   INTERFACE
+   cxx_std_11
+   )
+
+  if(__EXPORT_NAME)
+    add_library(nvidia::cutlass::${__EXPORT_NAME} ALIAS ${NAME})
+    set_target_properties(${NAME} PROPERTIES EXPORT_NAME ${__EXPORT_NAME})
+  endif()
+
+endfunction()
+
+function(cutlass_add_executable NAME)
+
+  set(options)
+  set(oneValueArgs)
+  set(multiValueArgs)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
+
+  if(CUTLASS_NATIVE_CUDA OR CUDA_COMPILER MATCHES "[Cc]lang")
+    cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
+    add_executable(${NAME} ${TARGET_SOURCE_ARGS})
+  else()
+    set(CUDA_LINK_LIBRARIES_KEYWORD PRIVATE)
+    cuda_add_executable(${NAME} ${TARGET_SOURCE_ARGS})
+  endif()
+
+  cutlass_apply_standard_compile_options(${NAME})
+  cutlass_apply_cuda_gencode_flags(${NAME})
+
+  target_compile_features(
+   ${NAME}
+   INTERFACE
+   cxx_std_11
+   )
+
+endfunction()
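+
+# Usage sketch (target and file names are hypothetical). Arguments that the
+# wrappers do not recognize are forwarded, so a library type keyword can sit
+# next to the sources:
+#
+#   cutlass_add_library(my_kernels STATIC EXPORT_NAME my_kernels
+#     BATCH_SOURCES ON BATCH_SIZE 4
+#     kernel_a.cu kernel_b.cu kernel_c.cu
+#     )
+#
+#   cutlass_add_executable(my_tool my_tool.cu)
+#
+# When CUTLASS_UNITY_BUILD_ENABLED is ON, the three .cu files above would be
+# combined into one generated unity source, since they fit in a single batch
+# of 4. EXPORT_NAME additionally creates the alias nvidia::cutlass::my_kernels.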
+
+function(cutlass_target_sources NAME)
+
+  set(options)
+  set(oneValueArgs)
+  set(multiValueArgs)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
+  cutlass_correct_source_file_language_property(${TARGET_SOURCE_ARGS})
+  target_sources(${NAME} ${TARGET_SOURCE_ARGS})
+
+endfunction()
--- a/24
+++ b/24
@ -32,7 +32,7 @@ DOXYFILE_ENCODING      = UTF-8
 # title of most generated pages and in a few other places.
 # The default value is: My Project.

-PROJECT_NAME           = "Cutlass"
+PROJECT_NAME           = "CUTLASS"

 # The PROJECT_NUMBER tag can be used to enter a project or revision number. This
 # could be handy for archiving the generated documentation or if some version
@ -51,14 +51,14 @@ PROJECT_BRIEF          = "CUDA Templates for Linear Algebra Subroutines and Solv
 # and the maximum width should not exceed 200 pixels. Doxygen will copy the logo
 # to the output directory.

-PROJECT_LOGO           =
+PROJECT_LOGO           = media/images/cutlass-logo-small.png

 # The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path
 # into which the generated documentation will be written. If a relative path is
 # entered, it will be relative to the location where doxygen was started. If
 # left blank the current directory will be used.

-OUTPUT_DIRECTORY       = docs
+OUTPUT_DIRECTORY       = doxygen

 # If the CREATE_SUBDIRS tag is set to YES, then doxygen will create 4096 sub-
 # directories (in 2 levels) under the output directory of each output format and
@ -206,7 +206,7 @@ SEPARATE_MEMBER_PAGES  = NO
 # uses this value to replace tabs by spaces in code fragments.
 # Minimum value: 1, maximum value: 16, default value: 4.

-TAB_SIZE               = 4
+TAB_SIZE               = 2

 # This tag can be used to specify a number of aliases that act as commands in
 # the documentation. An alias has the form:
@ -297,7 +297,7 @@ AUTOLINK_SUPPORT       = YES
 # diagrams that involve STL classes more complete and accurate.
 # The default value is: NO.

-BUILTIN_STL_SUPPORT    = NO
+BUILTIN_STL_SUPPORT    = YES

 # If you use Microsoft's C++/CLI language, you should set this option to YES to
 # enable parsing support.
@ -734,7 +734,9 @@ WARN_LOGFILE           =
 # spaces.
 # Note: If this tag is empty the current directory is searched.

-INPUT                  = cutlass
+INPUT                  = include/cutlass tools/util/include/cutlass/ tools/library/include/cutlass/
+
+INPUT += media/docs/doxygen_mainpage.md

 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
@ -870,7 +872,7 @@ FILTER_SOURCE_PATTERNS =
 # (index.html). This can be useful if you have a project on for instance GitHub
 # and want to reuse the introduction page also for the doxygen output.

-USE_MDFILE_AS_MAINPAGE =
+USE_MDFILE_AS_MAINPAGE = media/docs/doxygen_mainpage.md

 #---------------------------------------------------------------------------
 # Configuration options related to source browsing
@ -999,7 +1001,7 @@ GENERATE_HTML          = YES
 # The default directory is: html.
 # This tag requires that the tag GENERATE_HTML is set to YES.

-HTML_OUTPUT            = generated-html
+HTML_OUTPUT            = 

 # The HTML_FILE_EXTENSION tag can be used to specify the file extension for each
 # generated HTML page (for example: .htm, .php, .asp).
@ -1080,7 +1082,7 @@ HTML_EXTRA_FILES       =
 # Minimum value: 0, maximum value: 359, default value: 220.
 # This tag requires that the tag GENERATE_HTML is set to YES.

-HTML_COLORSTYLE_HUE    = 82
+HTML_COLORSTYLE_HUE    = 100

 # The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors
 # in the HTML output. For a value of 0 the output will use grayscales only. A
@ -1088,7 +1090,7 @@ HTML_COLORSTYLE_HUE    = 82
 # Minimum value: 0, maximum value: 255, default value: 100.
 # This tag requires that the tag GENERATE_HTML is set to YES.

-HTML_COLORSTYLE_SAT    = 100
+HTML_COLORSTYLE_SAT    = 50

 # The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the
 # luminance component of the colors in the HTML output. Values below 100
@ -1107,7 +1109,7 @@ HTML_COLORSTYLE_GAMMA  = 80
 # The default value is: YES.
 # This tag requires that the tag GENERATE_HTML is set to YES.

-HTML_TIMESTAMP         = YES
+HTML_TIMESTAMP         = NO

 # If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML
 # documentation will contain sections that can be hidden and shown after the
--- a/LICENSE.TXT
+++ b/LICENSE.TXT
@ -1,23 +0,0 @@
-Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-   *  Redistributions of source code must retain the above copyright
-      notice, this list of conditions and the following disclaimer.
-   *  Redistributions in binary form must reproduce the above copyright
-      notice, this list of conditions and the following disclaimer in the
-      documentation and/or other materials provided with the distribution.
-   *  Neither the name of the NVIDIA CORPORATION nor the
-      names of its contributors may be used to endorse or promote products
-      derived from this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/LICENSE.txt
+++ b/LICENSE.txt
@ -0,0 +1,27 @@
+Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: BSD-3-Clause
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+contributors may be used to endorse or promote products derived from
+this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/PUBLICATIONS.md
+++ b/PUBLICATIONS.md
@ -0,0 +1,30 @@
+# Publications Using Cutlass
+
+## 2023
+
+- ["Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU"](https://arxiv.org/abs/2301.03598). Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. _arXiv_, January 2023.
+
+## 2022
+
+- ["GPU Load Balancing"](https://arxiv.org/abs/2212.08964). Muhammad Osama. _Doctoral dissertation, University of California, Davis_, December 2022.
+
+- ["Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance"](https://arxiv.org/abs/2110.15238). Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu. _Proceedings of the 5th MLSys Conference_, August 2022.
+
+- ["Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance"](https://arxiv.org/abs/2203.03341). Hiroyuki Ootomo, Rio Yokota. _International Journal of High Performance Computing_, March 2022.
+
+- ["Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads"](https://arxiv.org/abs/2105.05720). Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Sarikivi. _Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, February 2022.
+
+## 2021
+
+- ["Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs"](https://dl.acm.org/doi/abs/10.1145/3458817.3476184). Jack Kosaian, K. V. Rashmi. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2021.
+
+- ["Real-time Neural Radiance Caching for Path Tracing"](https://d1qx31qr3h6wln.cloudfront.net/publications/paper_4.pdf). Thomas Muller, Fabrice Rousselle, Jan Novak, Alex Keller. _ACM Trans. Graph._, August 2021.
+
+## 2020
+
+- ["Scalable Knowledge Graph Analytics at 136 Petaflop/s"](https://www.computer.org/csdl/proceedings-article/sc/2020/999800a061/1oeORDgCM0g). Ramakrishnan Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar,  Robert Patton, Richard Vuduc, Thomas Potok. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2020.
+
+- ["Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity"](https://arxiv.org/abs/2008.13006). Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2020.
+
+- ["Strassen's Algorithm Reloaded on GPUs"](https://dl.acm.org/doi/10.1145/3372419). Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn. _ACM Transactions on Mathematical Software_, March 2020.
--- a/README.md
+++ b/README.md
@ -1,219 +1,561 @@
 ![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")

-# CUTLASS 1.0
+# CUTLASS 3.0

-CUTLASS 1.0 is a collection of CUDA C++ template abstractions for implementing
-high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
-It incorporates strategies for hierarchical decomposition and data movement similar
-to those used to implement cuBLAS.  CUTLASS decomposes these "moving parts" into
-reusable, modular software components abstracted by C++ template classes.  These
-thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
-and tuned via custom tiling sizes, data types, and other algorithmic policy. The
-resulting flexibility simplifies their use as building blocks within custom kernels
-and applications.
+_CUTLASS 3.0 - January 2023_
+
+CUTLASS is a collection of CUDA C++ template abstractions for implementing
+high-performance matrix-matrix multiplication (GEMM) and related computations at all levels 
+and scales within CUDA. It incorporates strategies for hierarchical decomposition and 
+data movement similar to those used to implement cuBLAS and cuDNN.  CUTLASS decomposes 
+these "moving parts" into reusable, modular software components abstracted by C++ template 
+classes.  Primitives for different levels of a conceptual parallelization hierarchy
+can be specialized and tuned via custom tiling sizes, data types,
+and other algorithmic policy. The resulting flexibility simplifies their use
+as building blocks within custom kernels and applications.

 To support a wide variety of applications, CUTLASS provides extensive support for
 mixed-precision computations, providing specialized data-movement and
-multiply-accumulate abstractions for 8-bit integer, half-precision floating
-point (FP16), single-precision floating point (FP32), and double-precision floating
-point (FP64) types.  Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting
-the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture
-and beyond.
+multiply-accumulate abstractions for half-precision floating
+point (FP16), BFloat16 (BF16), Tensor Float 32 (TF32),
+single-precision floating point (FP32),
+[FP32 emulation via tensor core instruction](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
+double-precision floating
+point (FP64) types, integer data types (4b and 8b), and binary data types (1b).
+CUTLASS demonstrates warp-synchronous matrix multiply operations
+targeting the programmable, high-throughput _Tensor Cores_ implemented by
+NVIDIA's Volta, Turing, Ampere, and Hopper architectures.

-CUTLASS 1.0 has changed substantially from our preview release described in
-the [CUTLASS Parallel For All](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda)
-post. We have decomposed the structure of the GEMM computation into deeper, structured
-primitives for loading data, computing predicate masks, streaming data at each level of
-the GEMM hierarchy, and updating the output matrix.
+See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.

-CUTLASS 1.0 is described in the [Doxygen documentation](https://github.com/NVIDIA/cutlass/docs)
-and our talk at the [GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
+See the [functionality listing](/media/docs/functionality.md) for the list of operations
+supported at each level of the execution model hierarchy.
+
+CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tensors of threads and data.
+CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly package the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations.
+
+The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
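The layout idea described above can be illustrated with a small conceptual sketch. It is not CuTe's API: it is plain Python, covers only flat (non-hierarchical) layouts, and the names `layout_index` and `layout_table` are hypothetical. It assumes a layout is a pair of shape and stride tuples, and that a coordinate maps to a linear offset through the inner product of coordinate and stride.

```python
from itertools import product


def layout_index(coord, stride):
    """Map a logical coordinate to a linear offset (inner product with the strides)."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """List every coordinate of `shape` together with the offset the layout assigns to it."""
    coords = product(*(range(extent) for extent in shape))
    return {coord: layout_index(coord, stride) for coord in coords}


# The same 2x3 logical shape with two different stride choices.
col_major = layout_table((2, 3), (1, 2))  # rows are contiguous in memory
row_major = layout_table((2, 3), (3, 1))  # columns are contiguous in memory

print(col_major[(1, 1)])  # 1*1 + 1*2 = 3
print(row_major[(1, 1)])  # 1*3 + 1*1 = 4
```

Changing only the strides changes where each logical element lives, while code that indexes by coordinate stays the same. That separation is the bookkeeping the paragraph above attributes to `Layout`.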
+
+CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates.  This greatly simplifies the design
+and improves code composability and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](/media/docs/cute/00_quickstart.md).
+
+In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM, thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
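To make the convolution-to-GEMM mapping concrete, here is a hedged sketch of the GEMM problem size implied by a forward-propagation (fprop) convolution. It assumes the usual naming (N, H, W, C for the activation tensor; K, R, S for the filter; P, Q for the output extents) and the common mapping GEMM M = N·P·Q, GEMM N = K, GEMM K = C·R·S. The function names are hypothetical and are not part of CUTLASS.

```python
def conv_output_extent(in_extent, filter_extent, pad, stride, dilation=1):
    """Output size along one spatial dimension of a convolution."""
    return (in_extent + 2 * pad - dilation * (filter_extent - 1) - 1) // stride + 1


def fprop_gemm_shape(N, H, W, C, K, R, S, pad=0, stride=1, dilation=1):
    """Return (M, N, K) of the GEMM that an fprop convolution corresponds to."""
    P = conv_output_extent(H, R, pad, stride, dilation)
    Q = conv_output_extent(W, S, pad, stride, dilation)
    return (N * P * Q, K, C * R * S)


# A 3x3 convolution over a 56x56x64 activation with 64 filters and padding 1.
print(fprop_gemm_shape(N=1, H=56, W=56, C=64, K=64, R=3, S=3, pad=1))  # (3136, 64, 576)
```

The GEMM operands are never materialized in memory under this formulation; the sketch only shows the problem size the GEMM pipeline sees.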
+
+# What's New in CUTLASS 3.0
+
+CUTLASS 3.0, as the next major version of the CUTLASS API, brings with it CuTe, a new programming model and backend designed for massively parallel heterogeneous agents. Using CuTe, CUTLASS 3.0 provides implementations of GEMM kernels for the NVIDIA Hopper architecture.
+
+- [CuTe-based layouts and layout algebra](/media/docs/cute/00_quickstart.md)
+- [A new GEMM template API](/media/docs/gemm_api_3x.md) that eschews the architecture-centric hierarchy of 2.x in favour of a new conceptual framing. Read more in the [3.0 design documentation](/media/docs/cutlass_3x_design.md).
+- Support for 4th generation Hopper Tensor Core instructions (WGMMA) through CuTe.
+- Support for Hopper asynchronous Tensor Memory Accelerator (TMA) instructions and associated transaction barriers through CuTe.
+- New warp-specialized GEMM kernels targeting Hopper TMA + WGMMA for speed-of-light GEMMs.
+- New warp-specialized persistent GEMM kernels targeting Hopper TMA + WGMMA.
+- Support for CUDA Threadblock Clusters and programmatic TMA multicast for greater execution and data locality.
+- A new way to instantiate default GEMM kernels using `CollectiveBuilder`s that supersede the 2.x `DefaultXConfiguration` types in favour of a metaprogramming-based kernel generator. See [example 49](/examples/49_hopper_gemm_schedules_with_collective_builder/49_hopper_gemm_schedules_with_collective_builder.cu).
+- Extensions to the CUTLASS library and profiler to support CUTLASS 3.0 Hopper kernels, and a new format
+for kernel procedural names.
+- *Announcement*: CUTLASS plans to rename the GitHub branch `master` to `main` with a future release.
+
+## New architecture, compiler, and CUDA Toolkit requirements
+
+Minimum requirements:
+
+- Architecture: Volta
+- Compiler: Must support at least C++17
+- CUDA Toolkit version: 11.4
+
+CUTLASS 3.0 *removes support* for the following:
+
+- Maxwell and Pascal GPU architectures
+- Ubuntu 16.04
+- CUDA 10.2
+- C++ language versions less than 17.
+
+**See the [CHANGELOG](CHANGELOG.md) for a detailed listing of releases and updates.**

 # Performance

-<p align="center"><img src=/media/images/cutlass-performance-plot.png></p>
+<p align="center"><img src=media/images/cutlass-3.0-gemm-peak-performance.png></p>

 CUTLASS primitives are very efficient.  When used to construct device-wide GEMM kernels,
-they exhibit performance comparable to cuBLAS for scalar GEMM
+they exhibit peak performance comparable to cuBLAS for scalar GEMM
 computations. The above figure shows CUTLASS performance relative to cuBLAS
-for large matrix dimensions (M=10240, N=K=4096) running on an NVIDIA Titan V GPU
-when compiled with CUDA 9.2.
+for large matrix dimensions on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture), 
+an [NVIDIA L40](https://www.nvidia.com/en-us/data-center/l40/) (NVIDIA Ada architecture),
+an [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) (NVIDIA Ampere architecture),  
+and an [NVIDIA A40](https://www.nvidia.com/en-us/data-center/a40/)  (NVIDIA Ampere architecture).
+CUTLASS 3.0 was compiled with the [CUDA 12.0 Toolkit](https://developer.nvidia.com/cuda-downloads). 
+Tensor Core operations are implemented using CUDA's 
+[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).
+
+<p align="center"><img src=media/images/cutlass-2.9-implicit-gemm-performance.png></p>
+
+When using CUTLASS building blocks to construct device-wide implicit gemm (Fprop, Dgrad, and Wgrad)
+kernels, CUTLASS performance is also comparable to cuDNN when running Resnet-50 layers on an [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/)
+as shown in the above figure.  Tensor Core operations are still implemented using CUDA's
+[mma instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma).

 # Compatibility

-CUTLASS requires CUDA 9 and performs best with [CUDA 9.2 Toolkit](ttps://developer.nvidia.com/cuda-toolkit) or later.
+CUTLASS requires a C++17 host compiler and 
+performs best when built with the [**CUDA 12.0 Toolkit**](https://developer.nvidia.com/cuda-toolkit).
+It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, and CUDA 11.8.
+
+## Operating Systems
+We have tested the following environments.

 |**Operating System** | **Compiler** |
 |-----------------|----------|
-| Windows 10      | Microsoft Visual Studio 2015|
-|                 | Microsoft Visual Studio 2017|
-| Ubuntu 14.04 | GCC 4.8.2 |
-| Ubuntu 16.04 | GCC 5.4.0 |
+| Ubuntu 18.04 | GCC 7.5.0  |
+| Ubuntu 20.04 | GCC 10.3.0 |
+| Ubuntu 22.04 | GCC 11.2.0 |

+Note: We plan to add Windows (MSVC) & Clang compiler support soon.

-CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on
-any Maxwell-, Pascal-, or Volta-architecture NVIDIA GPU.
+## Hardware
+CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta, Turing, Ampere, Ada, and Hopper architecture based NVIDIA GPUs.

-|**GPU**|
-|---|
-|NVIDIA GeForce 1080|
-|NVIDIA TitanXP|
-|NVIDIA Tesla P100|
-|NVIDIA Tesla V100|
-|NVIDIA TitanV|
+|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
+|---|---|---|
+|NVIDIA V100 Tensor Core GPU            |7.0|11.4|
+|NVIDIA TitanV                          |7.0|11.4|
+|NVIDIA GeForce RTX 2080 TI, 2080, 2070 |7.5|11.4|
+|NVIDIA T4                              |7.5|11.4|
+|NVIDIA A100 Tensor Core GPU            |8.0|11.4|
+|NVIDIA A10                             |8.6|11.4|
+|NVIDIA GeForce RTX 3090                |8.6|11.4|
+|NVIDIA GeForce RTX 4090                |8.9|11.8|
+|NVIDIA L40                             |8.9|11.8|
+|NVIDIA H100 Tensor Core GPU            |9.0|11.8|

+## Target Architecture
+
+In general, PTX code generated for one target architecture can be run on future architectures (i.e., it is forward compatible).  However, CUDA 12.0 introduces the concept of "architecture-accelerated features" whose PTX does not have forward compatibility guarantees. Several Hopper PTX instructions fall under this category of architecture-accelerated features, and thus require a `sm_90a` target architecture (note the "a" appended). For more details on this and other architecture-accelerated instructions, please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
+
+The target architecture information is passed on to CUTLASS via the cmake flag `CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100, users are required to build CUTLASS with `90a` as the target architecture. If a user accidentally builds a kernel that uses SM90a features (e.g. Hopper Tensor Core instructions) with the SM90 target (note the lack of "a"), using either CUDA Toolkit 12.0 or 11.8, the kernel is expected to fail with a runtime error.
+
+```
+cmake .. -DCUTLASS_NVCC_ARCHS="90a" 
+```
+
+Please refer to the [functionality documentation](media/docs/functionality.md) for details on which kernels require which target architectures.
+
+# Documentation
+
+CUTLASS is described in the following documents and the accompanying
+[Doxygen documentation](https://nvidia.github.io/cutlass).
+
+- [Quick Start Guide](/media/docs/quickstart.md) - build and run CUTLASS
+- [Functionality](/media/docs/functionality.md) - summarizes functionality available in CUTLASS
+- [Efficient GEMM in CUDA](media/docs/efficient_gemm.md) - describes how GEMM kernels may be implemented efficiently in CUDA
+- [CUTLASS 3.x Design](media/docs/cutlass_3x_design.md) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
+- [GEMM API 3.x](media/docs/gemm_api_3x.md) - describes the CUTLASS 3.x GEMM model and C++ template concepts
+- [GEMM API 2.x](media/docs/gemm_api.md) - describes the CUTLASS 2.x GEMM model and C++ template concepts
+- [Implicit GEMM Convolution](media/docs/implicit_gemm_convolution.md) - describes 2-D and 3-D convolution in CUTLASS
+- [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
+- [Terminology](media/docs/terminology.md) - describes terms used in the code
+- [Programming Guidelines](media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
+- [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
+- [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
+- [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
+- [CUTLASS Profiler](media/docs/profiler.md) - command-line driven profiling application
+- [CUTLASS Utilities](media/docs/utilities.md) - additional templates used to facilitate rapid development
+
+# Resources
+We have also described the structure of an efficient GEMM in our talk at the
+[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
+
+ - [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
+ - [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
+ - [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
+ - [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
+ - [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)

 # Building CUTLASS

 CUTLASS is a header-only template library and does not need to be built to be used by other
-projects. However, we distribute extensive unit tests and utility programs to demonstrate
-CUTLASS. These instructions are for building those test programs.
+projects. Client applications should target CUTLASS's `include/` directory in their include
+paths.

-CUTLASS's unit tests depend on Google Test which exists as a git submodule. You can fetch
-submodules as follows.
+CUTLASS unit tests, examples, and utilities can be built with CMake version 3.12 or later.
+Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
+on your system.

-```
-$ git submodule update --init --recursive
+```bash
+$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
 ```

-CUTLASS can be build with CMake starting version 3.10. By default CUTLASS will build kernels
-for CUDA architecture versions 5.0, 6.0, 6.1 and 7.0. To reduce compile time you can specify
+Create a build directory within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
+for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
+To reduce compile time you can specify
 the architectures to build CUTLASS for by changing the CMake configuration setting
 `CUTLASS_NVCC_ARCHS`.

-Create a build directory within the CUTLASS project, then run CMake once.
-
-```
+```bash
 $ mkdir build && cd build
-$ cmake ..
+
+$ cmake .. -DCUTLASS_NVCC_ARCHS=80               # compiles for NVIDIA's Ampere Architecture
 ```

-Compile the CUTLASS project by running Make. Include the -j argument to compile sources in
-parallel and speed up the build process.
+From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.

-```
-$ make -j12
-...
-$
-```
+The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
+and they may be executed in parallel via make's `-j` command line argument.

-Verify CUTLASS has been built correctly by running the unit tests from the build/ directory.
-
-```
-$ ./tools/test/unit/cutlass_unit_test
+```bash
+$ make test_unit -j
 ...
 ...
 ...
 [----------] Global test environment tear-down
-[==========] 481 tests from 24 test cases ran. (5954 ms total)
-[  PASSED  ] 481 tests.
+[==========] 946 tests from 57 test cases ran. (10812 ms total)
+[  PASSED  ] 946 tests.
 ```

-All tests should pass, though the exact number of tests may vary over time.
+All tests should pass on supported platforms, though the exact number of tests may vary over time.


 # Project Structure

-CUTLASS is arranged as a header-only library with several example test programs
-that demonstrate instantiating a GEMM task within a CUDA kernel. The Doxygen documentation
-provides a complete list of files, classes, and template concepts defined in the CUTLASS
-project. A brief summary is described below.
+CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests. 
+[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes, 
+and template concepts defined in the CUTLASS project.

-The CUTLASS library is defined in the cutlass/ directory and consists of CUDA C++ template
-classes and other definitions for implementing efficient GPU GEMM kernels. A set of core
-classes and templates define basic primitives that are then applied to compute GEMM via
-templates in the cutlass/gemm directory.
+A detailed explanation of the source code organization may be found in the 
+[CUTLASS documentation](media/docs/code_organization.md), but several main components are summarized below.
+
+## CUTLASS Template Library

 ```
-cutlass/
-    gemm/
-    util/
-    <core API components>
+include/                     # client applications should target this directory in their build's include paths
+
+  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
+
+    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
+
+    conv/                    # code specialized for convolution
+
+    epilogue/                # code specialized for the epilogue of gemm/convolution
+
+    gemm/                    # code specialized for general matrix product computations
+
+    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
+
+    platform/                # CUDA-capable Standard Library components
+
+    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" model
+
+    thread/                  # simt code that can be performed within a CUDA thread
+    
+    transform/               # code specialized for layout, type, and domain transformations
+
+    *                        # core vocabulary types, containers, and basic numeric operations
+
+  cute/                      # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
+
+    algorithm/               # Definitions of core operations such as copy, gemm, and operations on cute::tuples
+
+    arch/                    # Bare bones PTX wrapper structs for copy and math instructions
+
+    atom/                    # Meta-information either link to or built from arch/ operators
+
+      mma_atom.hpp           # cute::Mma_Atom and cute::TiledMma
+
+      copy_atom.hpp          # cute::Copy_Atom and cute::TiledCopy
+
+      *sm*.hpp               # Arch specific meta-information for copy and math operations
+
+    *                        # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
+
 ```

-Several tools and test programs are also distributed with the CUTLASS library. They are
-contained in the following directories.
+### CUTLASS SDK Examples
+
+[CUTLASS SDK examples](/examples) apply CUTLASS templates to implement basic computations.
+
+### Tools

 ```
 tools/
-    test/
-        unit/
-            core/
-            gemm/
-        perf/
-    util/
-        <utilities>
+  library/                   # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
+    include/
+      cutlass/
+        library/
+
+  profiler/                  # CUTLASS Profiler         - command-line utility for executing operations in the
+                             #                            CUTLASS Library
+  
+  util/                      # CUTLASS Utilities        - contains numerous helper classes for
+    include/                 #                            managing tensors in device memory, reference
+      cutlass/               #                            implementations for GEMM, random initialization
+        util/                #                            of tensors, and I/O.
 ```

+### Test
+
 The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
 basic usage of Core API components and complete tests of the CUTLASS GEMM computations.

+Instructions for building and running the Unit tests are described in the [Quickstart guide](media/docs/quickstart.md).
+
 # Performance Profiling

-The `test/perf/` directory contains a command-line utility for launching each of the GEMM kernels.
-Its usage is shown below.
+The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
+It can be built as follows:

-Program usage:
+```bash
+$ make cutlass_profiler -j16
+```
+
+## Building all GEMM and Convolution kernels (_long_ build times)
+
+By default, only one tile size is instantiated for each data type, math instruction, and layout.
+To instantiate all, set the following CMake variable when running CMake from an empty `build/` directory.
+Beware, this results in *tens of thousands* of kernels and long build times.
+It also produces a large binary and, on some platforms, may cause the linker to fail when building the library.
+Therefore, it is highly recommended to generate only a subset of kernels, as demonstrated in the sub-section below.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
+...
+$ make cutlass_profiler -j16
+```
+
+## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
+
+To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with 
+wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
+kernel or a subset of kernels for the NVIDIA Ampere and Turing architectures:
+
+### Building a subset of Tensor Core GEMM kernels
+
+To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
+use the following CMake command line:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
+...
+$ make cutlass_profiler -j16
+```
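As a rough illustration of how a comma-delimited wildcard list narrows a set of kernel names, the sketch below applies shell-style globbing to whole names. This is an assumption made for illustration: the matching rules actually used by `CUTLASS_LIBRARY_KERNELS` may differ (for example, by also accepting substrings), and `select_kernels` is a hypothetical helper, not part of CUTLASS.

```python
from fnmatch import fnmatchcase


def select_kernels(names, patterns):
    """Keep the names that match at least one pattern in a comma-delimited list."""
    wanted = [p.strip() for p in patterns.split(",") if p.strip()]
    return [name for name in names if any(fnmatchcase(name, p) for p in wanted)]


candidates = [
    "cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8",
    "cutlass_simt_sgemm_128x128_8x2_nn_align1",
    # Hypothetical name, added only to show a near miss ("tn" instead of "nt").
    "cutlass_tensorop_s16816gemm_f16_128x128_32x3_tn_align8",
]

selected = select_kernels(candidates, "cutlass_tensorop_s*gemm_f16_*_nt_align8")
print(selected)  # ['cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8']
```

Under this reading, each `*` absorbs the instruction shape and tile description, so one pattern covers every tile size for a given data type and layout.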
+
+Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
+```bash
+./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
+
+...
+=============================
+  Problem ID: 1
+
+        Provider: CUTLASS
+   OperationKind: gemm
+       Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
+
+          Status: Success
+    Verification: ON
+     Disposition: Passed
+
+reference_device: Passed
+          cuBLAS: Passed
+
+       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1  \
+                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128  \
+                  --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75  \
+                  --max_cc=1024
+
+           Bytes: 118489088  bytes
+           FLOPs: 115992428544  flops
+
+         Runtime: 1.55948  ms
+          Memory: 70.7616 GiB/s
+
+            Math: 74378.8 GFLOP/s
+
+
+
+=============================
+...
+```
+
+### Building one CUDA Core GEMM kernel
+
+To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
+...
+$ make cutlass_profiler -j16
+```
+
+Example command line for profiling a single SGEMM CUDA kernel is as follows:
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
+
+=============================
+  Problem ID: 1
+
+        Provider: CUTLASS
+   OperationKind: gemm
+       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
+
+          Status: Success
+    Verification: ON
+     Disposition: Passed
+
+          cuBLAS: Passed
+
+       Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \
+                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
+                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
+
+           Bytes: 180355072  bytes
+           FLOPs: 115992428544  flops
+
+         Runtime: 6.73655  ms
+          Memory: 24.934 GiB/s
+
+            Math: 17218.4 GFLOP/s
+
+=============================
+```
+
+### Building a subset of Tensor Core Convolution kernels
+
+To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation 
+and FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
+...
+$ make cutlass_profiler -j16
+```
+
+Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
+
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
+
+...
+=============================
+  Problem ID: 1
+
+        Provider: CUTLASS
+   OperationKind: conv2d
+       Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
+
+          Status: Success
+    Verification: ON
+     Disposition: Passed
+
+reference_device: Passed
+
+       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
+                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc  \
+                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
+                  --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5  \
+                  --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
+
+           Bytes: 1130659840  bytes
+           FLOPs: 118482796544  flops
+
+         Runtime: 0.711496  ms
+          Memory: 1479.99 GiB/s
+
+            Math: 166526 GFLOP/s
+
+=============================
+...
+```
+
+
+### Building one Convolution CUDA kernel
+
+To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation 
+and FP32 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
+...
+$ make cutlass_profiler -j16
+```
+
+Example command line for profiling one CUDA Core convolution kernel:
+
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
+
+
+=============================
+  Problem ID: 1
+
+        Provider: CUTLASS
+   OperationKind: conv2d
+       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
+
+          Status: Success
+    Verification: ON
+     Disposition: Passed
+
+reference_device: Passed
+
+       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
+                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \
+                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
+                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
+                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
+
+           Bytes: 2055798784  bytes
+           FLOPs: 118482796544  flops
+
+         Runtime: 7.34266  ms
+          Memory: 260.752 GiB/s
+
+            Math: 16136.2 GFLOP/s
+
+
+=============================

 ```
-  cutlass_perf_test [options]

-    --help
-    --append=<true|false*>                            If true, appends output to existing CSV file. If false, overwrites.
-    --alpha=<alpha>                                   Value for alpha to be used in GEMM experiments
-    --beta=<beta>                                     Value for beta to be used in GEMM experiments
-    --dist=<distribution>                             Describes the random distribution of each of the input matrix operands.
-    --execution_mode=<mode>                           Specifies execution mode: profile, verify, single
-    --output=<filename.csv>                           Writes summary of profiling to specified .csv file
-    --iterations=<timing iterations>                  maximum number of iterations to execute when profiling
-    --m=<height>[:max height[:step]]                  Height of GEMM problem (number of rows of C). May specify a range with optional step size.
-    --n=<width>[:max width[:step]]                    Width of GEMM problem (number of columns of C). May specify a range with optional step size.
-    --k=<depth>[:max depth[:step]]                    Size of inner dimension of A and B. May specify a range with optional step size.
-    --kernels=<{s|d|h|i|wmma}gemm_{nn,nt,tn,tt}>      Select GEMM datatype and layout to use for tests
-    --peak=<bool>                                     If true, only reports peak performance per kernel after profiling specified problem space.
-    --save_workspace={*never,incorrect,always}        Specifies when to save the GEMM inputs and results to the filesystem.
-    --seed=<seed>                                     Random seed used by the random number generator in initializing input matrices.
-    --tags=<column:tag,...>                           Inserts leading columns in output table and uniform values for each column.
+## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
+- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
+  - [GEMM CMake Examples](media/docs/quickstart.md#gemm-cmake-examples) 
+  - [Implicit GEMM convolution CMake Examples](media/docs/quickstart.md#convolution-cmake-examples)
+- [Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)


-  Example usage:
-
-  # Runs one problem size for all kernels
-  $ ./tools/test/perf/cutlass_perf_test --m=10240 --n=1024 --k=1024
-
-  # Varies GEMM K dimension for SGEMM and IGEMM with column-major multiplicands
-  $ ./tools/test/perf/cutlass_perf_test --m=10240 --n=4096 --k=1024:8192:128 --kernels=sgemm_nn,igemm_nn
-```
-
 # About

-CUTLASS is released by NVIDIA Corporation as Open Source software under the
-3-clause "New" BSD license.
+CUTLASS is released by NVIDIA Corporation as Open Source software under the 
+[3-clause "New" BSD license](LICENSE.txt).

+# Contributors
+
+The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).

 # Copyright

-Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
+Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: BSD-3-Clause

 ```
-  Redistribution and use in source and binary forms, with or without modification, are permitted
-  provided that the following conditions are met:
-      * Redistributions of source code must retain the above copyright notice, this list of
-        conditions and the following disclaimer.
-      * Redistributions in binary form must reproduce the above copyright notice, this list of
-        conditions and the following disclaimer in the documentation and/or other materials
-        provided with the distribution.
-      * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
-        to endorse or promote products derived from this software without specific prior written
-        permission.
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions are met:

-  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
-  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
-  FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
-  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
-  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
-  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
-  STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+  1. Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+  3. Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 ```
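
Reviewer note on the README hunk above: the `Bytes`, `FLOPs`, `Memory`, and `Math` lines in the two GEMM profiler listings can be cross-checked by hand. The sketch below is not CUTLASS code; the helper names are hypothetical, and the formulas are an assumption that happens to reproduce the reported figures (2·m·n·k for the mainloop plus 2·m·n for the epilogue, and one pass over each of A, B, and C).

```cpp
#include <cstdint>

// Hypothetical helpers (not part of CUTLASS) reproducing the arithmetic behind
// the profiler's "Bytes" and "FLOPs" lines for D = alpha * A * B + beta * C.

// Assumed FLOP count: 2*m*n*k for the multiply-accumulates, plus 2*m*n for the
// alpha/beta scaling in the epilogue.
constexpr std::int64_t gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
  return 2 * m * n * k + 2 * m * n;
}

// Assumed byte count: A is m x k, B is k x n, C is m x n, each touched once.
constexpr std::int64_t gemm_bytes(std::int64_t m, std::int64_t n, std::int64_t k,
                                  std::int64_t bytes_a, std::int64_t bytes_b,
                                  std::int64_t bytes_c) {
  return m * k * bytes_a + k * n * bytes_b + m * n * bytes_c;
}

// Throughput in GFLOP/s from a FLOP count and a runtime in milliseconds.
constexpr double gflops_per_second(std::int64_t flops, double runtime_ms) {
  return static_cast<double>(flops) / (runtime_ms * 1.0e-3) / 1.0e9;
}

// Bandwidth in GiB/s from a byte count and a runtime in milliseconds.
constexpr double gib_per_second(std::int64_t bytes, double runtime_ms) {
  return static_cast<double>(bytes) / (runtime_ms * 1.0e-3) / (1024.0 * 1024.0 * 1024.0);
}

// Problem size used in both GEMM listings: --m=3456 --n=4096 --k=4096.
static_assert(gemm_flops(3456, 4096, 4096) == 115992428544LL, "FLOPs line");
// Tensor Core listing: A and B are f16 (2 bytes), C is f32 (4 bytes).
static_assert(gemm_bytes(3456, 4096, 4096, 2, 2, 4) == 118489088LL, "Bytes line, f16 inputs");
// SGEMM listing: all operands are f32 (4 bytes).
static_assert(gemm_bytes(3456, 4096, 4096, 4, 4, 4) == 180355072LL, "Bytes line, f32 inputs");
```

Dividing by the reported runtime of 1.55948 ms with the two throughput helpers lands close to the listed 74378.8 GFLOP/s and 70.7616 GiB/s; small differences are expected because the printed runtime is rounded.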

--- a/CMake/bin2hex.cmake
+++ b/CMake/bin2hex.cmake
@ -13,8 +13,8 @@ function(FILE_TO_C_STRING FILENAME VARIABLE_NAME OUTPUT_STRING ZERO_TERMINATED)
  set(${OUTPUT_STRING} "${HEX_OUTPUT}" PARENT_SCOPE)
 endfunction()

-message("Create header file for ${FILE_IN}")
-message("Create header file for ${FILE_OUT}")
+# message("Create header file for ${FILE_IN}")
+# message("Create header file for ${FILE_OUT}")
 file_to_c_string(${FILE_IN} ${VARIABLE_NAME} OUTPUT_STRING ZERO_TERMINATED)

 set(RESULT "#pragma once\n")
--- a/clang-format.sh
+++ b/clang-format.sh
@ -1,17 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function formatFiles {
-    for f in `find "$1" -type f -name "*.$2"` ; do
-        COMMAND="clang-format -i $f"
-        echo $COMMAND
-        $COMMAND
-    done
-}
-
-formatFiles "cutlass" "h"
-formatFiles "tools/test" "h"
-formatFiles "tools/test" "cpp"
-formatFiles "tools/util" "h"
-
--- a/cmake/CTestTestfile.config.cmake
+++ b/cmake/CTestTestfile.config.cmake
@ -0,0 +1,21 @@
+# Generated file
+
+if (DEFINED ENV{CUTLASS_TEST_EXECUTION_ENVIRONMENT})
+  set(_CUTLASS_TEST_EXECUTION_ENVIRONMENT $ENV{CUTLASS_TEST_EXECUTION_ENVIRONMENT})
+else()
+  set(_CUTLASS_TEST_EXECUTION_ENVIRONMENT @CUTLASS_TEST_EXECUTION_ENVIRONMENT@)
+endif()
+
+if (NOT "@TEST_EXE_DIR@" STREQUAL "")
+  set(TEST_EXE_PATH @TEST_EXE_DIR@/@TEST_EXE@)
+else()
+  set(TEST_EXE_PATH @TEST_EXE@)
+endif()
+
+add_test("@TEST_NAME@" ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@)
+
+if (NOT "@TEST_EXE_WORKING_DIRECTORY@" STREQUAL "")
+  set_tests_properties("@TEST_NAME@" PROPERTIES WORKING_DIRECTORY "@TEST_EXE_WORKING_DIRECTORY@")
+endif()
+
+set_tests_properties(@TEST_NAME@ PROPERTIES DISABLED @__DISABLE_TESTS@)
--- a/cmake/NvidiaCutlassConfig.cmake
+++ b/cmake/NvidiaCutlassConfig.cmake
@ -0,0 +1,7 @@
+get_filename_component(NvidiaCutlass_CMAKE_DIR "${CMAKE_CURRENT_LIST_FILE}" PATH)
+
+include(CMakeFindDependencyMacro)
+
+if(NOT TARGET nvidia::cutlass::CUTLASS)
+    include("${NvidiaCutlass_CMAKE_DIR}/NvidiaCutlassTargets.cmake")
+endif()
--- a/cmake/NvidiaCutlassPackageConfig.cmake
+++ b/cmake/NvidiaCutlassPackageConfig.cmake
@ -0,0 +1,14 @@
+set(CPACK_PACKAGE_NAME NvidiaCutlass)
+set(CPACK_PACKAGE_VENDOR NVIDIA)
+set(CPACK_PACKAGE_CONTACT info@nvidia.com)
+set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "CUTLASS CUDA C++ Template Linear Algebra Library")
+set(CPACK_PACKAGE_INSTALL_DIRECTORY ${CPACK_PACKAGE_NAME})
+set(CPACK_PACKAGE_VERSION_MAJOR ${PROJECT_VERSION_MAJOR})
+set(CPACK_PACKAGE_VERSION_MINOR ${PROJECT_VERSION_MINOR})
+set(CPACK_PACKAGE_VERSION_PATCH ${PROJECT_VERSION_PATCH})
+set(CPACK_VERBATIM_VARIABLES YES)
+# set(CPACK_PACKAGE_DESCRIPTION_FILE ${CMAKE_CURRENT_LIST_DIR}/Description.txt)
+# set(CPACK_RESOURCE_FILE_WELCOME ${CMAKE_CURRENT_LIST_DIR}/Welcome.txt)
+# set(CPACK_RESOURCE_FILE_LICENSE ${CMAKE_CURRENT_LIST_DIR}/License.txt)
+# set(CPACK_RESOURCE_FILE_README ${CMAKE_CURRENT_LIST_DIR}/Readme.txt)
+include(CPack)
--- a/cmake/googletest.cmake
+++ b/cmake/googletest.cmake
@ -0,0 +1,23 @@
+include(FetchContent)
+
+set(GOOGLETEST_DIR "" CACHE STRING "Location of local GoogleTest repo to build against")
+
+if(GOOGLETEST_DIR)
+  set(FETCHCONTENT_SOURCE_DIR_GOOGLETEST ${GOOGLETEST_DIR} CACHE STRING "GoogleTest source directory override")
+endif()
+
+FetchContent_Declare(
+  googletest
+  GIT_REPOSITORY https://github.com/google/googletest.git
+  GIT_TAG        0fe9660
+  )
+
+FetchContent_GetProperties(googletest)
+
+if(NOT googletest_POPULATED)
+  FetchContent_Populate(googletest)
+  if (MSVC)
+    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
+  endif()
+  add_subdirectory(${googletest_SOURCE_DIR} ${googletest_BINARY_DIR} EXCLUDE_FROM_ALL)
+endif()
--- a/cmake/nop.cu
+++ b/cmake/nop.cu
@ -0,0 +1,49 @@
+/***************************************************************************************************
+ * Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+/*! \file
+    \brief Basic CUDA file for testing compiler flags.
+*/
+
+__device__ int inner()
+{
+  return -1;
+}
+
+__global__ void test()
+{
+  inner();
+}
+
+int main()
+{
+  test<<<1,1>>>();
+  return 0;
+}
--- a/cmake/version.h.in
+++ b/cmake/version.h.in
@ -0,0 +1,38 @@
+#include <cstdint>
+#include <string>
+
+#define CUTLASS_MAJOR @CUTLASS_VERSION_MAJOR@
+#define CUTLASS_MINOR @CUTLASS_VERSION_MINOR@
+#define CUTLASS_PATCH @CUTLASS_VERSION_PATCH@
+#define CUTLASS_BUILD @CUTLASS_VERSION_BUILD@
+#define CUTLASS_VERSION ((CUTLASS_MAJOR)*100 + (CUTLASS_MINOR)*10 + CUTLASS_PATCH)
+
+namespace cutlass {
+
+  inline uint32_t getVersion() {
+    return CUTLASS_VERSION;
+  }
+  inline uint32_t getVersionMajor() {
+    return CUTLASS_MAJOR;
+  }
+  inline uint32_t getVersionMinor() {
+    return CUTLASS_MINOR;
+  }
+  inline uint32_t getVersionPatch() {
+    return CUTLASS_PATCH;
+  }
+  inline uint32_t getVersionBuild() {
+    return CUTLASS_BUILD + 0;
+  }
+  inline std::string getVersionString() {
+    std::string version = "@CUTLASS_VERSION@";
+    if (getVersionBuild()) {
+      version += "." + std::to_string(getVersionBuild());
+    }
+    return version;
+  }
+  inline std::string getGitRevision() {
+    return "@CUTLASS_REVISION@";
+  }
+
+} // namespace cutlass
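
Reviewer note on `cmake/version.h.in` above: `CUTLASS_VERSION` packs the three components as `major*100 + minor*10 + patch`. The sketch below mirrors that formula in a hypothetical `encode_version` function (not part of the header) to show what the packed value looks like and where it stops being unique. The version triples are illustrative inputs, not a statement about which releases exist.

```cpp
#include <cstdint>

// Hypothetical stand-in for the CUTLASS_VERSION macro: same formula, but the
// components are function arguments instead of configure-time substitutions.
constexpr std::uint32_t encode_version(std::uint32_t major, std::uint32_t minor,
                                       std::uint32_t patch) {
  return major * 100 + minor * 10 + patch;
}

// With single-digit minor and patch values the packed number reads like the version.
static_assert(encode_version(3, 0, 0) == 300, "3.0.0 packs to 300");
static_assert(encode_version(2, 9, 1) == 291, "2.9.1 packs to 291");

// Once minor (or patch) reaches 10, two different triples can share one value.
static_assert(encode_version(2, 10, 0) == encode_version(3, 0, 0),
              "2.10.0 and 3.0.0 both pack to 300");
```

Under this formula, ordering comparisons on the packed value are only reliable while minor and patch stay below 10; code that needs a strict ordering could compare `getVersionMajor()`, `getVersionMinor()`, and `getVersionPatch()` individually instead.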
--- a/cuBLAS.cmake
+++ b/cuBLAS.cmake
@ -0,0 +1,152 @@
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+message(STATUS "Configuring cublas ...")
+
+if((DEFINED CUTLASS_ENABLE_CUBLAS AND NOT CUTLASS_ENABLE_CUBLAS) OR
+   (DEFINED CUBLAS_ENABLED AND NOT CUBLAS_ENABLED))
+  
+  # Don't add cuBLAS if it's defined and false, assume it's not found.
+
+  set(CUBLAS_FOUND OFF)
+  message(STATUS "cuBLAS Disabled.")
+
+elseif(NOT TARGET cublas)
+ 
+  find_path(
+    _CUBLAS_INCLUDE_DIR
+    NAMES cublas_v2.h
+    HINTS
+      ${CUBLAS_INCLUDE_PATH}
+      ENV CUBLAS_INCLUDE_PATH
+      ${CUBLAS_PATH}
+      ENV CUBLAS_PATH
+      ${CUDA_TOOLKIT_ROOT_DIR}
+    PATH_SUFFIXES
+      include
+    )
+
+  find_library(
+    _CUBLAS_LIBRARY
+    NAMES cublas
+    HINTS
+      ${CUBLAS_LIBRARY_PATH}
+      ENV CUBLAS_LIBRARY_PATH
+      ${_CUBLAS_INCLUDE_DIR}/..
+      ${CUBLAS_PATH}
+      ENV CUBLAS_PATH
+      ${CUDA_TOOLKIT_ROOT_DIR}
+    PATH_SUFFIXES
+      lib64
+      lib/x64
+      lib
+    )
+
+  if(_CUBLAS_INCLUDE_DIR AND _CUBLAS_LIBRARY)
+
+    message(STATUS "cuBLAS: ${_CUBLAS_LIBRARY}")
+    message(STATUS "cuBLAS: ${_CUBLAS_INCLUDE_DIR}")
+    
+    set(CUBLAS_FOUND ON CACHE INTERNAL "cublas Library Found")
+    set(CUBLAS_LIBRARY ${_CUBLAS_LIBRARY})
+    set(CUBLAS_INCLUDE_DIR ${_CUBLAS_INCLUDE_DIR})
+
+  else()
+
+    message(STATUS "cublas not found.")
+    set(CUBLAS_FOUND OFF CACHE INTERNAL "cublas Library Found")
+
+  endif()
+
+endif()
+
+set(CUTLASS_ENABLE_CUBLAS ${CUBLAS_FOUND} CACHE BOOL "Enable CUTLASS to build with cuBLAS library.")
+
+if(CUTLASS_ENABLE_CUBLAS AND NOT CUBLAS_FOUND)
+  message(FATAL_ERROR "CUTLASS_ENABLE_CUBLAS enabled but cuBLAS library could not be found.")
+endif()
+
+if(CUTLASS_ENABLE_CUBLAS AND NOT TARGET cublas)
+
+  if(WIN32)
+    add_library(cublas STATIC IMPORTED GLOBAL)
+  else()
+    add_library(cublas SHARED IMPORTED GLOBAL)
+  endif()
+
+  add_library(nvidia::cublas ALIAS cublas)
+
+  set_property(
+    TARGET cublas
+    PROPERTY IMPORTED_LOCATION
+    ${CUBLAS_LIBRARY})
+    
+  target_include_directories(
+    cublas
+    INTERFACE
+    $<INSTALL_INTERFACE:include>
+    $<BUILD_INTERFACE:${CUBLAS_INCLUDE_DIR}>)
+
+  find_library(
+    _CUBLASLT_LIBRARY
+    NAMES cublasLt
+    HINTS
+      ${CUBLAS_LIBRARY_PATH}
+      ENV CUBLAS_LIBRARY_PATH
+      ${_CUBLAS_INCLUDE_DIR}/..
+      ${CUBLAS_PATH}
+      ENV CUBLAS_PATH
+      ${CUDA_TOOLKIT_ROOT_DIR}
+    PATH_SUFFIXES
+      lib64
+      lib/x64
+      lib
+    )
+
+  if(_CUBLASLT_LIBRARY AND NOT TARGET cublasLt)
+
+    if(WIN32)
+      add_library(cublasLt STATIC IMPORTED GLOBAL)
+    else()
+      add_library(cublasLt SHARED IMPORTED GLOBAL)
+    endif()
+    
+    set_property(
+      TARGET cublasLt
+      PROPERTY IMPORTED_LOCATION
+      ${_CUBLASLT_LIBRARY})
+  
+    add_library(nvidia::cublasLt ALIAS cublasLt)
+
+    target_link_libraries(cublas INTERFACE cublasLt)
+
+  endif()
+
+endif()
+
+message(STATUS "Configuring cuBLAS ... done.")
--- a/cuDNN.cmake
+++ b/cuDNN.cmake
@ -0,0 +1,112 @@
+# Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+if(DEFINED CUDNN_ENABLED)
+    set(CUTLASS_ENABLE_CUDNN ${CUDNN_ENABLED} CACHE BOOL "Enable CUTLASS to build with cuDNN library.")
+endif()
+
+if(DEFINED CUTLASS_ENABLE_CUDNN AND NOT CUTLASS_ENABLE_CUDNN)
+  return()
+endif()
+  
+message(STATUS "Configuring cuDNN ...")
+
+find_path(
+    _CUDNN_INCLUDE_DIR cudnn.h
+    PATHS
+    ${CUDA_TOOLKIT_ROOT_DIR}/include
+    $ENV{CUDNN_PATH}/include
+    $ENV{CUDA_PATH}/include
+    ${CUDNN_PATH}/include
+    /usr/include)
+
+find_library(
+    _CUDNN_LIBRARY cudnn
+    HINTS
+    ${CUDA_TOOLKIT_ROOT_DIR}/lib64
+    ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
+    ${CUDA_TOOLKIT_ROOT_DIR}/lib
+    $ENV{CUDNN_PATH}/lib64
+    $ENV{CUDNN_PATH}/lib/x64
+    $ENV{CUDNN_PATH}/lib
+    $ENV{CUDA_PATH}/lib64
+    $ENV{CUDA_PATH}/lib/x64
+    $ENV{CUDA_PATH}/lib
+    ${CUDNN_PATH}/lib64
+    ${CUDNN_PATH}/lib/x64
+    ${CUDNN_PATH}/lib
+    /usr/lib/x86_64-linux-gnu
+    /usr/lib)
+
+if(_CUDNN_INCLUDE_DIR AND _CUDNN_LIBRARY)
+
+    message(STATUS "cuDNN: ${_CUDNN_LIBRARY}")
+    message(STATUS "cuDNN: ${_CUDNN_INCLUDE_DIR}")
+    
+    set(CUDNN_FOUND ON CACHE INTERNAL "cuDNN Library Found")
+
+else()
+
+    message(STATUS "cuDNN not found.")
+    set(CUDNN_FOUND OFF CACHE INTERNAL "cuDNN Library Found")
+
+endif()
+
+set(CUTLASS_ENABLE_CUDNN ${CUDNN_FOUND} CACHE BOOL "Enable CUTLASS to build with cuDNN library.")
+
+if (CUTLASS_ENABLE_CUDNN AND NOT TARGET cudnn)
+
+  set(CUDNN_INCLUDE_DIR ${_CUDNN_INCLUDE_DIR})
+  set(CUDNN_LIBRARY ${_CUDNN_LIBRARY})
+
+  if(WIN32)
+    add_library(cudnn STATIC IMPORTED GLOBAL)
+  else()
+    add_library(cudnn SHARED IMPORTED GLOBAL)
+  endif()
+
+  add_library(nvidia::cudnn ALIAS cudnn)
+
+  set_property(
+    TARGET cudnn
+    PROPERTY IMPORTED_LOCATION
+    ${CUDNN_LIBRARY})
+    
+  target_include_directories(
+    cudnn
+    INTERFACE
+    $<INSTALL_INTERFACE:include>
+    $<BUILD_INTERFACE:${CUDNN_INCLUDE_DIR}>)
+
+endif()
+
+if(CUTLASS_ENABLE_CUDNN AND NOT CUDNN_FOUND)
+  message(FATAL_ERROR "CUTLASS_ENABLE_CUDNN enabled but cuDNN library could not be found.")
+endif()
+
+message(STATUS "Configuring cuDNN ... done.")
--- a/cutlass/convert.h
+++ b/cutlass/convert.h
@ -1,102 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*!
-  \file
-  \brief Defines conversion operations among Fragments of different base type.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename InputFragment_, typename OutputFragment_>
-struct Convert {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename InputScalar_, typename OutputScalar_, int kScalars_>
-struct Convert<Fragment<InputScalar_, kScalars_>, Fragment<OutputScalar_, kScalars_> > {
-  /// The input fragment.
-  typedef Fragment<InputScalar_, kScalars_> InputFragment;
-  /// The output fragment.
-  typedef Fragment<OutputScalar_, kScalars_> OutputFragment;
-
-  /// Ctor.
-  CUTLASS_DEVICE Convert() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
-    transform(src, 0, dst);
-  }
-
-  /// Transform a fragment.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
-    for (int i = 0; i < kScalars_; ++i) {
-      dst[i] = static_cast<OutputScalar_>(src[i + offset]);
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Fragment_>
-struct Copy {
-  /// The input fragment.
-  typedef Fragment_ InputFragment;
-  /// The output fragment.
-  typedef Fragment_ OutputFragment;
-
-  /// Ctor.
-  CUTLASS_DEVICE Copy() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(Fragment_ const& src, Fragment_& dst) { transform(src, 0, dst); }
-
-  /// Transform a fragment.
-  template <typename InputFragment_>
-  CUTLASS_DEVICE void transform(InputFragment_ const& src, int offset, Fragment_& dst) {
-    if (sizeof(typename Fragment_::Element) == 8) {
-      uint64_t const* src_ptr = reinterpret_cast<uint64_t const*>(&src[offset]);
-      uint64_t* dst_ptr = reinterpret_cast<uint64_t*>(&dst[0]);
-      for (int i = 0; i < sizeof(Fragment_) / 8; ++i) {
-        dst_ptr[i] = src_ptr[i];
-      }
-    } else {
-      uint32_t const* src_ptr = reinterpret_cast<uint32_t const*>(&src[offset]);
-      uint32_t* dst_ptr = reinterpret_cast<uint32_t*>(&dst[0]);
-      for (int i = 0; i < sizeof(Fragment_) / 4; ++i) {
-        dst_ptr[i] = src_ptr[i];
-      }
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/coord.h
+++ b/cutlass/coord.h
@ -1,287 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief A Coord is a coordinate of arbitrary rank into a tensor or matrix
-*/
-
-#pragma once
-
-#include <cutlass/cutlass.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Describes identity elements
-struct Identity {
-  /// Enumeration describing identity elements. Value assignments are significant.
-  /// Feel free to add or multiply by these, respectively.
-  enum Kind { Additive = 0, Multiplicative = 1 };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Statically-sized array specifying Coords within a tensor
-template <int N_>
-struct Coord {
-  //
-  // Type and constant definitions
-  //
-
-  static int const N = N_;
-
-  //
-  // Data members
-  //
-
-  /// Indices
-  int idx[N];
-
-  //
-  // Methods
-  //
-
-  /// Default ctor initializes uniformly
-  CUTLASS_HOST_DEVICE
-  Coord(int value = 0) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] = value;
-    }
-  }
-
-  /// Constructs from an array of integers
-  CUTLASS_HOST_DEVICE
-  Coord(int _idx[]) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] = _idx[i];
-    }
-  }
-
-  /// Element-wise addition
-  CUTLASS_HOST_DEVICE
-  Coord operator+(Coord const& b) const {
-    Coord c;
-    for (int i = 0; i < N; ++i) {
-      c.idx[i] = idx[i] + b.idx[i];
-    }
-    return c;
-  }
-
-  /// Element-wise subtraction
-  CUTLASS_HOST_DEVICE
-  Coord operator-(Coord const& b) const {
-    Coord c;
-    for (int i = 0; i < N; ++i) {
-      c.idx[i] = idx[i] - b.idx[i];
-    }
-    return c;
-  }
-
-  /// Element-wise multiplication
-  CUTLASS_HOST_DEVICE
-  Coord operator*(Coord const& b) const {
-    Coord c;
-    for (int i = 0; i < N; ++i) {
-      c.idx[i] = idx[i] * b.idx[i];
-    }
-    return c;
-  }
-
-  /// Element-wise division
-  CUTLASS_HOST_DEVICE
-  Coord operator/(Coord const& b) const {
-    Coord c;
-    for (int i = 0; i < N; ++i) {
-      c.idx[i] = idx[i] / b.idx[i];
-    }
-    return c;
-  }
-
-  /// In-place addition
-  CUTLASS_HOST_DEVICE
-  Coord& operator+=(Coord const& b) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] += b.idx[i];
-    }
-    return *this;
-  }
-
-  /// In-place subtraction
-  CUTLASS_HOST_DEVICE
-  Coord& operator-=(Coord const& b) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] -= b.idx[i];
-    }
-    return *this;
-  }
-
-  /// In-place multiplication
-  CUTLASS_HOST_DEVICE
-  Coord& operator*=(Coord const& b) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] *= b.idx[i];
-    }
-    return *this;
-  }
-
-  /// In-place division
-  CUTLASS_HOST_DEVICE
-  Coord& operator/=(Coord const& b) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] /= b.idx[i];
-    }
-    return *this;
-  }
-
-  /// Member access operator
-  CUTLASS_HOST_DEVICE int& operator[](int dim) { return idx[dim]; }
-
-  /// Member access operator
-  CUTLASS_HOST_DEVICE int const& operator[](int dim) const { return idx[dim]; }
-
-  /// Computes the dot product of two Coord instances
-  template <typename T>
-  CUTLASS_HOST_DEVICE T dot(Coord const& b, T sum) const {
-    for (int i = 0; i < N; ++i) {
-      sum += idx[i] * b.idx[i];
-    }
-    return sum;
-  }
-
-  /// Computes the dot product of two Coord instances
-  template <typename T>
-  CUTLASS_HOST_DEVICE T dot(Coord const& b) const {
-    T sum = T(0);
-    for (int i = 0; i < N; ++i) {
-      sum += idx[i] * b.idx[i];
-    }
-    return sum;
-  }
-
-  /// Gets the index of a given Coord element
-  template <int Dim>
-  CUTLASS_HOST_DEVICE int& at() {
-    return idx[Dim];
-  }
-
-  /// Access via index; may limit unrolling potential
-  CUTLASS_HOST_DEVICE
-  int& at(int dim) { return idx[dim]; }
-
-  /// Gets the index of a given Coord element
-  template <int Dim>
-  CUTLASS_HOST_DEVICE int const& at() const {
-    return idx[Dim];
-  }
-
-  /// Access via index; may limit unrolling potential
-  CUTLASS_HOST_DEVICE
-  int const& at(int dim) const { return idx[dim]; }
-
-  /// Determines if two Coord<> objects are equal
-  CUTLASS_HOST_DEVICE
-  bool operator==(Coord<N> const& b) const {
-    bool equal = true;
-    for (int i = 0; equal && i < N; ++i) {
-      equal = (idx[i] == b.idx[i]);
-    }
-    return equal;
-  }
-
-  /// Not equal
-  CUTLASS_HOST_DEVICE
-  bool operator!=(Coord<N> const& b) const { return !(*this == b); }
-
-  /// Clamps a coordinate to a range specified by maximum and minimum values
-  CUTLASS_HOST_DEVICE
-  Coord& clamp(Coord<N> const& max, Coord<N> const& min = Coord<N>()) {
-    for (int i = 0; i < N; ++i) {
-      idx[i] = __NV_STD_MAX(__NV_STD_MIN(idx[i], max.idx[i]), min.idx[i]);
-    }
-    return *this;
-  }
-
-  /// Returns the product of all elements
-  CUTLASS_HOST_DEVICE
-  int count() const {
-    int product = idx[0];
-    for (int i = 1; i < N; ++i) {
-      product *= idx[i];
-    }
-    return product;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Helper to make a 1-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<1> make_Coord(int _0) {
-  int values[1] = {_0};
-  return Coord<1>(values);
-}
-
-/// Helper to make a 2-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<2> make_Coord(int _0, int _1) {
-  int values[2] = {_0, _1};
-  return Coord<2>(values);
-}
-
-/// Helper to make a 3-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<3> make_Coord(int _0, int _1, int _2) {
-  int values[3] = {_0, _1, _2};
-  return Coord<3>(values);
-}
-
-/// Helper to make a 4-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<4> make_Coord(int _0, int _1, int _2, int _3) {
-  int values[4] = {_0, _1, _2, _3};
-  return Coord<4>(values);
-}
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Returns elements 1 and 2 (h, w) of a 3-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<2> get_Coord_hw(Coord<3> const& coord) { return make_Coord(coord[1], coord[2]); }
-
-/// Returns elements 1 and 2 (h, w) of a 4-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<2> get_Coord_hw(Coord<4> const& coord) { return make_Coord(coord[1], coord[2]); }
-
-/// Returns elements 1, 2 and 3 (h, w, c) of a 4-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<3> get_Coord_hwc(Coord<4> const& coord) { return make_Coord(coord[1], coord[2], coord[3]); }
-
-/// Returns elements 0, 1 and 2 (d, h, w) of a 4-element coordinate
-CUTLASS_HOST_DEVICE
-Coord<3> get_Coord_dhw(Coord<4> const& coord) { return make_Coord(coord[0], coord[1], coord[2]); }
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/core_io.h
+++ b/cutlass/core_io.h
@ -1,44 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-#pragma once
-
-/*! \file
-    \brief Helpers for printing cutlass/core objects
-*/
-
-#pragma once
-
-#include <iosfwd>
-#include <typeinfo>
-
-#include <cutlass/coord.h>
-
-template <int Rank>
-std::ostream& operator<<(std::ostream& out, cutlass::Coord<Rank> const& coord) {
-  for (int i = 0; i < Rank; ++i) {
-    out << (i ? ", " : "") << coord.idx[i];
-  }
-  return out;
-}
--- a/cutlass/cutlass.h
+++ b/cutlass/cutlass.h
@ -1,73 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-
-/*! \file
-    \brief Basic include for CUTLASS macros
-*/
-
-#pragma once
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#define CUTLASS_MAJOR 1
-#define CUTLASS_MINOR 0
-#define CUTLASS_PATCH 0
-#define CUTLASS_VERSION ((CUTLASS_MAJOR)*100 + (CUTLASS_MINOR)*10 + CUTLASS_PATCH)
-
-#ifdef __NVCC__
-#define CUTLASS_HOST_DEVICE __forceinline__ __device__ __host__
-#define CUTLASS_DEVICE __forceinline__ __device__
-#elif defined(__CUDACC_RTC__)
-#define CUTLASS_HOST_DEVICE __forceinline__ __device__
-#define CUTLASS_DEVICE __forceinline__ __device__
-#else
-#define CUTLASS_HOST_DEVICE
-// CUTLASS_DEVICE is an error if not compiling device code
-#endif
-
-// CUTLASS_PRAGMA_UNROLL expands to an unroll pragma if supported by the compiler
-#if defined(__CUDA_ARCH__)
-#if defined(_MSC_VER)
-#define CUTLASS_PRAGMA_UNROLL __pragma("unroll")
-#define CUTLASS_PRAGMA_NO_UNROLL __pragma("unroll 1")
-#else
-#define CUTLASS_PRAGMA_UNROLL _Pragma("unroll")
-#define CUTLASS_PRAGMA_NO_UNROLL _Pragma("unroll 1")
-#endif
-#else
-#define CUTLASS_PRAGMA_UNROLL
-#define CUTLASS_PRAGMA_NO_UNROLL
-#endif
-
-#define CUTLASS_ASSERT(x) assert(x)
-
-namespace cutlass {
-
-/// NVIDIA GPU Warp size
-static const int kWarpSize = 32;
-
-}  // namespace cutlass
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
--- a/cutlass/fragment.h
+++ b/cutlass/fragment.h
@ -1,278 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines Fragment, a statically-sized array for storing parts of matrices within a
-   thread's registers.
-*/
-#pragma once
-
-#include <assert.h>
-#include <cutlass/shape.h>
-#include <cutlass/util/cutlass_math.h>
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup fragment_concept Fragment Concept
-@{
-
-\ref fragment_concept is a statically sized array for storing parts of tiles held by individual CUDA
-threads.
-
-@par \ref fragment_concept
-   Types satisfying \ref fragment_concept define the following members
-  - <b>Element</b> - type of each element held within the fragment
-  - <b>kElements</b> - number of elements stored by the fragment
-  - <b>clear()</b> - overwrites the fragment storage with zeros
-  - <b>Element & operator[](int i)</b> - by-reference access of the ith element
-  - <b>Element const & operator[](int i) const</b> - const by-reference access of the ith element
-@}
-*/
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup fragment_iterator_concept Fragment Iterator Concept
-@{
-
-\ref fragment_iterator_concept provides structured access to the elements within a fragment with an
-optional bitcast to the desired access type
-
-@par \ref fragment_iterator_concept
-   Types satisfying \ref fragment_iterator_concept define the following members
-  - <b>AccessType& operator[](int i)</b> - provides access to the ith element of the fragment
-  - <b>AccessType& at(int d, int h, int w, int c)</b> - applies \ref layout_concept to fragment and
-provides access to element at (d, h, w, c)
-
-@}
-*/
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <int kAlignment_>
-struct StorageType {
-  typedef uint64_t Type;
-};
-template <>
-struct StorageType<4> {
-  typedef uint32_t Type;
-};
-template <>
-struct StorageType<2> {
-  typedef uint16_t Type;
-};
-template <>
-struct StorageType<1> {
-  typedef uint8_t Type;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief A template defining \ref fragment_concept
-* @concept{fragment_concept}
-*/
-template <typename Element_, int kElements_, size_t kAlignment_ = 16>
-struct Fragment : public AlignedStruct<kAlignment_> {
-  /// Make sure the alignment makes sense wrt the size of elements.
-  static_assert(kAlignment_ == 16 || kAlignment_ >= sizeof(Element_), "Alignment is too small");
-  /// Alignment must be a power of two
-  static_assert(is_pow2<kAlignment_>::value, "Alignment must be a power of two");
-
-  /// This class.
-  typedef Fragment<Element_, kElements_, kAlignment_> This_;
-  /// The element.
-  typedef Element_ Element;
-  /// The number of elements.
-  static int const kElements = kElements_;
-
-  /// Clear a fragment.
-  CUTLASS_DEVICE void clear() {
-    // Avoid element-wise access for sub 32b element type
-    if (kAlignment_ >= 8 && (kElements * sizeof(Element)) % 8 == 0) {
-      uint64_t* ptr = reinterpret_cast<uint64_t*>(storage);
-      for (int i = 0; i < (kElements * sizeof(Element)) / 8; ++i) {
-        ptr[i] = uint64_t(0);
-      }
-    } else if (kAlignment_ >= 4 && (kElements * sizeof(Element)) % 4 == 0) {
-      uint32_t* ptr = reinterpret_cast<uint32_t*>(storage);
-      for (int i = 0; i < (kElements * sizeof(Element)) / 4; ++i) {
-        ptr[i] = uint32_t(0);
-      }
-    } else if (kAlignment_ >= 2 && (kElements * sizeof(Element)) % 2 == 0) {
-      uint16_t* ptr = reinterpret_cast<uint16_t*>(storage);
-      for (int i = 0; i < (kElements * sizeof(Element)) / 2; ++i) {
-        ptr[i] = uint16_t(0);
-      }
-    } else {
-      for (int i = 0; i < kElements; ++i) {
-        storage[i] = 0;
-      }
-    }
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE Element& operator[](int i) {
-    assert(i < kElements_);
-    return reinterpret_cast<Element*>(storage)[i];
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE Element const& operator[](int i) const {
-    assert(i < kElements_);
-    return reinterpret_cast<Element const*>(storage)[i];
-  }
-
- private:
-  /// Storage type to use for Elements
-  typedef typename StorageType<kAlignment_>::Type StorageType;
-
-  /// Number of elements in the storage
-  static int const kStorageCount =
-      (sizeof(Element_) * kElements_ + sizeof(StorageType) - 1) / sizeof(StorageType);
-  /// The storage.
-  StorageType storage[kStorageCount];
-
-  /// Ensure that there's enough storage for all elements
-  static_assert(sizeof(StorageType) <= kAlignment_, "StorageType is too big for given alignment");
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief A template defining \ref fragment_iterator_concept
-* @concept{fragment_iterator_concept}
-*/
-template <typename Fragment_, typename Iterations_, typename AccessType_>
-struct FragmentIterator {
-  /// This class.
-  typedef FragmentIterator<Fragment_, Iterations_, AccessType_> This_;
-  /// The fragment.
-  typedef Fragment_ Fragment;
-  /// The number of iterations.
-  typedef Iterations_ Iterations;
-  /// The access type.
-  typedef AccessType_ AccessType;
-
-  /// The element.
-  typedef typename Fragment::Element Element;
-  /// The number of elements per access.
-  static int const kElementsPerAccess = (int)(sizeof(AccessType) / sizeof(Element));
-  /// The shape of the fragment.
-  typedef typename ShapeMul<Iterations, Shape<1, 1, 1, kElementsPerAccess> >::Shape FragmentShape;
-  /// The linear strides for iterations.
-  typedef typename ShapeStrides<FragmentShape>::Shape Strides;
-
-  /// Ctor.
-  template <typename OtherFragment_>
-  CUTLASS_DEVICE FragmentIterator(OtherFragment_& fragment, int offset = 0)
-      : pointer(reinterpret_cast<Element*>(&fragment[offset])) {
-    static_assert(OtherFragment_::kElements >= Fragment::kElements, "");
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType const& at(int d, int h, int w, int c = 0) const {
-    int const imm = ComputeOffsetFromStrides<Strides>::get(d, h, w, c);
-    return reinterpret_cast<AccessType const&>(pointer[imm]);
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType& at(int d, int h, int w, int c = 0) {
-    int const imm = ComputeOffsetFromStrides<Strides>::get(d, h, w, c);
-    return reinterpret_cast<AccessType&>(pointer[imm]);
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType const& operator[](int i) const {
-    return reinterpret_cast<AccessType const&>(pointer[i * kElementsPerAccess]);
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType& operator[](int i) {
-    return reinterpret_cast<AccessType&>(pointer[i * kElementsPerAccess]);
-  }
-
-  /// Is the iterator valid?
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
-
-  /// The pointer.
-  Element* pointer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Fragment_, typename Iterations_, typename AccessType_>
-struct FragmentConstIterator {
-  /// This class.
-  typedef FragmentConstIterator<Fragment_, Iterations_, AccessType_> This_;
-  /// The fragment.
-  typedef Fragment_ Fragment;
-  /// The number of iterations.
-  typedef Iterations_ Iterations;
-  /// The access type.
-  typedef AccessType_ AccessType;
-
-  /// The element.
-  typedef typename Fragment::Element Element;
-  /// The number of elements per access.
-  static int const kElementsPerAccess = (int)(sizeof(AccessType) / sizeof(Element));
-  /// The shape of the fragment.
-  typedef typename ShapeMul<Iterations, Shape<1, 1, 1, kElementsPerAccess> >::Shape FragmentShape;
-  /// The linear strides for iterations.
-  typedef typename ShapeStrides<FragmentShape>::Shape IterationsStrides;
-
-  /// Ctor.
-  template <typename OtherFragment_>
-  CUTLASS_DEVICE FragmentConstIterator(OtherFragment_& fragment, int offset = 0)
-      : pointer(reinterpret_cast<Element const*>(&fragment[offset])) {
-    static_assert(OtherFragment_::kElements >= Fragment::kElements, "");
-  }
-  /// Create from non-constant FragmentIterator
-  CUTLASS_DEVICE FragmentConstIterator(
-      FragmentIterator<Fragment_, Iterations_, AccessType_> const& rhs_)
-      : pointer(reinterpret_cast<Element const*>(rhs_.pointer)) {}
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType const& at(int d, int h, int w, int c = 0) const {
-    int const imm = ComputeOffsetFromStrides<IterationsStrides>::get(d, h, w, c);
-    return reinterpret_cast<AccessType const&>(pointer[imm]);
-  }
-
-  /// The accessor.
-  CUTLASS_DEVICE AccessType const& operator[](int i) const {
-    return reinterpret_cast<AccessType const&>(pointer[i * kElementsPerAccess]);
-  }
-
-  /// Is the iterator valid?
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
-
-  /// The pointer.
-  Element const* pointer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/fragment_load_store.h
+++ b/cutlass/fragment_load_store.h
@ -1,135 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines accessors for loading and storing fragments to memory efficiently.
-*/
-#pragma once
-
-#include <cutlass/load_store.h>
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <IteratorFragment::Kind kIteratorFragment,
-          int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentLoad {};
-
-template <int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentLoad<IteratorFragment::kWmmaMatrix,
-                    kAccessSize,
-                    Scalar_,
-                    Memory_,
-                    FragmentElement_,
-                    kStride> {
-  /// The output type.
-  typedef FragmentElement_ AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& value, Scalar_ const* pointer, int offset) {
-    value.load(&pointer[offset], kStride);
-  }
-};
-
-template <int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentLoad<IteratorFragment::kScalar,
-                    kAccessSize,
-                    Scalar_,
-                    Memory_,
-                    FragmentElement_,
-                    kStride> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, kAccessSize>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& value, Scalar_ const* pointer, int offset) {
-    Load<Scalar_, kAccessSize, Memory_>::load(value, pointer, offset);
-  }
-};
-
-template <IteratorFragment::Kind kIteratorFragment,
-          int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentStore {};
-
-template <int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentStore<IteratorFragment::kWmmaMatrix,
-                     kAccessSize,
-                     Scalar_,
-                     Memory_,
-                     FragmentElement_,
-                     kStride> {
-  /// The input type.
-  typedef FragmentElement_ AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& value, Scalar_* pointer, int offset) {
-    value.store(&pointer[offset], kStride);
-  }
-};
-
-template <int kAccessSize,
-          typename Scalar_,
-          MemorySpace::Kind Memory_,
-          typename FragmentElement_,
-          int kStride>
-struct FragmentStore<IteratorFragment::kScalar,
-                     kAccessSize,
-                     Scalar_,
-                     Memory_,
-                     FragmentElement_,
-                     kStride> {
-  /// The input type.
-  typedef typename Vectorize<Scalar_, kAccessSize>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& value, Scalar_* pointer, int offset) {
-    Store<Scalar_, kAccessSize, Memory_>::store(value, pointer, offset);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/fragment_multiply_add.h
+++ b/cutlass/fragment_multiply_add.h
@ -1,131 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines multiply-add operations on fragments within a thread.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_>
-struct FragmentMultiplyAdd {
-  /// The shape of the instruction.
-  typedef Shape<1, 1, 1, 1> InstructionShape;
-  /// The type for A.
-  typedef Scalar_ ScalarA;
-  /// The type for B.
-  typedef Scalar_ ScalarB;
-  /// The type for C and D.
-  typedef Scalar_ ScalarC;
-
-  /// Ctor.
-  CUTLASS_DEVICE FragmentMultiplyAdd() {}
-
-  /// Multiply : d = a*b.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void multiply(Scalar_ a, Fragment_ const& b, Fragment_& d) {
-    for (int j = 0; j < Fragment_::kElements; ++j) {
-      d[j] = a * b[j];
-    }
-  }
-
-  /// Multiply-add : d = a*b + c.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void multiply_add(Scalar_ a,
-                                   Fragment_ const& b,
-                                   Fragment_ const& c,
-                                   Fragment_& d) {
-    for (int j = 0; j < Fragment_::kElements; ++j) {
-      d[j] = a * b[j] + c[j];
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
-template <>
-struct FragmentMultiplyAdd<half> {
-  /// The shape of the instruction.
-  typedef Shape<1, 1, 1, 1> InstructionShape;
-  /// The type for A.
-  typedef half ScalarA;
-  /// The type for B.
-  typedef half ScalarB;
-  /// The type for C and D.
-  typedef half ScalarC;
-
-  /// Ctor.
-  CUTLASS_DEVICE FragmentMultiplyAdd() {}
-
-  /// Multiply : d = a*b.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void multiply(half a, Fragment_ const& b, Fragment_& d) {
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
-    // The input.
-    __half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
-    // The output.
-    __half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
-
-    // Assemble a half2 from a.
-    __half2 const a_half2 = __half2half2(a);
-
-    for (int i = 0; i < Fragment_::kElements / 2; ++i) {
-      d_half2[i] = __hmul2(a_half2, b_half2[i]);
-    }
-#endif
-  }
-
-  /// Multiply-add : d = a*b + c.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void multiply_add(half a, Fragment_ const& b, Fragment_ const& c, Fragment_& d) {
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
-    // The inputs.
-    __half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
-    __half2 const* c_half2 = reinterpret_cast<__half2 const*>(&c[0]);
-    // The output.
-    __half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
-
-    // Assemble a half2 from a.
-    __half2 const a_half2 = __half2half2(a);
-
-    for (int i = 0; i < Fragment_::kElements / 2; ++i) {
-      d_half2[i] = __hfma2(a_half2, b_half2[i], c_half2[i]);
-    }
-#endif
-  }
-};
-
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
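Reviewer note: the removed `FragmentMultiplyAdd::multiply_add` computes `d[j] = a * b[j] + c[j]` over every element of a fragment. The host-only sketch below restates that contract so the replacement code can be checked against it. `ToyFragment` and `multiply_add_reference` are hypothetical names introduced for illustration; they are not CUTLASS types.

```cpp
#include <cassert>

// Hypothetical stand-in for a CUTLASS fragment: a fixed-size array that
// exposes its element count as kElements, as the removed code expects.
template <typename T, int N>
struct ToyFragment {
  static constexpr int kElements = N;
  T storage[N];

  T& operator[](int i) {
    assert(i >= 0 && i < N);
    return storage[i];
  }
  T const& operator[](int i) const {
    assert(i >= 0 && i < N);
    return storage[i];
  }
};

// Scalar reference for the generic path: d = a * b + c, element by element.
template <typename Scalar, typename Fragment>
void multiply_add_reference(Scalar a, Fragment const& b, Fragment const& c, Fragment& d) {
  for (int j = 0; j < Fragment::kElements; ++j) {
    d[j] = a * b[j] + c[j];
  }
}
```

The removed `half` specialization walks the fragment in `__half2` pairs (`kElements / 2` iterations), so it appears to assume an even element count; the scalar reference above has no such restriction.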
--- a/cutlass/gemm/clear_accumulators.h
+++ b/cutlass/gemm/clear_accumulators.h
@ -1,55 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines abstractions for efficiently clearing accumulator tiles.
-*/
-#pragma once
-
-#include <cutlass/vector.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int kLanes_ = 1>
-struct ClearAccumulators {
-  /// The shared storage.
-  struct SharedStorage {};
-
-  /// Ctor.
-  CUTLASS_DEVICE ClearAccumulators(SharedStorage& shared_storage) {}
-
-  /// Clear the fragment.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void clear(Fragment_& fragment) {
-    fragment.clear();
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/dgemm_traits.h
+++ b/cutlass/gemm/dgemm_traits.h
@ -1,127 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural traits of double-precision GEMM.
-*/
-#pragma once
-
-#include <cutlass/gemm/gemm.h>
-#include <cutlass/gemm/gemm_epilogue.h>
-#include <cutlass/gemm/gemm_epilogue_traits.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/gemm/gemm_shared_tile.h>
-#include <cutlass/gemm/gemm_traits.h>
-#include <cutlass/gemm/thread_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_ = 1,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_ = 1>
-struct DgemmConfig
-    : public GemmConfig<
-          /// The scalar type for A.
-          double,
-          /// The scalar type for B.
-          double,
-          /// The scalar type for C.
-          double,
-          /// The scalar type for D.
-          double,
-          /// The tile size for the GEMM KxNxM.
-          OutputTile_,
-          /// The functor to do the math in the main loop.
-          ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, double, double, double>,
-          /// The number of scalars per LDG for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per STS for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per LDS for A.
-          2,
-          /// The number of scalars per LDG for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per STS for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per LDS for B.
-          2,
-          /// The number of scalars per LDG for C and STG for D.
-          1,
-          /// The number of scalars per STS for D.
-          2,
-          /// The number of scalars per LDS for D.
-          1,
-          /// The number of stages in shared memory.
-          2> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_ = Shape<8, 64, 128>,
-    /// The functor to use in the epilogue.
-    typename EpilogueFunctor_ = LinearScaling<double>,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<8, 8, 8>,
-    /// The number of doubles loaded in one LDG for A.
-    int kScalarsPerLdgA_ = 1,
-    /// The number of doubles loaded in one LDG for B.
-    int kScalarsPerLdgB_ = 1,
-    /// The index.
-    typename Index_ = int,
-    /// The DGEMM config.
-    typename GemmConfig_ =
-        DgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>,
-    /// The traits class for the epilogue.
-    typename GemmEpilogueTraits_ =
-        SimplifiedGemmEpilogueTraits<GemmConfig_, EpilogueFunctor_, Index_> >
-struct DgemmTraits : public SimplifiedGemmTraits<
-                         // The layout for A.
-                         kLayoutA_,
-                         // The layout for B.
-                         kLayoutB_,
-                         // The config.
-                         GemmConfig_,
-                         // The epilogue.
-                         GemmEpilogue<GemmEpilogueTraits_>,
-                         // The index.
-                         Index_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/gemm.h
+++ b/cutlass/gemm/gemm.h
@ -1,319 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements a software-pipelined efficient GEMM.
-*/
-#pragma once
-
-#if !defined(__CUDACC_RTC__)
-#include <cuda.h>
-#endif
-
-#include <cutlass/coord.h>
-#include <cutlass/util/platform.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Gemm_>
-__global__ void gemm_kernel(typename Gemm_::Params params) {
-  // Declare shared memory.
-  __shared__ typename Gemm_::SharedStorage shared_storage;
-
-  // Construct the GEMM object.
-  Gemm_ gemm(params, shared_storage);
-  // Run GEMM.
-  gemm.multiply_add();
-}
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Index_ = int>
-struct GemmDesc {
-  /// The dimensions of the GEMM.
-  Index_ m, n, k;
-  /// The alpha/beta scaling values.
-  Scalar_ alpha, beta;
-  /// The source matrix A.
-  void const* d_a;
-  /// The stride for A.
-  Index_ lda;
-  /// The source matrix B.
-  void const* d_b;
-  /// The stride for B.
-  Index_ ldb;
-  /// The source matrix C.
-  void const* d_c;
-  /// The stride for C.
-  Index_ ldc;
-  /// The destination matrix D.
-  void* d_d;
-  /// The stride for D.
-  Index_ ldd;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmTraits_>
-struct Gemm {
-  /// This class.
-  typedef Gemm<GemmTraits_> This_;
-  /// The traits.
-  typedef GemmTraits_ Traits;
-  /// The shared storage.
-  typedef typename Traits::SharedStorage SharedStorage;
-
-  /// The scalar for A.
-  typedef typename Traits::ScalarA ScalarA;
-  /// The scalar for B.
-  typedef typename Traits::ScalarB ScalarB;
-  /// The scalar in the epilogue.
-  typedef typename Traits::Epilogue::Scalar ScalarEpilogue;
-  /// The scalar for C.
-  typedef typename Traits::Epilogue::ScalarC ScalarC;
-  /// The scalar for D.
-  typedef typename Traits::Epilogue::ScalarD ScalarD;
-  /// The index.
-  typedef typename Traits::Index Index;
-
-  /// The number of threads.
-  static int const kThreads = Traits::GemmConfig::kThreads;
-
-  /// The params.
-  struct Params : public Traits::Params {
-    CUTLASS_HOST_DEVICE int initialize(Index m,
-                                       Index n,
-                                       Index k,
-                                       ScalarEpilogue alpha,
-                                       ScalarA const* d_a,
-                                       Index lda,
-                                       ScalarB const* d_b,
-                                       Index ldb,
-                                       ScalarEpilogue beta,
-                                       ScalarC const* d_c,
-                                       Index ldc,
-                                       ScalarD* d_d,
-                                       Index ldd) {
-      GemmDesc<ScalarEpilogue, Index> desc;
-      desc.m = m;
-      desc.n = n;
-      desc.k = k;
-      desc.alpha = alpha;
-      desc.beta = beta;
-      desc.d_a = reinterpret_cast<void const*>(d_a);
-      desc.lda = lda;
-      desc.d_b = reinterpret_cast<void const*>(d_b);
-      desc.ldb = ldb;
-      desc.d_c = reinterpret_cast<void const*>(d_c);
-      desc.ldc = ldc;
-      desc.d_d = reinterpret_cast<void*>(d_d);
-      desc.ldd = ldd;
-      return Traits::Params::initialize(desc);
-    }
-  };
-
-#if !defined(__CUDACC_RTC__)
-  /// Launch the kernel.
-  static __host__ cudaError_t launch(Params const& params,
-                                     cudaStream_t stream = cudaStreamDefault) {
-    // Setup the grid.
-    dim3 grid;
-    grid.x = (params.m + Traits::OutputTile::kW - 1) / Traits::OutputTile::kW;
-    grid.y = (params.n + Traits::OutputTile::kH - 1) / Traits::OutputTile::kH;
-
-    // The number of threads.
-    dim3 block;
-    block.x = kThreads;
-
-    // Launch the kernel.
-    void const* params_ = reinterpret_cast<void const*>(&params);
-
-    return cudaLaunchKernel(reinterpret_cast<void*>(&gemm_kernel<This_>),
-                            grid,
-                            block,
-                            const_cast<void**>(&params_),
-                            0,
-                            stream);
-  }
-
-  /// Launch the kernel through the CUDA driver API, given an already-loaded CUfunction.
-  static __host__ cudaError_t launch(CUfunction kernel,
-                                     Params const& params,
-                                     CUstream stream = CU_STREAM_LEGACY) {
-    // Setup the grid.
-    dim3 grid;
-    grid.x = (params.m + Traits::OutputTile::kW - 1) / Traits::OutputTile::kW;
-    grid.y = (params.n + Traits::OutputTile::kH - 1) / Traits::OutputTile::kH;
-
-    // The number of threads.
-    dim3 block;
-    block.x = kThreads;
-
-    // Launch the kernel.
-    void* params_[] = {const_cast<void*>(reinterpret_cast<void const*>(&params))};
-
-    // return cudaLaunchKernel(reinterpret_cast<void*>(&gemm_kernel<This_>), grid, block,
-    //  const_cast<void**>(&params_), 0, stream);
-    CUresult result = cuLaunchKernel(
-        kernel, grid.x, grid.y, grid.z, block.x, block.y, block.z, 0, stream, params_, 0);
-
-    if (result != CUDA_SUCCESS) {
-      return cudaErrorLaunchFailure;
-    }
-    return cudaSuccess;
-  }
-
-#endif
-
-  /// Ctor.
-  CUTLASS_DEVICE Gemm(Params const& params_, SharedStorage& shared_storage_)
-      : params(params_), shared_storage(shared_storage_) {}
-
-  /// Do the GEMM.
-  CUTLASS_DEVICE void multiply_add() {
-    // Swizzle the IDs of the block (to enable better cache behavior).
-    typename Traits::BlockSwizzle block_swizzle;
-    dim3 block = block_swizzle.swizzle();
-
-    // Scale the id.
-    block.x *= Traits::OutputTile::kW;
-    block.y *= Traits::OutputTile::kH;
-
-    // We may want to use shared memory to clear the registers.
-    typedef typename Traits::ClearAccumulators ClearAccumulators;
-
-    // The streams to read A/B from global memory to shared memory.
-    typename Traits::GlobalLoadStream global_stream(params, shared_storage, block);
-
-    // Create the accumulator clear.
-    ClearAccumulators clear(shared_storage.main_loop.clear);
-
-    /// Define the mainloop iteration size
-    typedef typename Traits::MultiplyAdd MultiplyAdd;
-
-    // By how much we unroll the main loop.
-    Index const kUnroll = static_cast<Index>(MultiplyAdd::AccumulatorsPerWarp::kD);
-
-    // If we do not have enough steps in the main loop, trigger the residue code.
-    if (params.k < kUnroll) {
-      global_stream.residue(params.k, true);
-    }
-
-    // Fetch the fragments for A and B from global memory.
-    global_stream.copy();
-
-    // Copy the elements to shared memory (after transformation if needed).
-    global_stream.commit();
-
-    // Make sure the data is in shared memory.
-    Traits::shared_store_fence(false);
-
-    // The unrolling steps for the main loop.
-    int const kUnrollingSteps =
-        MultiplyAdd::AccumulatorsPerWarp::kD / MultiplyAdd::InstructionShape::kD;
-
-    // Make sure we have at least 2 unrolling steps or our pipelining is not going to work.
-    static_assert(kUnrollingSteps >= 2, "The pipelining assumes at least two steps");
-
-    // The stream of data from shared memory to fragments.
-    typename Traits::SharedLoadStream shared_load_stream(params, shared_storage);
-
-    // Trigger the copy from shared memory for the 1st stream.
-    shared_load_stream.copy(0);
-
-    // Allocate the accumulators.
-    typename MultiplyAdd::Accumulators accumulators;
-    // Clear the accumulators.
-    clear.clear(accumulators);
-
-    // Enter the main loop and iterate.
-    typedef typename Traits::Index Index;
-    for (Index outer_k = params.k - kUnroll; outer_k > -kUnroll; outer_k -= kUnroll) {
-      // If that's the last "load iteration" update the predicates.
-      int const is_residue = outer_k <= kUnroll;
-      if (is_residue) {
-        global_stream.residue(outer_k);
-      }
-
-      // Load data for the next iteration of the main loop.
-      global_stream.copy();
-
-      CUTLASS_PRAGMA_UNROLL
-      for (int step = 0; step < kUnrollingSteps - 1; ++step) {
-        // Trigger the copy from shared memory for the next A/B values.
-        shared_load_stream.copy(step + 1);
-        // Make sure the values are available for the current iteration to do the multiply-add.
-        shared_load_stream.commit(step);
-
-        // Do the math on the fragments of the current iteration.
-        MultiplyAdd multiply_add;
-        multiply_add.multiply_add(shared_load_stream.fragment_a(step),
-                                  shared_load_stream.fragment_b(step),
-                                  accumulators,
-                                  accumulators);
-      }
-
-      // Make sure the data from shared memory has been entirely consumed.
-      Traits::shared_load_fence(true);
-
-      // Commit the data in shared memory for A/B.
-      global_stream.commit();
-
-      // Make sure the data is in shared memory.
-      Traits::shared_store_fence(true);
-
-      // Move to the next stage for the load (if it makes sense).
-      shared_load_stream.inc_stage();
-      // Trigger the copy from shared memory for the next loop iteration.
-      shared_load_stream.copy(0);
-      // Make sure the values are available for the current iteration to do the multiply-add.
-      shared_load_stream.commit(kUnrollingSteps - 1);
-
-      // Do the math on the fragments of the current iteration.
-      MultiplyAdd multiply_add;
-      multiply_add.multiply_add(shared_load_stream.fragment_a(kUnrollingSteps - 1),
-                                shared_load_stream.fragment_b(kUnrollingSteps - 1),
-                                accumulators,
-                                accumulators);
-    }
-
-    // Epilogue.
-    typedef typename Traits::Epilogue Epilogue;
-    Epilogue epilogue(params.epilogue, shared_storage.epilogue, params.m, params.n);
-    epilogue.epilogue(cutlass::make_Coord(0, block.y, block.x), accumulators);
-  }
-
-  /// The params.
-  Params const& params;
-  /// The shared storage.
-  SharedStorage& shared_storage;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
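Reviewer note: two pieces of integer arithmetic in the removed `Gemm` are easy to misread. `launch` sizes the grid with a round-up division of `m` and `n` by the output tile, and the main loop `for (outer_k = k - kUnroll; outer_k > -kUnroll; outer_k -= kUnroll)` runs once per `kUnroll`-wide slice of K, including a partial last slice. The host-only sketch below replays both; `ceil_div`, `GridDims`, `grid_for`, and `main_loop_trips` are hypothetical helper names, not part of CUTLASS.

```cpp
#include <cassert>

// Round-up integer division, as used for grid.x and grid.y in launch().
inline int ceil_div(int x, int tile) {
  assert(tile > 0);
  return (x + tile - 1) / tile;
}

// Hypothetical plain struct standing in for the x/y fields of dim3.
struct GridDims {
  int x;
  int y;
};

// Number of thread blocks needed to cover an m-by-n output with
// tile_w-by-tile_h tiles (tile_w plays the role of OutputTile::kW,
// tile_h the role of OutputTile::kH).
inline GridDims grid_for(int m, int n, int tile_w, int tile_h) {
  GridDims g;
  g.x = ceil_div(m, tile_w);
  g.y = ceil_div(n, tile_h);
  return g;
}

// Replays the bounds of the removed main loop and counts its iterations.
inline int main_loop_trips(int k, int unroll) {
  assert(unroll > 0);
  int trips = 0;
  for (int outer_k = k - unroll; outer_k > -unroll; outer_k -= unroll) {
    ++trips;
  }
  return trips;
}
```

For non-negative `k`, the trip count matches `ceil_div(k, unroll)`, which is why the residue predicates have to be updated on the final pass when `k` is not a multiple of the unroll factor.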
--- a/cutlass/gemm/gemm_epilogue.h
+++ b/cutlass/gemm/gemm_epilogue.h
@ -1,225 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements the epilogue phase of the GEMM kernel that efficiently updates global memory
-   with
-      the computed matrix product.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/coord.h>
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename T>
-CUTLASS_DEVICE bool is_zero(T x) {
-  return x == T(0);
-}
-
-#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
-CUTLASS_DEVICE bool is_zero(half x) { return reinterpret_cast<int16_t&>(x) == int16_t(0); }
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmEpilogueTraits_>
-struct GemmEpilogue {
-  /// The traits class.
-  typedef GemmEpilogueTraits_ Traits;
-  /// The params.
-  typedef typename Traits::Params Params;
-  /// The shared storage.
-  typedef typename Traits::SharedStorage SharedStorage;
-
-  /// The output tile.
-  typedef typename Traits::OutputTile OutputTile;
-  /// The number of iterations.
-  typedef typename Traits::Iterations Iterations;
-  /// The accumulators.
-  typedef typename Traits::Accumulators Accumulators;
-  /// The scalar.
-  typedef typename Traits::Scalar Scalar;
-  /// The functor in charge of the math.
-  typedef typename Traits::Functor Functor;
-
-  /// We do not support 3D or 4D shapes.
-  static_assert(Iterations::kD == 1 && Iterations::kC == 1, "Unsupported 3D/4D shapes");
-
-  /// The iterator for C in global memory.
-  typedef typename Traits::GlobalLoadIteratorC GlobalLoadIteratorC;
-  /// The transformer for C.
-  typedef typename Traits::GlobalTransformerC GlobalTransformerC;
-  /// The transformer for D.
-  typedef typename Traits::GlobalTransformerD GlobalTransformerD;
-  /// The iterator for D in global memory.
-  typedef typename Traits::GlobalStoreIteratorD GlobalStoreIteratorD;
-  /// The iterator to store D in shared memory.
-  typedef typename Traits::SharedStoreIteratorD SharedStoreIteratorD;
-  /// The shared store transformer for D.
-  typedef typename Traits::SharedStoreTransformerD SharedStoreTransformerD;
-  /// The iterator to load D in shared memory.
-  typedef typename Traits::SharedLoadIteratorD SharedLoadIteratorD;
-  /// The shared load transformer for D.
-  typedef Copy<typename SharedLoadIteratorD::Fragment> SharedLoadTransformerD;
-
-  /// The index.
-  typedef typename Traits::Index Index;
-
-  /// The scalar for C.
-  typedef typename GlobalLoadIteratorC::Scalar ScalarC;
-  /// The scalar for D.
-  typedef typename GlobalStoreIteratorD::Scalar ScalarD;
-
-  /// Ctor.
-  CUTLASS_DEVICE GemmEpilogue(Params const& params_,
-                              SharedStorage& shared_storage_,
-                              Index m_,
-                              Index n_)
-      : params(params_), shared_storage(shared_storage_), m(m_), n(n_) {}
-
-  /// Execute the epilogue.
-  CUTLASS_DEVICE void epilogue(Coord<3> const& block, Accumulators& accumulators) {
-    if (is_zero(params.functor.beta)) {
-      epilogue_with_or_without_beta<true>(block, accumulators);
-    } else {
-      epilogue_with_or_without_beta<false>(block, accumulators);
-    }
-  }
-
-  template <bool kBetaIsZero_>
-  CUTLASS_DEVICE void epilogue_with_or_without_beta(Coord<3> const& block,
-                                                    Accumulators& accumulators) {
-
-    Coord<3> const bounds = cutlass::make_Coord(0, n, m);
-
-    // The functor.
-    Functor functor(params.functor);
-    // The C fragment.
-    typename GlobalLoadIteratorC::Fragment fragment_c;
-    // The transformed C fragment.
-    typename GlobalTransformerC::OutputFragment transformed_c;
-
-    CUTLASS_PRAGMA_UNROLL
-    for (int h = 0; h < Iterations::kH; ++h) {
-      // Compute pointer and predicate offsets for C and D global iterators.
-      int const pointer_offset =
-          ((params.iterator_d.inc_h * (GlobalStoreIteratorD::Iterations::kH - 1) +
-            params.iterator_d.inc_advance) *
-               Iterations::kW +
-           params.stride_h) *
-          h;
-      int const predicate_offset =
-          ((params.iterator_d.predicate_inc_h * (GlobalStoreIteratorD::Iterations::kH - 1) +
-            params.iterator_d.predicate_inc_advance) *
-               Iterations::kW +
-           Traits::Delta::kH) *
-          h;
-
-      // The iterator to load the elements of the C matrix.
-      GlobalLoadIteratorC global_load_iterator(
-          params.iterator_c, bounds, block, pointer_offset, predicate_offset);
-      // The transformer for C.
-      GlobalTransformerC transformer_c;
-      // The transformer for D.
-      GlobalTransformerD transformer_d;
-      // The iterator to store into the D matrix.
-      GlobalStoreIteratorD global_store_iterator(
-          params.iterator_d, bounds, block, pointer_offset, predicate_offset);
-
-      CUTLASS_PRAGMA_UNROLL
-      for (int w = 0; w < Iterations::kW; ++w) {
-        // Load the C matrix into fragment.
-        if (!kBetaIsZero_) {
-          iterator_load(global_load_iterator, fragment_c);
-        }
-
-        // Make sure we can write to shared memory.
-        shared_load_fence();
-
-        // Copy the accumulators to shared memory.
-        int const offset = (h * Iterations::kW + w) * SharedStoreIteratorD::Fragment::kElements;
-
-        SharedStoreTransformerD shared_store_transformer;
-        typename SharedStoreTransformerD::OutputFragment shared_store_transformed_d;
-        shared_store_transformer.transform(accumulators, offset, shared_store_transformed_d);
-
-        SharedStoreIteratorD shared_store_iterator(params.shared_store_iterator_d,
-                                                   shared_storage.shared_stream.store);
-        shared_iterator_store(shared_store_iterator, shared_store_transformed_d);
-
-        // Make sure the data is in shared memory.
-        shared_store_fence();
-
-        // Copy the accumulators back to registers from shared memory.
-        SharedLoadIteratorD shared_load_iterator(params.shared_load_iterator_d,
-                                                 shared_storage.shared_stream.load);
-        typename SharedLoadIteratorD::Fragment fetched_d;
-        shared_iterator_load(shared_load_iterator, fetched_d);
-
-        // Do the math.
-        typename GlobalTransformerD::InputFragment fragment_d;
-
-        if (kBetaIsZero_) {
-          functor.evaluate(fetched_d, fragment_d);
-        } else {
-          // Transform C fragment.
-          transformer_c.transform(fragment_c, transformed_c);
-          // Do the math.
-          functor.evaluate(fetched_d, transformed_c, fragment_d);
-        }
-
-        // Transform D fragment.
-        typename GlobalTransformerD::OutputFragment transformed_d;
-        transformer_d.transform(fragment_d, transformed_d);
-
-        // Copy the results to global memory.
-        iterator_store(global_store_iterator, transformed_d);
-      }
-    }
-  }
-
-  /// The memory fence for shared loads.
-  CUTLASS_DEVICE void shared_load_fence() { __syncthreads(); }
-
-  /// The memory fence for shared stores.
-  CUTLASS_DEVICE void shared_store_fence() { __syncthreads(); }
-
-  /// The params.
-  Params const& params;
-  /// The shared storage.
-  SharedStorage& shared_storage;
-  /// The dimensions of the GEMM.
-  Index m, n;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
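Reviewer note on the epilogue removed above: the math step takes one of two paths depending on `kBetaIsZero_`. The host-side sketch below illustrates that branch only. `LinearScalingSketch` and `epilogue_math` are hypothetical names, and the formula `D = alpha * accum + beta * C` is an assumption about what the functor computes, since the functor itself is not part of this hunk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the epilogue functor; this is not the CUTLASS class.
struct LinearScalingSketch {
  float alpha;
  float beta;

  // beta == 0 path: D = alpha * accum. C is never read.
  template <std::size_t N>
  void evaluate(std::array<float, N> const& accum, std::array<float, N>& d) const {
    for (std::size_t i = 0; i < N; ++i) {
      d[i] = alpha * accum[i];
    }
  }

  // General path (assumed formula): D = alpha * accum + beta * C.
  template <std::size_t N>
  void evaluate(std::array<float, N> const& accum,
                std::array<float, N> const& c,
                std::array<float, N>& d) const {
    for (std::size_t i = 0; i < N; ++i) {
      d[i] = alpha * accum[i] + beta * c[i];
    }
  }
};

// Mirrors the branch in the removed epilogue: choose the overload by beta_is_zero.
template <std::size_t N>
std::array<float, N> epilogue_math(LinearScalingSketch const& functor,
                                   bool beta_is_zero,
                                   std::array<float, N> const& accum,
                                   std::array<float, N> const& c) {
  std::array<float, N> d{};
  if (beta_is_zero) {
    functor.evaluate(accum, d);
  } else {
    functor.evaluate(accum, c, d);
  }
  return d;
}
```

In the removed code the same flag also guards the global load of C, so the `beta == 0` path skips both the load and the transform of the C fragment.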
--- a/cutlass/gemm/gemm_epilogue_traits.h
+++ b/cutlass/gemm/gemm_epilogue_traits.h
@ -1,331 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of the GEMM epilogue.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/coord.h>
-#include <cutlass/gemm/gemm_global_stream.h>
-#include <cutlass/gemm/gemm_shared_stream.h>
-#include <cutlass/gemm/linear_scaling.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The output tile.
-    typename OutputTile_,
-    /// The accumulators.
-    typename Accumulators_,
-    /// The iterator to load C from global memory.
-    typename GlobalLoadIteratorC_,
-    /// The transformer for C.
-    typename GlobalTransformerC_,
-    /// The transformer for D.
-    typename GlobalTransformerD_,
-    /// The iterator to store D to global memory.
-    typename GlobalStoreIteratorD_,
-    /// The iterator to store D to shared memory.
-    typename SharedStoreIteratorD_,
-    /// The shared store transformer for D.
-    typename SharedStoreTransformerD_,
-    /// The iterator to load D from shared memory.
-    typename SharedLoadIteratorD_,
-    /// The number of iterations in the epilogue.
-    typename Iterations_,
-    /// The iterations strides.
-    typename Delta_,
-    /// The functor to be used in the epilogue.
-    typename Functor_,
-    /// The index.
-    typename Index_ = int>
-struct GemmEpilogueTraits {
-  //
-  /// The output tile.
-  typedef OutputTile_ OutputTile;
-  /// The number of iterations.
-  /// The accumulators.
-  typedef Accumulators_ Accumulators;
-  /// The iterator for C in global memory.
-  typedef GlobalLoadIteratorC_ GlobalLoadIteratorC;
-  /// The transformer for C.
-  typedef GlobalTransformerC_ GlobalTransformerC;
-  /// The transformer for D.
-  typedef GlobalTransformerD_ GlobalTransformerD;
-  /// The iterator for D in global memory.
-  typedef GlobalStoreIteratorD_ GlobalStoreIteratorD;
-  /// The iterator to store D in shared memory.
-  typedef SharedStoreIteratorD_ SharedStoreIteratorD;
-  /// The shared store transformer for D.
-  typedef SharedStoreTransformerD_ SharedStoreTransformerD;
-  /// The iterator to load D from shared memory.
-  typedef SharedLoadIteratorD_ SharedLoadIteratorD;
-  /// The number of iterations in the epilogue.
-  typedef Iterations_ Iterations;
-  /// The iterations strides.
-  typedef Delta_ Delta;
-
-  /// The functor in charge of the math.
-  typedef Functor_ Functor;
-  /// The index.
-  typedef Index_ Index;
-
-  /// We do not support 3D or 4D shapes.
-  static_assert(Iterations::kD == 1 && Iterations::kC == 1, "Unsupported 3D/4D shapes");
-
-  /// The scalar.
-  typedef typename Functor::Scalar Scalar;
-  /// The scalar for C.
-  typedef typename GlobalLoadIteratorC::Scalar ScalarC;
-  /// The scalar for D.
-  typedef typename GlobalStoreIteratorD::Scalar ScalarD;
-
-  /// The params.
-  struct Params {
-    /// The strides for H and W in the different iterations of the epilogue.
-    Index stride_h, stride_w;
-    /// The params for the C iterator.
-    typename GlobalLoadIteratorC::Params iterator_c;
-    /// The params for the D global iterator.
-    typename GlobalStoreIteratorD::Params iterator_d;
-    /// The params for the D shared store iterator.
-    typename SharedStoreIteratorD::Params shared_store_iterator_d;
-    /// The params for the D shared load iterator.
-    typename SharedLoadIteratorD::Params shared_load_iterator_d;
-    /// The functor params.
-    typename Functor::Params functor;
-
-    /// Setup the params.
-    template <typename GemmDesc_>
-    CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
-      // The parameters for the functor.
-      int error_code = functor.initialize(desc);
-      if (error_code) {
-        return error_code;
-      }
-
-      // At the end of the H iteration, we jump over a number of columns.
-      this->stride_h = desc.ldd * Delta::kH;
-      // Nothing to do here.
-      this->stride_w = 0;
-
-      // Setup the params for the global memory iterator for C.
-      error_code = iterator_c.initialize(
-          reinterpret_cast<ScalarC const*>(desc.d_c), desc.ldc, desc.n, stride_w, Delta::kW);
-      if (error_code) {
-        return error_code;
-      }
-
-      // Setup the params for the global memory iterator for D.
-      return iterator_d.initialize(
-          reinterpret_cast<ScalarD*>(desc.d_d), desc.ldd, desc.n, stride_w, Delta::kW);
-    }
-  };
-
-  /// The shared memory storage to exchange data.
-  union StreamSharedStorage {
-    // The storage for the store iterator.
-    typename SharedStoreIteratorD::SharedStorage store;
-    // The storage for the load iterator.
-    typename SharedLoadIteratorD::SharedStorage load;
-  };
-
-  /// The shared memory to swizzle the data in the epilogue.
-  struct SharedStorage {
-    // The storage for the shared stream D.
-    StreamSharedStorage shared_stream;
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_, typename EpilogueFunctor_, typename Index_ = int>
-struct GemmEpilogueTraitsHelper {
-  /// The scalar.
-  typedef typename EpilogueFunctor_::Scalar Scalar;
-  /// The output tile.
-  typedef typename GemmConfig_::OutputTile OutputTile;
-
-  /// The number of iterations in the epilogue.
-  typedef Shape<1,
-                GemmConfig_::MultiplyAdd::AccumulatorsPerThread::kH /
-                    GemmConfig_::kAccumulatorsPerLdsB,
-                GemmConfig_::kAccumulatorsPerLdsB>
-      Iterations;
-  // The iteration strides in the H/W dimension.
-  typedef Shape<0,
-                GemmConfig_::kAccumulatorsPerLdsB*(
-                    GemmConfig_::Warps::kH* GemmConfig_::MultiplyAdd::ThreadsPerWarp::kH - 1),
-                0>
-      Delta;
-  /// The functor to do the math in the epilogue.
-  typedef EpilogueFunctor_ Functor;
-
-  /// The traits class to build the iterator to store to shared memory for D.
-  typedef GemmSharedStoreTileDTraits<
-      // The scalar type used by the epilogue functor.
-      typename Functor::Scalar,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The number of scalars per STS.
-      GemmConfig_::kScalarsPerStsD,
-      // The skew -- 128 / sizeof(ScalarD) / kScalarsPerStsD is the number of threads involved in
-      // a single STS. We divide by 2 as our objective is to add a skew to the odd threads to
-      // avoid bank conflicts between odd and even threads.
-      128 / sizeof(typename GemmConfig_::ScalarD) / GemmConfig_::kScalarsPerStsD / 2 *
-          GemmConfig_::kScalarsPerStsD>
-      SharedStoreTileTraits;
-
-  /// The iterator to store D to shared memory.
-  typedef TileStoreIterator<SharedStoreTileTraits,
-                            typename SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorD;
-
-  /// The shared store transformer for D.
-  typedef Copy<typename SharedStoreIteratorD::Fragment> SharedStoreTransformerD;
-
-  /// The traits class to build the iterator to load from shared memory for D.
-  typedef GemmSharedLoadTileDTraits<
-      // The scalar type used by the epilogue functor.
-      typename Functor::Scalar,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The number of columns of the output tile written per iteration.
-      GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsD,
-      // The skew.
-      SharedStoreTileTraits::kSkew>
-      SharedLoadTileTraits;
-
-  /// The iterator to load D from shared memory.
-  typedef TileLoadIterator<SharedLoadTileTraits,
-                           typename SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorD;
-
-  /// The traits class to build the iterator to load data from global memory for C^N.
-  typedef GemmGlobalTileCdTraits<
-      // The (const) scalar type of C.
-      typename GemmConfig_::ScalarC const,
-      // The tile has size (N / Iterations)xM in GEMM's terminology.
-      Shape<1,
-            GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
-            GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // How many elements do we jump over at each iteration?
-      Iterations::kW,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgC>
-      GlobalLoadTileTraits;
-
-  /// The iterator to load C.
-  typedef GemmGlobalIteratorCd<GlobalLoadTileTraits, Index_> GlobalLoadIteratorC;
-  /// The transformer for C.
-  typedef Copy<typename GlobalLoadIteratorC::Fragment> GlobalTransformerC;
-
-  /// The traits class to build the iterator to store data to global memory for D^N.
-  typedef GemmGlobalTileCdTraits<
-      // The scalar type of D.
-      typename GemmConfig_::ScalarD,
-      // The tile has size (N / Iterations)xM in GEMM's terminology.
-      Shape<1,
-            GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
-            GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // How many elements do we jump over at each iteration?
-      Iterations::kW,
-      // The number of scalars per STG (STG.32 or STG.128, etc).
-      GemmConfig_::kScalarsPerStgD>
-      GlobalStoreTileTraits;
-
-  /// The iterator to store D.
-  typedef GemmGlobalIteratorCd<GlobalStoreTileTraits, Index_> GlobalStoreIteratorD;
-  /// The transformer for D.
-  typedef Copy<typename GlobalStoreIteratorD::Fragment> GlobalTransformerD;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The GEMM config.
-    typename GemmConfig_,
-    /// The epilogue functor to do the math in the epilogue.
-    typename EpilogueFunctor_,
-    /// The index.
-    typename Index_ = int,
-    /// The helper to create the traits class.
-    typename Helper_ = GemmEpilogueTraitsHelper<GemmConfig_, EpilogueFunctor_, Index_> >
-struct SimplifiedGemmEpilogueTraits : public GemmEpilogueTraits<
-                                          // The output tile.
-                                          typename GemmConfig_::OutputTile,
-                                          // The accumulators.
-                                          typename GemmConfig_::Accumulators,
-                                          // The global iterator for C.
-                                          typename Helper_::GlobalLoadIteratorC,
-                                          // The transformer for C.
-                                          typename Helper_::GlobalTransformerC,
-                                          // The transformer for D.
-                                          typename Helper_::GlobalTransformerD,
-                                          // The global iterator for D.
-                                          typename Helper_::GlobalStoreIteratorD,
-                                          // The iterator to store D to shared memory.
-                                          typename Helper_::SharedStoreIteratorD,
-                                          // The shared store transformer for D.
-                                          typename Helper_::SharedStoreTransformerD,
-                                          // The iterator to load D from shared memory.
-                                          typename Helper_::SharedLoadIteratorD,
-                                          // The number of iterations.
-                                          typename Helper_::Iterations,
-                                          // The strides between iterations.
-                                          typename Helper_::Delta,
-                                          // The functor to be used in the epilogue.
-                                          EpilogueFunctor_,
-                                          // The index.
-                                          Index_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
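Reviewer note on the skew expression in `GemmEpilogueTraitsHelper` removed above: `128 / sizeof(ScalarD) / kScalarsPerStsD / 2 * kScalarsPerStsD` is evaluated left to right with integer division. The sketch below reproduces that arithmetic for a few configurations. `SkewSketch` and `kThreadsPerSts` are hypothetical names introduced for illustration, and the chosen scalar types are examples rather than configurations taken from the source.

```cpp
#include <cassert>

// Hypothetical helper mirroring the removed skew expression; not a CUTLASS type.
template <typename ScalarD, int kScalarsPerStsD>
struct SkewSketch {
  // Number of threads involved in a single STS, per the original comment.
  static constexpr int kThreadsPerSts =
      128 / static_cast<int>(sizeof(ScalarD)) / kScalarsPerStsD;
  // Halve the thread count, then scale back to scalars: the skew applied to odd threads.
  static constexpr int kSkew = kThreadsPerSts / 2 * kScalarsPerStsD;
};

// float, 1 scalar per STS: 128 / 4 / 1 = 32 threads, skew = 32 / 2 * 1 = 16.
static_assert(SkewSketch<float, 1>::kSkew == 16, "float x1");
// float, 4 scalars per STS: 128 / 4 / 4 = 8 threads, skew = 8 / 2 * 4 = 16.
static_assert(SkewSketch<float, 4>::kSkew == 16, "float x4");
// double, 2 scalars per STS: 128 / 8 / 2 = 8 threads, skew = 8 / 2 * 2 = 8.
static_assert(SkewSketch<double, 2>::kSkew == 8, "double x2");
```

For these power-of-two configurations the skew works out to 64 bytes expressed in scalars, whatever the STS width.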
--- a/cutlass/gemm/gemm_global_stream.h
+++ b/cutlass/gemm/gemm_global_stream.h
@ -1,175 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements efficient loading of the thread block-level tile from global memory and
-   storing
-      to shared memory.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/iterator_access.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The load iterator.
-    typename LoadIterator_,
-    /// The store iterator to copy to shared memory.
-    typename StoreIterator_,
-    /// The transformer to be applied after the data has been copied from global memory.
-    typename Transformer_>
-
-struct GlobalLoadStreamBase {
-  /// The load iterator.
-  typedef LoadIterator_ LoadIterator;
-  /// The transformer.
-  typedef Transformer_ Transformer;
-  /// The store iterator to write to shared memory.
-  typedef StoreIterator_ StoreIterator;
-
-  /// The fragment that is copied from global memory.
-  typedef typename LoadIterator::Fragment FetchedFragment;
-  /// The fragment that is obtained after the transformation by the transformer.
-  typedef typename Transformer::OutputFragment TransformedFragment;
-  /// Make sure the fragments match.
-  static_assert((platform::is_same<FetchedFragment, typename Transformer::InputFragment>::value),
-                "The fetched fragment must match the transformer's input fragment.");
-  /// The output fragment.
-  typedef TransformedFragment Fragment;
-  /// Make sure the transformed fragment is the same as the store fragment.
-  static_assert((platform::is_same<TransformedFragment, typename StoreIterator::Fragment>::value),
-                "The transformed fragment must match the store iterator's fragment.");
-
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = LoadIterator::kLayout;
-  /// The scalar type of the iterator.
-  typedef typename LoadIterator::Scalar Scalar;
-  /// The pointer.
-  typedef typename LoadIterator::Pointer Pointer;
-  /// The index.
-  typedef typename LoadIterator::Index Index;
-
-  /// The params.
-  struct Params {
-    // The load iterator.
-    typename LoadIterator::Params load_iterator;
-    // The store iterator.
-    typename StoreIterator::Params store_iterator;
-
-    /// Setup the params.
-    CUTLASS_HOST_DEVICE int initialize(Pointer pointer, Index ld) {
-      int error_code = load_iterator.initialize(pointer, ld);
-      if (error_code) {
-        return error_code;
-      }
-
-      return store_iterator.initialize();
-    }
-  };
-
-  /// The amount of storage in shared memory needed to store the tile.
-  typedef typename StoreIterator::SharedStorage SharedStoreStorage;
-
-  /// The storage in shared memory needed by that stream.
-  union SharedStorage {
-    // The load iterator.
-    typename LoadIterator::SharedStorage load_iterator;
-    // The store iterator.
-    SharedStoreStorage store_iterator;
-  };
-
-  /// Ctor.
-  CUTLASS_DEVICE GlobalLoadStreamBase(Params const& params,
-                                      SharedStorage& shared_storage,
-                                      Coord<3> const bounds,
-                                      Coord<3> const& block)
-      : load_iterator(params.load_iterator, bounds, block),
-        transformer(),
-        store_iterator(params.store_iterator, shared_storage.store_iterator)
-
-  {
-    fetched_fragment.clear();
-  }
-
-  /// Load the data from global memory into the fetched fragment.
-  CUTLASS_DEVICE void copy() { iterator_load(load_iterator, fetched_fragment); }
-
-  /// Commit the data.
-  CUTLASS_DEVICE void commit() {
-    transformer.transform(fetched_fragment, transformed_fragment);
-    iterator_store(store_iterator, transformed_fragment);
-    store_iterator.inc_stage();
-  }
-
-  /// Execute the residue code.
-  CUTLASS_DEVICE void residue(Index k, bool skip_clear = false) {
-    load_iterator.residue(k);
-    if (!skip_clear) {
-      fetched_fragment.clear();
-    }
-  }
-
-  /// The iterator.
-  LoadIterator load_iterator;
-  /// The fragment fetched from global memory.
-  FetchedFragment fetched_fragment;
-  /// The transformer.
-  Transformer transformer;
-  /// The fragment holding the data after it has been fetched from global memory and transformed.
-  TransformedFragment transformed_fragment;
-  /// The store iterator.
-  StoreIterator store_iterator;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The load iterator.
-    typename LoadIterator_,
-    /// The store iterator to copy to shared memory.
-    typename StoreIterator_,
-    /// The transformer to be applied after the data has been copied from global memory.
-    typename Transformer_ = Copy<typename LoadIterator_::Fragment> >
-
-struct GlobalLoadStream : public GlobalLoadStreamBase<LoadIterator_, StoreIterator_, Transformer_> {
-  /// The base class.
-  typedef GlobalLoadStreamBase<LoadIterator_, StoreIterator_, Transformer_> Base;
-
-  /// Ctor.
-  CUTLASS_DEVICE GlobalLoadStream(typename Base::Params const& params,
-                                  typename Base::SharedStorage& shared_storage,
-                                  Coord<3> const& bounds,
-                                  Coord<3> const& block)
-      : Base(params, shared_storage, bounds, block) {}
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-}  // namespace gemm
-}  // namespace cutlass
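Reviewer note on `GlobalLoadStreamBase` removed above: the stream splits its work into `copy()` (global memory to a register fragment) and `commit()` (transform, store to shared memory, advance the stage). The host-side sketch below illustrates that two-phase pattern. `LoadStreamSketch` and all of its members are hypothetical. Zero-filling out-of-bounds elements and wrapping the stage index modulo the stage count are simplifying assumptions, not behavior confirmed by this hunk.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the removed stream; all names are illustrative.
template <std::size_t kFragment, std::size_t kStages>
struct LoadStreamSketch {
  std::vector<float> const& global;                            // stands in for global memory
  std::array<std::array<float, kFragment>, kStages> shared{};  // stands in for shared memory
  std::array<float, kFragment> fetched{};                      // the register fragment
  std::size_t load_offset = 0;
  std::size_t stage = 0;

  explicit LoadStreamSketch(std::vector<float> const& g) : global(g) {}

  // copy(): global memory -> fetched fragment. Out-of-bounds elements are
  // written as zero here (assumption standing in for predicated loads).
  void copy() {
    for (std::size_t i = 0; i < kFragment; ++i) {
      std::size_t const idx = load_offset + i;
      fetched[i] = idx < global.size() ? global[idx] : 0.0f;
    }
    load_offset += kFragment;
  }

  // commit(): identity transform (like Copy), store to the current stage,
  // then advance the stage (assumed to wrap around).
  void commit() {
    shared[stage] = fetched;
    stage = (stage + 1) % kStages;
  }
};
```

Keeping the two phases separate lets the caller issue the global loads early and defer the shared-memory store until the previous tile has been consumed.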
--- a/cutlass/gemm/gemm_global_tile.h
+++ b/cutlass/gemm/gemm_global_tile.h
@ -1,478 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines iterators for efficiently loading and storing to global memory.
-*/
-#pragma once
-
-#include <cutlass/coord.h>
-#include <cutlass/util/platform.h>
-
-#include <cutlass/gemm/gemm_operand.h>
-#include <cutlass/matrix_traits.h>
-#include <cutlass/predicate_vector.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// The following functor reshapes a tile of threads to match a tile of data. The idea is that when
-// the user builds the iterator traits, they may want to specify the tile independently from the
-// number of scalars loaded/stored per instruction. For example, in the row-major version with a
-// tile of size 128x8, the user may want the iterator to work with 32x8 threads if each thread
-// loads 1 scalar per LDG. If the user changes to 4 scalars per LDG, then the tile of threads has
-// to change. The code below detects that and corrects the thread shape automatically; it is a
-// helper for when the user does not specify the right configuration.
-
-template <typename Tile_, typename Threads_, bool = (Tile_::kW < Threads_::kW)>
-struct ReshapeThreads {
-  typedef Threads_ Threads;
-};
-
-template <typename Tile_, typename Threads_>
-struct ReshapeThreads<Tile_, Threads_, true> {
-  typedef Shape<Threads_::kD, Threads_::kH * Threads_::kW / Tile_::kW, Tile_::kW, 1> Threads;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GemmOperand::Kind kOperand_,
-          MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename Tile_,
-          typename Threads_,
-          int kAccessSize_>
-struct GemmGlobalTileTraits {
-  /// Identity of the operand
-  static GemmOperand::Kind const kOperand = kOperand_;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = kLayout_;
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The number of scalars per LDG/STG.
-  static int const kAccessSize = kAccessSize_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kGlobal;
-
-  /// The tile shape
-  typedef typename ReshapeTile<Tile_, kAccessSize_>::Tile Tile;
-  /// The threads shape
-  typedef typename ReshapeThreads<Tile, Threads_>::Threads Threads;
-  /// The relative offset between two elements in the H/W dimension in adjacent threads.
-  typedef Shape<1, 1, Tile::kC> ThreadsDelta;
-
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, Threads::kH, Threads::kW * kAccessSize> Delta;
-  /// Strides for immediate offset computation
-  typedef Shape<0, 0, Threads::kW * ThreadsDelta::kW, kAccessSize> ImmediateOffsetStrides;
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1, Tile::kH / Threads::kH, Tile::kW / Threads::kW, Tile::kC / kAccessSize>
-      Iterations;
-
-  typedef GemmMultiplicandTraits<Tile, kOperand, kLayout> MultiplicandTraits;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
-      int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
-
-      return make_Coord(0, thread_offset_h, thread_offset_w, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Tile_, typename Threads_, int kStrideH_, int kAccessSize_>
-struct GemmGlobalTileCdTraits : public GemmGlobalTileTraits<GemmOperand::kC,
-                                                            MatrixLayout::kColumnMajor,
-                                                            Scalar_,
-                                                            Tile_,
-                                                            Threads_,
-                                                            kAccessSize_> {
-  /// The base class.
-  typedef GemmGlobalTileTraits<GemmOperand::kC,
-                               MatrixLayout::kColumnMajor,
-                               Scalar_,
-                               Tile_,
-                               Threads_,
-                               kAccessSize_>
-      Base;
-
-  /// The stride in the H dimension.
-  static int const kStrideH = kStrideH_;
-  /// Override the strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> Delta;
-
-  typedef typename Base::Iterations Iterations;
-
-  typedef typename Base::Threads Threads;
-
-  typedef typename Base::ThreadsDelta ThreadsDelta;
-
-  typedef typename Base::ImmediateOffsetStrides ImmediateOffsetStrides;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int thread_offset_h = threadIdx.x / Threads::kW * kStrideH * Iterations::kH;
-      int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
-
-      return make_Coord(0, thread_offset_h, thread_offset_w, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename TileTraits_, typename Index_ = int>
-struct GemmGlobalIteratorAb
-    : public TileLoadIterator<TileTraits_,
-                              typename TileTraits_::Scalar,
-                              TileTraits_::MultiplicandTraits::kKstrided ? IteratorAdvance::kH
-                                                                         : IteratorAdvance::kW,
-                              MemorySpace::kGlobal,
-                              Index_> {
-  /// This class.
-  typedef GemmGlobalIteratorAb<TileTraits_, Index_> This_;
-  /// The base class.
-  typedef TileLoadIterator<TileTraits_,
-                           typename TileTraits_::Scalar,
-                           TileTraits_::MultiplicandTraits::kKstrided ? IteratorAdvance::kH
-                                                                      : IteratorAdvance::kW,
-                           MemorySpace::kGlobal,
-                           Index_>
-      Base;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
-  /// Fragment type loaded by the iterator
-  typedef typename Base::Fragment Fragment;
-  /// The scalar.
-  typedef typename TileTraits_::Scalar Scalar;
-  /// The threads.
-  typedef typename TileTraits_::Threads Threads;
-  /// The index.
-  typedef Index_ Index;
-  /// The thread offset
-  typedef typename TileTraits_::ThreadOffset ThreadOffset;
-  /// Specifies in which dimension post-increment accesses advance.
-  static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
-
-  typedef cutlass::PredicateVector<ShapeCount<typename Base::Iterations>::kCount> PredicateVector;
-
-  /// Iterator parameters type
-  typedef typename Base::Params BaseParams;
-
-  struct Params : public BaseParams {
-    /// Initializes params to load a strip-mined tile, given pointer and stride_h.
-    CUTLASS_HOST_DEVICE int initialize(Scalar const* ptr, Index stride_h) {
-      Index inc_d = 0;
-      Index inc_advance = 0;
-      // Move by some columns for each iteration in the H dimension.
-      Index inc_h = Base::Delta::kH * stride_h;
-
-      // Move by some more columns in the number of iterations if the D dimension is > 1.
-      if (Base::Delta::kD > 0) {
-        inc_d = Base::Delta::kD * stride_h - (Base::Iterations::kH - 1) * inc_h;
-      }
-
-      // Move to the beginning of the next iteration.
-      if (kAdvance == IteratorAdvance::kH && Base::Delta::kD > 0) {
-        inc_advance = inc_d;
-      } else if (kAdvance == IteratorAdvance::kH) {
-        inc_advance = inc_h;
-      } else if (Base::Delta::kD > 0) {
-        inc_advance = (Base::Iterations::kW + 0) * ShapeCount<typename Base::Delta>::kWc -
-                      (Base::Iterations::kH - 1) * inc_h -
-                      (Base::Iterations::kD - 1) * Base::Delta::kD * stride_h;
-      } else {
-        inc_advance = (Base::Iterations::kW + 0) * ShapeCount<typename Base::Delta>::kWc -
-                      (Base::Iterations::kH - 1) * inc_h;
-      }
-
-      Base::Params::initialize(ptr, 0, stride_h, 0, inc_d, inc_h, 0, inc_advance);
-      return 0;
-    }
-  };
-
-  /// Offset of an individual lane from the start of the tile
-  Coord<4> thread_offset;
-  /// The parameters
-  Params params;
-
-  CUTLASS_DEVICE void initialize_predicates(const Coord<3>& bounds, const Coord<3>& block) {
-    // Setup the masks to control loads.
-    predicates.fill(0);
-
-    int bounds_h, bounds_w;
-    if (kAdvance == IteratorAdvance::kH) {
-      bounds_w = bounds[2] - block[2];
-      bounds_h = bounds[1];
-
-    } else {
-      bounds_w = bounds[1];
-      bounds_h = bounds[2] - block[1];
-    }
-
-    // Fill in the bits of the predicate vector.
-    for (int d = 0; d < Base::Iterations::kD; ++d) {
-      for (int h = 0; h < Base::Iterations::kH; ++h) {
-        for (int w = 0; w < Base::Iterations::kW; ++w) {
-          for (int c = 0; c < Base::Iterations::kC; ++c) {
-            bool flag = w * Base::Delta::kW < bounds_w;
-            if (kAdvance == IteratorAdvance::kH) {
-              flag = flag && (h * Base::Delta::kH + d * Base::Delta::kD) < bounds_h;
-            } else {
-              flag = flag && (h * Base::Delta::kH) < bounds_h;
-            }
-            int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
-            predicates.set(bit, flag);
-          }
-        }
-      }
-    }
-  }
-
-  /// Ctor.
-  CUTLASS_DEVICE GemmGlobalIteratorAb(Params const& _params,
-                                      const Coord<3>& bounds,
-                                      const Coord<3>& block,
-                                      ThreadOffset thread_offset_func = ThreadOffset())
-      : params(_params) {
-    thread_offset = thread_offset_func();
-    // The column.
-    Index block_h = thread_offset[1];
-    // The contiguous dimension.
-    Index block_w = thread_offset[2];
-
-    // Add the blocks indices.
-    if (kAdvance == IteratorAdvance::kH) {
-      block_h += block[1];
-      block_w += block[2];
-
-    } else {
-      block_h += block[2];
-      block_w += block[1];
-    }
-
-    // Setup the pointer.
-    params.pointer += (block_h * params.stride_h + block_w);
-
-    // Initialize predicates
-    initialize_predicates(bounds, make_Coord(0, block_h, block_w));
-  }
-
-  /// Increment the pointer in the H dimension.
-  CUTLASS_DEVICE void inc_h() { params.pointer += params.inc_h; }
-  /// Increment the pointer in the D dimension.
-  CUTLASS_DEVICE void inc_d() { params.pointer += params.inc_d; }
-  /// Increment the pointer to move to the next iteration.
-  CUTLASS_DEVICE void inc_advance() { params.pointer += params.inc_advance; }
-
-  /// Returns the current pointer
-  CUTLASS_HOST_DEVICE
-  Scalar const* data() const { return params.pointer; }
-
-  /// That's the residue! Update the predicates.
-  CUTLASS_DEVICE void residue(Index k) {
-    // The coordinates of the thread.
-    Index block_h = thread_offset[1];
-    // The contiguous dimension.
-    Index block_w = thread_offset[2];
-
-    // Update the predicate vector.
-    for (int d = 0; d < Base::Iterations::kD; ++d) {
-      for (int h = 0; h < Base::Iterations::kH; ++h) {
-        for (int w = 0; w < Base::Iterations::kW; ++w) {
-          for (int c = 0; c < Base::Iterations::kC; ++c) {
-            Index offset = 0;
-            if (kAdvance == IteratorAdvance::kH) {
-              offset += block_h + h * Base::Delta::kH + d * Base::Delta::kD;
-            } else {
-              offset += block_w + w * Base::Delta::kW;
-            }
-
-            int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
-            if (offset >= k) {
-              predicates.set(bit, false);
-            }
-          }
-        }
-      }
-    }
-  }
-
-  /// Is the iterator valid?
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
-    int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
-    return predicates[bit];
-  }
-
-  /// The predicates.
-  PredicateVector predicates;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename TileTraits_, typename Index_ = int>
-struct GemmGlobalIteratorCd : public TileIteratorBase<TileTraits_,
-                                                      typename TileTraits_::Scalar,
-                                                      IteratorAdvance::kH,
-                                                      MemorySpace::kGlobal,
-                                                      Index_> {
-  /// This class.
-  typedef GemmGlobalIteratorCd<TileTraits_, Index_> This_;
-  /// The base class.
-  typedef TileIteratorBase<TileTraits_,
-                           typename TileTraits_::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kGlobal,
-                           Index_>
-      Base;
-
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
-
-  /// The scalar.
-  typedef typename TileTraits_::Scalar Scalar;
-  /// The pointer.
-  typedef typename TileTraits_::Pointer Pointer;
-  /// The threads.
-  typedef typename TileTraits_::Threads Threads;
-  /// The index.
-  typedef Index_ Index;
-  /// The thread offset
-  typedef typename TileTraits_::ThreadOffset ThreadOffset;
-
-  /// The params.
-  struct Params {
-    /// The pointer.
-    Pointer pointer;
-    /// The stride in the H dimension to setup the thread in the block.
-    Index stride_h;
-    /// The strides to increment the pointer.
-    Index inc_advance, inc_h;
-    /// The strides to increment the predicate offset
-    Index predicate_inc_advance, predicate_inc_h;
-    /// The column offset to compute the predicate for the columns.
-    Index predicate_offset;
-
-    /// Setup the params.
-    CUTLASS_HOST_DEVICE int initialize(
-        Pointer pointer, Index ld, Index bound, Index epilogue_stride_w, Index epilogue_delta_w) {
-      // The pointer.
-      this->pointer = pointer;
-      // Each column of the matrix.
-      stride_h = TileTraits_::ThreadsDelta::kH * ld;
-      // Each thread outputs one column per iteration. The stride between columns is given by
-      // the number of scalars that are loaded per LDS for B.
-      inc_h = ld * TileTraits_::kStrideH;
-      inc_advance =
-          (ld - ld * TileTraits_::kStrideH * (Base::Iterations::kH - 1)) + epilogue_stride_w;
-
-      predicate_offset = bound;
-      predicate_inc_h = TileTraits_::kStrideH;
-      predicate_inc_advance =
-          -((TileTraits_::kStrideH * (Base::Iterations::kH - 1) - 1) + epilogue_delta_w);
-
-      return 0;
-    }
-  };
-
-  Params params;
-  /// Offset of an individual lane from the start of the tile
-  Coord<4> thread_offset;
-
-  /// Ctor.
-  CUTLASS_DEVICE GemmGlobalIteratorCd() {}
-
-  /// Ctor.
-  CUTLASS_DEVICE GemmGlobalIteratorCd(Params const& params,
-                                      const Coord<3>& bounds,
-                                      const Coord<3>& block,
-                                      int offset = 0,
-                                      int pred_offset = 0,
-                                      ThreadOffset thread_offset_func = ThreadOffset())
-      : params(params) {
-    thread_offset = thread_offset_func();
-    // Each warp works on a different column of the tile.
-    int const h = thread_offset[1] + block[1];
-    // Each lane writes a different element.
-    int const w = thread_offset[2] + block[2];
-    // Setup the pointer.
-    this->params.pointer += ((h * params.stride_h + w) + offset);
-
-    // Prepare the vector of predicates.
-    for (int i = 0; i < Base::Iterations::kW; ++i) {
-      predicates.set(i, w + i * Base::Delta::kW < bounds[2]);
-    }
-    this->params.predicate_offset -= (h + pred_offset);
-  }
-
-  /// Increment the pointer in the C dimension.
-  CUTLASS_DEVICE void inc_c() {}
-  /// Increment the pointer in the W dimension.
-  CUTLASS_DEVICE void inc_w() {}
-  /// Increment the pointer in the H dimension.
-  CUTLASS_DEVICE void inc_h() {
-    params.pointer += params.inc_h;
-    params.predicate_offset -= params.predicate_inc_h;
-  }
-  /// Increment the pointer in the D dimension.
-  CUTLASS_DEVICE void inc_d() {}
-  /// Increment the pointer to move to the next iteration.
-  CUTLASS_DEVICE void inc_advance() {
-    params.pointer += params.inc_advance;
-    this->params.predicate_offset -= params.predicate_inc_advance;
-  }
-
-  /// Test the validity of the iterator.
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
-    return predicates.at(w) && params.predicate_offset > 0;
-  }
-
-  /// Returns the raw pointer
-  CUTLASS_HOST_DEVICE
-  Pointer data() { return params.pointer; }
-
-  CUTLASS_HOST_DEVICE
-  Pointer const data() const { return params.pointer; }
-
-  /// The predicates for the row.
-  cutlass::PredicateVector<Base::Iterations::kW> predicates;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
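Note on the removed iterator above: the predicate handling in `initialize_predicates` and `residue` can be reproduced on the host. What follows is a minimal sketch under stated assumptions, not the CUTLASS implementation. The constants, `Predicates`, `bit_index`, `make_predicates`, and the free function `residue` are hypothetical names, a `std::bitset` stands in for `cutlass::PredicateVector`, and only the case where the iterator advances along H (with a single D iteration) is covered.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical compile-time shape of one thread's accesses (not CUTLASS values).
constexpr int kIterationsH = 4;  // accesses along the strided dimension
constexpr int kIterationsW = 2;  // accesses along the contiguous dimension
constexpr int kDeltaH = 8;       // rows between consecutive strided accesses
constexpr int kDeltaW = 4;       // elements between consecutive contiguous accesses

// One guard bit per (h, w) access; stands in for cutlass::PredicateVector.
using Predicates = std::bitset<kIterationsH * kIterationsW>;

// Linearizes an (h, w) access into a bit position.
inline int bit_index(int h, int w) {
  assert(h >= 0 && h < kIterationsH && w >= 0 && w < kIterationsW);
  return h * kIterationsW + w;
}

// Sets a bit only when the access lies inside both bounds, as the
// constructor-time predicate setup does for an iterator advancing along H.
inline Predicates make_predicates(int bounds_h, int bounds_w) {
  Predicates p;
  for (int h = 0; h < kIterationsH; ++h) {
    for (int w = 0; w < kIterationsW; ++w) {
      bool const inside = (w * kDeltaW < bounds_w) && (h * kDeltaH < bounds_h);
      p.set(bit_index(h, w), inside);
    }
  }
  return p;
}

// Residue update: clears every access whose strided offset
// (thread row + h * delta) is at or beyond the k elements that remain.
inline void residue(Predicates& p, int thread_h, int k) {
  for (int h = 0; h < kIterationsH; ++h) {
    for (int w = 0; w < kIterationsW; ++w) {
      if (thread_h + h * kDeltaH >= k) {
        p.set(bit_index(h, w), false);
      }
    }
  }
}
```

The residue step only ever clears bits, so an access that was already out of bounds stays disabled.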
--- a/cutlass/gemm/gemm_operand.h
+++ b/cutlass/gemm/gemm_operand.h
@ -1,141 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear
-   memory.
-*/
-#pragma once
-
-#include <cutlass/matrix_traits.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/util/platform.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Helper to describe attributes of GEMM matrix operands
-template <GemmOperand::Kind kOperand_, MatrixLayout::Kind kLayout_>
-struct GemmOperandTraitsAb {
-  static const bool Congruous =
-      (kOperand_ == GemmOperand::kA ^ kLayout_ == MatrixLayout::kRowMajor);
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmOperand::Kind kOperand_, typename Tile_>
-struct GetExtent;
-
-template <typename Tile_>
-struct GetExtent<GemmOperand::kA, Tile_> {
-  static const int kExtent = Tile_::kW;
-};
-
-template <typename Tile_>
-struct GetExtent<GemmOperand::kB, Tile_> {
-  static const int kExtent = Tile_::kH;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Determines the shape of a multiplicand tile in terms of strided (H) and contiguous (W)
-/// dimensions
-template <typename ThreadBlockTile_, GemmOperand::Kind Usage, MatrixLayout::Kind Layout>
-struct GemmMultiplicandTraits {
-  // Only defined for A or B
-  static_assert(Usage == GemmOperand::kA || Usage == GemmOperand::kB,
-                "MultiplicandTileShape defined only for A or B operands.");
-
-  /// Shape of GEMM thread block tile (K, N, M)
-  typedef ThreadBlockTile_ ThreadBlockTile;
-
-  /// Identifies multiplicand
-  static GemmOperand::Kind const kUsage = Usage;
-
-  /// Layout of tile
-  static MatrixLayout::Kind const kLayout = Layout;
-
-  // True if K is the strided dimension
-  static bool const kKstrided = (kUsage == GemmOperand::kA ^ kLayout == MatrixLayout::kRowMajor);
-
-  /// Map the ThreadBlockShape onto (kH, kW) dimensions for A and B operand
-  typedef typename platform::conditional<
-      kKstrided,
-      Shape<1, ThreadBlockTile::kD, GetExtent<Usage, ThreadBlockTile>::kExtent>,
-      Shape<1, GetExtent<Usage, ThreadBlockTile>::kExtent, ThreadBlockTile::kD> >::type Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Projects a coordinate (K, N, M) onto the inner and outer dimensions defined for each
-/// operand.
-template <GemmOperand::Kind operand, bool Kstrided = true>
-struct ProjectOperand;
-
-/// Project A operand - (0, K, M)
-template <bool Kstrided>
-struct ProjectOperand<GemmOperand::kA, Kstrided> {
-  CUTLASS_HOST_DEVICE
-  static Coord<3> project(Coord<3> const &coord) {
-    if (Kstrided) {
-      return make_Coord(0, coord[0], coord[2]);
-    } else {
-      return make_Coord(0, coord[2], coord[0]);
-    }
-  }
-};
-
-/// Project B operand - (0, K, N)
-template <bool Kstrided>
-struct ProjectOperand<GemmOperand::kB, Kstrided> {
-  CUTLASS_HOST_DEVICE
-  static Coord<3> project(Coord<3> const &coord) {
-    if (Kstrided) {
-      return make_Coord(0, coord[0], coord[1]);
-    } else {
-      return make_Coord(0, coord[1], coord[0]);
-    }
-  }
-};
-
-/// Project C operand - (0, N, M)
-template <>
-struct ProjectOperand<GemmOperand::kC, true> {
-  CUTLASS_HOST_DEVICE
-  static Coord<3> project(Coord<3> const &coord) { return make_Coord(0, coord[1], coord[2]); }
-};
-
-/// Project D operand - (0, N, M)
-template <>
-struct ProjectOperand<GemmOperand::kD, true> {
-  CUTLASS_HOST_DEVICE
-  static Coord<3> project(Coord<3> const &coord) { return make_Coord(0, coord[1], coord[2]); }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
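Note on the removed header above: the `kKstrided` flag and the `ProjectOperand` specializations for A and B reduce to a small amount of logic. The sketch below restates it in plain host C++ under stated assumptions: `Operand`, `Layout`, `Coord3`, `k_is_strided`, and `project` are hypothetical stand-ins for `GemmOperand::Kind`, `MatrixLayout::Kind`, `Coord<3>`, and the templates in the header, and the C and D specializations are left out.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins for cutlass::GemmOperand::Kind and cutlass::MatrixLayout::Kind.
enum class Operand { kA, kB };
enum class Layout { kRowMajor, kColumnMajor };

// Mirrors kKstrided: K is the strided dimension when exactly one of
// "operand is A" and "layout is row-major" holds (a logical XOR).
constexpr bool k_is_strided(Operand op, Layout layout) {
  return (op == Operand::kA) != (layout == Layout::kRowMajor);
}

// A problem coordinate ordered (K, N, M), as in the removed header.
using Coord3 = std::array<int, 3>;

// Mirrors ProjectOperand<...>::project for A and B: the result is
// (0, strided, contiguous), where A pairs K with M and B pairs K with N.
inline Coord3 project(Operand op, bool k_strided, Coord3 const& coord) {
  assert(op == Operand::kA || op == Operand::kB);
  int const k = coord[0];
  int const other = (op == Operand::kA) ? coord[2] : coord[1];  // M for A, N for B
  if (k_strided) {
    return Coord3{{0, k, other}};
  }
  return Coord3{{0, other, k}};
}
```

For example, a column-major A operand has K strided, so a (K, N, M) coordinate of (8, 16, 32) projects to (0, 8, 32).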
--- a/cutlass/gemm/gemm_shared_stream.h
+++ b/cutlass/gemm/gemm_shared_stream.h
@ -1,113 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines abstractions for managing loading and storing fragments to shared memory in the
-      efficient GEMM pipeline.
-*/
-#pragma once
-
-#include <cutlass/gemm/gemm_shared_tile.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The load iterator.
-    typename Iterator_,
-    /// The transformer to be applied after the data has been copied from shared memory.
-    typename Transformer_ = Copy<typename Iterator_::Fragment> >
-
-struct SharedLoadStream {
-  /// The load iterator.
-  typedef Iterator_ Iterator;
-  /// The transformer.
-  typedef Transformer_ Transformer;
-
-  /// The fragment that is copied from shared memory.
-  typedef typename Iterator::Fragment FetchedFragment;
-  /// The fragment that is obtained after the transformation by the transformer.
-  typedef typename Transformer::OutputFragment TransformedFragment;
-  /// Make sure the fragments match.
-  static_assert((platform::is_same<FetchedFragment, typename Transformer::InputFragment>::value),
-                "");
-  /// The output fragment.
-  typedef TransformedFragment Fragment;
-
-  /// The params.
-  struct Params {
-    /// The iterator params.
-    typename Iterator::Params iterator;
-
-    /// Setup the params.
-    CUTLASS_HOST_DEVICE int initialize() { return iterator.initialize(); }
-  };
-
-  /// The storage in shared memory needed by that stream.
-  typedef typename Iterator::Storage SharedStorage;
-
-  /// Ctor.
-  CUTLASS_DEVICE SharedLoadStream() {}
-
-  /// Ctor.
-  CUTLASS_DEVICE SharedLoadStream(Params const &params, SharedStorage &shared_storage) {
-    this->initialize(params, shared_storage);
-  }
-
-  /// Initialize the stream.
-  CUTLASS_DEVICE void initialize(Params const &params, SharedStorage &shared_storage) {
-    // The iterator.
-    iterator = Iterator(params.iterator, shared_storage);
-    // The transformer.
-    transformer = Transformer();
-  }
-
-  /// Load the data from shared memory to the fetch fragment.
-  CUTLASS_DEVICE void copy(FetchedFragment &fetched) { shared_iterator_load(iterator, fetched); }
-
-  /// Load the data from shared memory to the fetch fragment.
-  CUTLASS_DEVICE void copy(int d, FetchedFragment &fetched) {
-    shared_iterator_load(iterator, fetched, d);
-  }
-
-  /// Commit the data.
-  CUTLASS_DEVICE void commit(FetchedFragment &fetched, TransformedFragment &transformed) {
-    transformer.transform(fetched, transformed);
-  }
-
-  /// Increment the stage.
-  CUTLASS_DEVICE void inc_stage() { iterator.inc_stage(); }
-
-  /// The iterator.
-  Iterator iterator;
-  /// The transformer.
-  Transformer transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
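Note on the removed header above: `SharedLoadStream` separates fetching a fragment (`copy`) from transforming it (`commit`), and `inc_stage` moves the iterator to the next stage. The host-only sketch below illustrates that split under stated assumptions. `Fragment`, `ToyIterator`, `Scale2`, and `ToyLoadStream` are hypothetical names, a plain array stands in for shared memory, and the wrap-around in `inc_stage` is an assumption of this sketch rather than something the header shows.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kFragmentSize = 4;
using Fragment = std::array<float, kFragmentSize>;

// Hypothetical iterator: reads one fragment per stage from a staged buffer.
struct ToyIterator {
  float const* base;  // start of the staged buffer
  int stage;          // current stage
  int stages;         // total number of stages

  void load(Fragment& fetched) const {
    for (std::size_t i = 0; i < kFragmentSize; ++i) {
      fetched[i] = base[static_cast<std::size_t>(stage) * kFragmentSize + i];
    }
  }
  // Assumption: stages are visited round-robin.
  void inc_stage() {
    assert(stages > 0);
    stage = (stage + 1) % stages;
  }
};

// Hypothetical transformer: doubles every element of the fetched fragment.
struct Scale2 {
  void transform(Fragment const& in, Fragment& out) const {
    for (std::size_t i = 0; i < kFragmentSize; ++i) {
      out[i] = 2.0f * in[i];
    }
  }
};

// Same split as the removed stream: copy() fetches, commit() transforms.
template <typename Iterator, typename Transformer>
struct ToyLoadStream {
  Iterator iterator;
  Transformer transformer;

  void copy(Fragment& fetched) { iterator.load(fetched); }
  void commit(Fragment const& fetched, Fragment& transformed) {
    transformer.transform(fetched, transformed);
  }
  void inc_stage() { iterator.inc_stage(); }
};
```

Keeping the two steps separate lets a caller issue the fetch for the next stage before the current fragment has been transformed.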
--- a/cutlass/gemm/gemm_shared_tile.h
+++ b/cutlass/gemm/gemm_shared_tile.h
@ -1,406 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines iterators for efficiently loading and storing tiles to and from shared memory.
-*/
-#pragma once
-
-#include <cutlass/gemm/gemm_operand.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerSts_>
-struct GemmSharedStoreTileAbTraits {
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The tile.
-  typedef typename ReshapeTile<Tile_, kScalarsPerSts_>::Tile Tile;
-  /// The threads.
-  typedef Threads_ Threads;
-  /// The strides to compute the base position of the thread.
-  typedef Shape<0, ShapeCount<Tile>::kWc, Tile::kC, kScalarsPerSts_> ThreadsStrides;
-  /// The skew.
-  static int const kSkew = 0;
-  /// The number of scalars per STS.
-  static int const kAccessSize = kScalarsPerSts_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1,
-                Tile::kH / Threads::kH,
-                Tile::kW / Threads::kW,
-                Tile::kC / Threads::kC / kAccessSize>
-      Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kAccessSize> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kAccessSize>
-      ImmediateOffsetStrides;
-
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerSts_, int kSkew_>
-struct GemmSharedStoreWithSkewTileAbTraits {
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The tile without skews.
-  typedef typename ReshapeTile<Tile_, kScalarsPerSts_>::Tile TileWithoutSkew;
-  /// The tile.
-  typedef typename ReshapeTile<Shape<Tile_::kD, Tile_::kH, Tile_::kW + kSkew_>,
-                               kScalarsPerSts_>::Tile Tile;
-  /// The threads.
-  typedef Threads_ Threads;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The number of scalars per STS.
-  static int const kAccessSize = kScalarsPerSts_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1, TileWithoutSkew::kH / Threads::kW, TileWithoutSkew::kW / Threads::kH> Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, ShapeCount<Tile>::kWc, Threads::kH * kAccessSize> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, ShapeCount<Tile>::kWc, Threads::kH * kAccessSize> ImmediateOffsetStrides;
-
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-
- protected:
-  /// The strides to compute the base position of the thread.
-  typedef Shape<0, kScalarsPerSts_, ShapeCount<Tile>::kHwc / Threads::kW> ThreadsStrides;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          typename OutputTile_,
-          typename Warps_,
-          typename ThreadsPerWarp_,
-          typename InstructionShape_,
-          int kStages_,
-          int kScalarsPerLds_,
-          int kSkew_ = 0>
-struct GemmSharedLoadTileATraits {
-  static GemmOperand::Kind const kOperand = GemmOperand::kA;
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The tile without skew.
-  typedef Shape<kStages_,
-                OutputTile_::kD / InstructionShape_::kD,
-                GetExtent<kOperand, OutputTile_>::kExtent * InstructionShape_::kD>
-      TileWithoutSkew_;
-  /// The tile with skew.
-  typedef Shape<kStages_, TileWithoutSkew_::kH, TileWithoutSkew_::kW + kSkew_> TileWithSkew;
-  /// The tile without skew after reshaping.
-  typedef typename ReshapeTile<TileWithoutSkew_, kScalarsPerLds_>::Tile TileWithoutSkew;
-  /// The tile.
-  typedef typename ReshapeTile<TileWithSkew, kScalarsPerLds_>::Tile Tile;
-  /// The number of warps.
-  typedef Warps_ Warps;
-  /// The threads in a warp.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of scalars per LDS.
-  // static int const kScalarsPerLds = kScalarsPerLds_;
-  static int const kAccessSize = kScalarsPerLds_;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of warps.
-  static int const kWarps = GetExtent<kOperand, Warps>::kExtent;
-  /// The number of threads in one dimension of the warp.
-  static int const kThreadsPerWarp = GetExtent<kOperand, ThreadsPerWarp>::kExtent;
-
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1, 1, TileWithoutSkew::kW / kWarps / kThreadsPerWarp /* / kScalarsPerLds*/>
-      Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0>
-      ImmediateOffsetStrides;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // Extract the warp.
-      int const warp = threadIdx.x / kWarpSize % Warps::kW;
-      // Compute the row offset for each thread
-      int const lane = (threadIdx.x & 0x0e) / 2;
-      // The offset.
-      int const offset = (warp * ThreadsPerWarp::kW + lane) * kAccessSize;
-
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          typename OutputTile_,
-          typename Warps_,
-          typename ThreadsPerWarp_,
-          typename InstructionShape_,
-          int kStages_,
-          int kScalarsPerLds_,
-          int kSkew_ = 0>
-struct GemmSharedLoadTileBTraits {
-  static GemmOperand::Kind const kOperand = GemmOperand::kB;
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The tile without skew.
-  typedef Shape<kStages_,
-                OutputTile_::kD / InstructionShape_::kD,
-                GetExtent<kOperand, OutputTile_>::kExtent * InstructionShape_::kD>
-      TileWithoutSkew_;
-  /// The tile with skew.
-  typedef Shape<kStages_, TileWithoutSkew_::kH, TileWithoutSkew_::kW + kSkew_> TileWithSkew;
-  /// The tile without skew after reshaping.
-  typedef typename ReshapeTile<TileWithoutSkew_, kScalarsPerLds_>::Tile TileWithoutSkew;
-  /// The tile.
-  typedef typename ReshapeTile<TileWithSkew, kScalarsPerLds_>::Tile Tile;
-  /// The number of warps.
-  typedef Warps_ Warps;
-  /// The threads in a warp.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of scalars per LDS.
-  static int const kAccessSize = kScalarsPerLds_;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of warps.
-  static int const kWarps = GetExtent<kOperand, Warps>::kExtent;
-  /// The number of threads in one dimension of the warp.
-  static int const kThreadsPerWarp = GetExtent<kOperand, ThreadsPerWarp>::kExtent;
-
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1, 1, TileWithoutSkew::kW / kWarps / kThreadsPerWarp /* / kAccessSize*/> Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0>
-      ImmediateOffsetStrides;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // The position of the warp.
-      int const warp = threadIdx.x / (Warps::kW * kWarpSize);
-
-      // Compute the column offset for each thread
-      int const lane = (threadIdx.x & 0x10) / 8 + (threadIdx.x & 0x01);
-      // The offset.
-      int const offset = (warp * ThreadsPerWarp::kH + lane) * kAccessSize;
-
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          typename OutputTile_,
-          typename Warps_,
-          typename ThreadsPerWarp_,
-          int kScalarsPerSts_,
-          int kSkew_ = 0>
-struct GemmSharedStoreTileDTraits {
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The dimension of the output tile.
-  typedef OutputTile_ OutputTile;
-  /// The warps in the tile.
-  typedef Warps_ Warps;
-  /// The threads in the warps.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of scalars per STS.
-  static int const kAccessSize = kScalarsPerSts_;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of scalars per thread.
-  static int const kScalarsPerThread = OutputTile_::kW / Warps::kW / ThreadsPerWarp::kW;
-  /// The number of threads.
-  static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
-  /// The number of scalars per row. We build a tile with 2 rows (to avoid bank conflicts).
-  static int const kScalarsPerRow = kThreads / 2 * kScalarsPerThread + kSkew;
-
-  /// The tile.
-  typedef Shape<1, 2, kScalarsPerRow / kAccessSize, kAccessSize> Tile;
-  /// The number of iterations needed to store the tile.
-  typedef Shape<1, 1, kScalarsPerThread / kAccessSize> Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Warps::kW * ThreadsPerWarp::kW * kAccessSize> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Warps::kW * ThreadsPerWarp::kW * kAccessSize> ImmediateOffsetStrides;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // We issue STS.128 in the epilogue to store the accumulators to shared memory. When we use
-      // STS.128, we have to guarantee that threads in groups of 8 do not have bank conflicts (i.e.,
-      // they write to different banks).
-
-      // Odd threads go to the second half of shared memory.
-      int const row = threadIdx.x & 0x01;
-
-      int const warp_id = (threadIdx.x >> 5);
-
-      int const warp_row = (warp_id % Warps::kW);
-      int const warp_col = (warp_id / Warps::kW);
-
-      int hi_halfwarp_offset = OutputTile::kW * ((threadIdx.x >> 4) & 1);
-      int lo_halfwarp_offset = (((threadIdx.x >> 1) & 0x7) + warp_row * ThreadsPerWarp::kW);
-
-      int col = kAccessSize * lo_halfwarp_offset +
-                warp_col * (ThreadsPerWarp::kH / 2) * OutputTile::kW + hi_halfwarp_offset;
-
-      int offset = row * kScalarsPerRow + col;
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          typename OutputTile_,
-          typename Warps_,
-          typename ThreadsPerWarp_,
-          int kTileH_,
-          int kScalarsPerLds_,
-          int kSkew_ = 0>
-struct GemmSharedLoadTileDTraits {
-  /// The scalar.
-  typedef typename platform::remove_const<Scalar_>::type Scalar;
-  /// The pointer.
-  typedef Scalar_* Pointer;
-  /// The dimension of the output tile.
-  typedef OutputTile_ OutputTile;
-  /// The warps in the tile.
-  typedef Warps_ Warps;
-  /// The threads in the warps.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of scalars per LDS.
-  static int const kAccessSize = kScalarsPerLds_;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The number of scalars per thread.
-  static int const kScalarsPerThread = OutputTile_::kW / Warps::kW / ThreadsPerWarp::kW;
-  /// The number of threads.
-  static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
-  /// The number of scalars per row. We build a tile with 2 rows (to avoid bank conflicts).
-  static int const kScalarsPerRow = kThreads / 2 * kScalarsPerThread + kSkew;
-
-  /// The tile.
-  typedef Shape<1, 2, kScalarsPerRow / kAccessSize, kAccessSize> Tile;
-
-  // Compute the number of iterations per warp in the Tile::kH dimension.
-  static int const kIterationsInHPerWarp = kTileH_ / ShapeCount<Warps>::kCount;
-
-  // As shown above, the shared memory tile is composed of 2 rows and each row is made of
-  // kScalarsPerRow. A warp is expected to read from the 1st row, then move to the 2nd row and go
-  // back to the 1st row. To model that scheme we define the Iterations shape as Shape<X, 2, ...>.
-  // However, in some cases, we have only 1 iteration per warp. In that case, we must define the
-  // shape as Shape<1, 1, ...>. The following code does that.
-  static int const kIterationsH = kIterationsInHPerWarp == 1 ? 1 : 2;
-  // As soon as we know kIterationsH, it is trivial to compute kIterationsD:
-  static int const kIterationsD = kIterationsInHPerWarp / kIterationsH;
-
-  /// The number of iterations needed to load the tile.
-  typedef Shape<kIterationsD, kIterationsH, OutputTile::kW / kWarpSize / kAccessSize> Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<OutputTile::kW, kScalarsPerRow, kWarpSize * kAccessSize> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<OutputTile::kW, kScalarsPerRow, kWarpSize * kAccessSize> ImmediateOffsetStrides;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // Each warp works on a different column.
-      int const h = threadIdx.x / kWarpSize;
-      // Compute the row.
-      int const w = (threadIdx.x & (kWarpSize - 1)) * kAccessSize;
-      int offset = 0;
-      if (Iterations::kH == 1) {
-        int const row = h & 0x1;
-        int const col = h / 2;
-        offset = row * ShapeCount<Tile>::kWc + col * OutputTile::kW * Iterations::kD + w;
-      } else {
-        offset = h * OutputTile::kW * Iterations::kD + w;
-      }
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/gemm_traits.h
+++ b/cutlass/gemm/gemm_traits.h
@ -1,747 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of complete GEMM computation.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/gemm/clear_accumulators.h>
-#include <cutlass/gemm/gemm_global_stream.h>
-#include <cutlass/gemm/gemm_operand.h>
-#include <cutlass/gemm/gemm_shared_stream.h>
-#include <cutlass/gemm/identity_block_swizzle.h>
-#include <cutlass/matrix_traits.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The scalar type for A.
-    typename ScalarA_,
-    /// The scalar type for B.
-    typename ScalarB_,
-    /// The scalar type for C.
-    typename ScalarC_,
-    /// The scalar type for D.
-    typename ScalarD_,
-    /// The output tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The functor to do the math.
-    typename MultiplyAdd_,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_,
-    /// The number of scalars per STS for A.
-    int kScalarsPerStsA_,
-    /// The number of scalars per LDS for A.
-    int kScalarsPerLdsA_,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_,
-    /// The number of scalars per STS for B.
-    int kScalarsPerStsB_,
-    /// The number of scalars per LDS for B.
-    int kScalarsPerLdsB_,
-    /// The number of scalars per LDG for C and STG for D.
-    int kScalarsPerLdgCAndStgD_,
-    /// The number of scalars per STS for D.
-    int kScalarsPerStsD_,
-    /// The number of scalars per LDS for D.
-    int kScalarsPerLdsD_,
-    /// The number of stages in shared memory to do single/double/triple-buffering.
-    int kStages_>
-
-struct GemmConfig {
-  //
-  /// The scalar for A.
-  typedef ScalarA_ ScalarA;
-  /// The scalar for B.
-  typedef ScalarB_ ScalarB;
-  /// The scalar for C.
-  typedef ScalarC_ ScalarC;
-  /// The scalar for D.
-  typedef ScalarD_ ScalarD;
-
-  /// The tile.
-  typedef OutputTile_ OutputTile;
-  /// The functor to do D = A*B + C.
-  typedef MultiplyAdd_ MultiplyAdd;
-  /// The shape of the instruction.
-  typedef typename MultiplyAdd::InstructionShape InstructionShape;
-  /// The number of accumulators per warp.
-  typedef typename MultiplyAdd::AccumulatorsPerWarp AccumulatorsPerWarp;
-  /// The accumulators.
-  typedef typename MultiplyAdd::Accumulators Accumulators;
-
-  /// The number of warps.
-  typedef typename ShapeDiv<OutputTile, AccumulatorsPerWarp>::Shape Warps;
-  /// The default warp size (32 threads per warp).
-  static int const kWarpSize = cutlass::kWarpSize;
-  /// The number of threads.
-  static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
-
-  /// The number of scalars per LDG/STS/LDS for A.
-  static int const kScalarsPerLdgA = kScalarsPerLdgA_;
-  static int const kScalarsPerStsA = kScalarsPerStsA_;
-  static int const kScalarsPerLdsA = kScalarsPerLdsA_;
-
-  /// The number of scalars per LDG/STS/LDS for B.
-  static int const kScalarsPerLdgB = kScalarsPerLdgB_;
-  static int const kScalarsPerStsB = kScalarsPerStsB_;
-  static int const kScalarsPerLdsB = kScalarsPerLdsB_;
-
-  /// The number of scalars per LDG for C.
-  static int const kScalarsPerLdgC = kScalarsPerLdgCAndStgD_;
-
-  /// The number of scalars per STS/LDS/STG for D.
-  static int const kScalarsPerStgD = kScalarsPerLdgCAndStgD_;
-  static int const kScalarsPerStsD = kScalarsPerStsD_;
-  static int const kScalarsPerLdsD = kScalarsPerLdsD_;
-
-  /// The number of accumulators that are going to be fed from one LDS A/B.
-  static int const kAccumulatorsPerLdsA = kScalarsPerLdsA / InstructionShape::kD;
-  static int const kAccumulatorsPerLdsB = kScalarsPerLdsB / InstructionShape::kD;
-
-  /// The number of stages in shared memory to implement double, triple, more-buffering.
-  static int const kStages = kStages_;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind, typename GemmConfig_>
-struct GemmTileTraitsHelperA {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarA Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
-
-  /// The traits class to build the iterator to load data from global memory for A^N.
-  typedef GemmGlobalTileTraits<
-      // That's A.
-      GemmOperand::kA,
-      // A is column-major.
-      MatrixLayout::kColumnMajor,
-      // The pointer is float const.
-      Scalar const,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgA>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for A^N.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer is float.
-      MultiplyAddScalar,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsA>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for A^N.
-  typedef GemmSharedLoadTileATraits<
-      // The pointer is float const.
-      MultiplyAddScalar const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsA,
-      // The skew.
-      0>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarA Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
-
-  /// The traits class to build the iterator to load data from global memory for A^T.
-  typedef GemmGlobalTileTraits<
-      // That's A.
-      GemmOperand::kA,
-      // A is row-major.
-      MatrixLayout::kRowMajor,
-      // The pointer is float const.
-      Scalar const,
-      // The tile has size MxK in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgA>
-      GlobalTileTraits;
-
-  /// The number of scalars in 4B.
-  static int const kScalarsIn4B = sizeof(MultiplyAddScalar) > 4 ? 1 : 4 / sizeof(MultiplyAddScalar);
-  /// The traits class to build the iterator to store data to shared memory for A^T.
-  typedef GemmSharedStoreWithSkewTileAbTraits<
-      // The pointer is float.
-      MultiplyAddScalar,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS.
-      GemmConfig_::kScalarsPerStsA,
-      // The skew to avoid bank conflicts added in the tile W dimension.
-      128 / sizeof(MultiplyAddScalar) / GemmConfig_::kScalarsPerStsA /
-          GlobalTileTraits::Threads::kW * kScalarsIn4B>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for A^T.
-  typedef GemmSharedLoadTileATraits<
-      // The pointer is float const.
-      MultiplyAddScalar const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsA,
-      // The skew.
-      SharedStoreTileTraits::kSkew>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind, typename GemmConfig_>
-struct GemmTileTraitsHelperB {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarB Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
-
-  /// The traits class to build the iterator to load data from global memory for B^N.
-  typedef GemmGlobalTileTraits<
-      // That's B.
-      GemmOperand::kB,
-      // B is column-major.
-      MatrixLayout::kColumnMajor,
-      // The pointer is float const.
-      Scalar const,
-      // The tile has size NxK in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgB>
-      GlobalTileTraits;
-
-  /// The number of scalars in 4B.
-  static int const kScalarsIn4B = sizeof(MultiplyAddScalar) > 4 ? 1 : 4 / sizeof(MultiplyAddScalar);
-  /// The traits class to build the iterator to store data to shared memory for B^N.
-  typedef GemmSharedStoreWithSkewTileAbTraits<
-      // The pointer is float.
-      MultiplyAddScalar,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS.
-      GemmConfig_::kScalarsPerStsB,
-      // The skew to avoid bank conflicts added in the tile W dimension.
-      128 / sizeof(MultiplyAddScalar) / GemmConfig_::kScalarsPerStsB /
-          GlobalTileTraits::Threads::kW * kScalarsIn4B>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for B^N.
-  typedef GemmSharedLoadTileBTraits<
-      // The pointer is float const.
-      MultiplyAddScalar const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsB,
-      // The skew.
-      SharedStoreTileTraits::kSkew>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarB Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
-
-  /// The traits class to build the iterator to load data from global memory for B^T.
-  typedef GemmGlobalTileTraits<
-      // That's B.
-      GemmOperand::kB,
-      // B is row-major.
-      MatrixLayout::kRowMajor,
-      // The pointer is float const.
-      Scalar const,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kH>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgB>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for B^T.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer is float.
-      MultiplyAddScalar,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsB>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for B^T.
-  typedef GemmSharedLoadTileBTraits<
-      // The pointer is float const.
-      MultiplyAddScalar const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsB,
-      // The skew.
-      0>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The GEMM configuration.
-    typename GemmConfig_,
-    /// The stream to load A from global memory to shared memory.
-    typename GlobalLoadStreamA_,
-    /// The stream to load B from global memory to shared memory.
-    typename GlobalLoadStreamB_,
-    /// The stream to load A from shared memory.
-    typename SharedLoadStreamA_,
-    /// The stream to load B from shared memory.
-    typename SharedLoadStreamB_,
-    /// The epilogue.
-    typename Epilogue_,
-    /// The block swizzle to reorganize the grid.
-    typename BlockSwizzle_ = IdentityBlockSwizzle,
-    /// The index.
-    typename Index_ = int,
-    /// The tool used to clear accumulators.
-    typename ClearAccumulators_ = ClearAccumulators<typename GemmConfig_::Accumulators::Scalar> >
-
-struct GemmTraits {
-  /// The configuration.
-  typedef GemmConfig_ GemmConfig;
-  /// The output tile.
-  typedef typename GemmConfig::OutputTile OutputTile;
-
-  /// The stream to load A from global memory to shared memory.
-  typedef GlobalLoadStreamA_ GlobalLoadStreamA;
-  /// The layout of A.
-  static MatrixLayout::Kind const kLayoutA = GlobalLoadStreamA::kLayout;
-  /// The scalar for A.
-  typedef typename GlobalLoadStreamA_::Scalar ScalarA;
-
-  /// The stream to load B from global memory to shared memory.
-  typedef GlobalLoadStreamB_ GlobalLoadStreamB;
-  /// The layout of B.
-  static MatrixLayout::Kind const kLayoutB = GlobalLoadStreamB::kLayout;
-  /// The scalar for B.
-  typedef typename GlobalLoadStreamB_::Scalar ScalarB;
-
-  /// The iterator for A to load from shared memory.
-  typedef SharedLoadStreamA_ SharedLoadStreamA;
-  /// The iterator for B to load from shared memory.
-  typedef SharedLoadStreamB_ SharedLoadStreamB;
-
-  /// The shared storage for A.
-  typedef typename GlobalLoadStreamA::SharedStoreStorage SharedStoreStorageA;
-  // Make sure we did not mess up the size of the storage.
-  static_assert(sizeof(SharedStoreStorageA) == sizeof(typename SharedLoadStreamA::SharedStorage),
-                "");
-
-  /// The shared storage for B.
-  typedef typename GlobalLoadStreamB::SharedStoreStorage SharedStoreStorageB;
-  // Make sure we did not mess up the size of the storage.
-  static_assert(sizeof(SharedStoreStorageB) == sizeof(typename SharedLoadStreamB::SharedStorage),
-                "");
-
-  /// The multiply-add functor.
-  typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
-  /// The epilogue.
-  typedef Epilogue_ Epilogue;
-  /// The scalars in the epilogue.
-  typedef typename Epilogue::ScalarC ScalarC;
-  typedef typename Epilogue::ScalarD ScalarD;
-
-  /// The block swizzle to reorganize the grid.
-  typedef BlockSwizzle_ BlockSwizzle;
-  /// The index.
-  typedef Index_ Index;
-  /// Clear the accumulators.
-  typedef ClearAccumulators_ ClearAccumulators;
-
-  /// The params.
-  struct Params {
-    /// The dimensions of the GEMM.
-    Index m, n, k;
-    /// The params for the A stream.
-    typename GlobalLoadStreamA::Params global_stream_a;
-    /// The params for the B stream.
-    typename GlobalLoadStreamB::Params global_stream_b;
-    /// The params for the A stream from shared memory.
-    typename SharedLoadStreamA::Params shared_stream_a;
-    /// The params for the B stream from shared memory.
-    typename SharedLoadStreamB::Params shared_stream_b;
-    /// The params for the epilogue.
-    typename Epilogue::Params epilogue;
-
-    /// Initialize the parameters.
-    template <typename GemmDesc_>
-    CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
-      // Set the problem size.
-      this->m = desc.m;
-      this->n = desc.n;
-      this->k = desc.k;
-
-      // Initialize the iterator for A.
-      int error_code =
-          global_stream_a.initialize(reinterpret_cast<ScalarA const*>(desc.d_a), desc.lda);
-
-      if (error_code) {
-        return error_code;
-      }
-
-      // Initialize the iterator for B.
-      error_code = global_stream_b.initialize(reinterpret_cast<ScalarB const*>(desc.d_b), desc.ldb);
-
-      if (error_code) {
-        return error_code;
-      }
-
-      // The epilogue.
-      return epilogue.initialize(desc);
-    }
-  };
-
-  // The storage for one operand stream (instantiated for both A and B).
-  template <typename GlobalLoadStream_, typename SharedLoadStream_>
-  union StreamSharedStorage {
-    // The storage needed by the global stream.
-    typename GlobalLoadStream_::SharedStorage global;
-    // The storage needed by the shared stream.
-    typename SharedLoadStream_::SharedStorage shared;
-  };
-
-  // The storage for the main loop + prologue.
-  struct MainLoopSharedStorage {
-    // The storage to shuffle the A matrix in shared memory.
-    StreamSharedStorage<GlobalLoadStreamA, SharedLoadStreamA> stream_a;
-    // The storage to shuffle the B matrix in shared memory.
-    StreamSharedStorage<GlobalLoadStreamB, SharedLoadStreamB> stream_b;
-    // The storage to clear the accumulators if needed.
-    typename ClearAccumulators::SharedStorage clear;
-  };
-
-  /// The storage in shared memory.
-  union SharedStorage {
-    // The storage for the main loop.
-    MainLoopSharedStorage main_loop;
-    // The storage for the epilogue.
-    typename Epilogue::SharedStorage epilogue;
-  };
-
-  /// Assemble the global load streams for A/B.
-  struct GlobalLoadStream {
-    /// Ctor.
-    CUTLASS_DEVICE GlobalLoadStream(Params const& params,
-                                    SharedStorage& shared_storage,
-                                    dim3 const& block)
-        : stream_a(params.global_stream_a,
-                   shared_storage.main_loop.stream_a.global,
-                   cutlass::make_Coord(0, params.k, params.m),
-                   cutlass::make_Coord(0, 0, block.x)),
-          stream_b(params.global_stream_b,
-                   shared_storage.main_loop.stream_b.global,
-                   cutlass::make_Coord(0, params.k, params.n),
-                   make_Coord(0, 0, block.y)) {}
-
-    /// Trigger the copies from global memory to registers.
-    CUTLASS_DEVICE void copy() {
-      stream_a.copy();
-      stream_b.copy();
-    }
-
-    /// Commit the data.
-    CUTLASS_DEVICE void commit() {
-      stream_a.commit();
-      stream_b.commit();
-    }
-
-    /// Execute the residue code.
-    CUTLASS_DEVICE void residue(Index k, bool skip_clear = false) {
-      stream_a.residue(k, skip_clear);
-      stream_b.residue(k, skip_clear);
-    }
-
-    /// The stream for A.
-    GlobalLoadStreamA stream_a;
-    /// The stream for B.
-    GlobalLoadStreamB stream_b;
-  };
-
-  /// Assemble the shared load stream for A/B.
-  struct SharedLoadStream {
-    /// Ctor.
-    CUTLASS_DEVICE SharedLoadStream(Params const& params, SharedStorage& shared_storage) {
-      stream_a.initialize(params.shared_stream_a, shared_storage.main_loop.stream_a.shared);
-      stream_b.initialize(params.shared_stream_b, shared_storage.main_loop.stream_b.shared);
-    }
-
-    /// Trigger the copies from shared memory to registers.
-    CUTLASS_DEVICE void copy(int step) {
-      stream_a.copy(step, fetched_a[step % 2]);
-      stream_b.copy(step, fetched_b[step % 2]);
-    }
-
-    /// Commit the data.
-    CUTLASS_DEVICE void commit(int step) {
-      stream_a.commit(fetched_a[step % 2], transformed_a[step % 2]);
-      stream_b.commit(fetched_b[step % 2], transformed_b[step % 2]);
-    }
-
-    /// The fragment A.
-    CUTLASS_DEVICE typename SharedLoadStreamA::Fragment const& fragment_a(int step) const {
-      return transformed_a[step % 2];
-    }
-
-    /// The fragment B.
-    CUTLASS_DEVICE typename SharedLoadStreamB::Fragment const& fragment_b(int step) const {
-      return transformed_b[step % 2];
-    }
-
-    /// Increment the stage.
-    CUTLASS_DEVICE void inc_stage() {
-      stream_a.inc_stage();
-      stream_b.inc_stage();
-    }
-
-    /// The stream for A.
-    SharedLoadStreamA stream_a;
-    /// The fragments to fetch A.
-    typename SharedLoadStreamA::FetchedFragment fetched_a[2];
-    /// The fragments to transform A.
-    typename SharedLoadStreamA::TransformedFragment transformed_a[2];
-    /// The stream for B.
-    SharedLoadStreamB stream_b;
-    /// The fragments to fetch B.
-    typename SharedLoadStreamB::FetchedFragment fetched_b[2];
-    /// The fragments to transform B.
-    typename SharedLoadStreamB::TransformedFragment transformed_b[2];
-  };
-
-  /// The memory fence for shared loads.
-  static CUTLASS_DEVICE void shared_load_fence(bool in_loop) {
-    if (SharedLoadStreamA::Iterator::kRequiresLoadFence ||
-        SharedLoadStreamB::Iterator::kRequiresLoadFence) {
-      __syncthreads();
-    }
-  }
-
-  /// The memory fence for shared stores.
-  static CUTLASS_DEVICE void shared_store_fence(bool in_loop) { __syncthreads(); }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmTileTraitsHelperA_, typename GemmTileTraitsHelperB_, typename Index_>
-struct SimplifiedGemmTraitsHelper {
-  /// The global iterator to load A from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA_::GlobalTileTraits, Index_>
-      GlobalLoadIteratorA;
-  /// The data converter for A before storing to shared memory.
-  typedef Copy<typename GlobalLoadIteratorA::Fragment> GlobalTransformerA;
-  /// The iterator to store A to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperA_::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperA_::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorA;
-  /// The stream to load A from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
-      GlobalLoadStreamA;
-
-  /// The global iterator to load B from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB_::GlobalTileTraits, Index_>
-      GlobalLoadIteratorB;
-  /// The data converter for B before storing to shared memory.
-  typedef Copy<typename GlobalLoadIteratorB::Fragment> GlobalTransformerB;
-  /// The iterator to store B to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperB_::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperB_::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorB;
-  /// The stream to load B from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
-      GlobalLoadStreamB;
-
-  /// The iterator to load A from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperA_::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperA_::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorA;
-  /// The stream to load A from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
-  /// The iterator to load B from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperB_::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperB_::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorB;
-  /// The stream to load B from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The config for the GEMM.
-    typename GemmConfig_,
-    /// The epilogue.
-    typename Epilogue_,
-    /// The index.
-    typename Index_ = int,
-    // The configuration for the A matrix.
-    typename GemmTileTraitsHelperA_ = GemmTileTraitsHelperA<kLayoutA_, GemmConfig_>,
-    // The configuration for the B matrix.
-    typename GemmTileTraitsHelperB_ = GemmTileTraitsHelperB<kLayoutB_, GemmConfig_>,
-    // The helper class to create the streams and iterators.
-    typename Helper_ =
-        SimplifiedGemmTraitsHelper<GemmTileTraitsHelperA_, GemmTileTraitsHelperB_, Index_> >
-struct SimplifiedGemmTraits : public GemmTraits<
-                                  // The config.
-                                  GemmConfig_,
-                                  // The stream to load A from global memory to shared memory.
-                                  typename Helper_::GlobalLoadStreamA,
-                                  // The stream to load B from global memory to shared memory.
-                                  typename Helper_::GlobalLoadStreamB,
-                                  // The stream to load A from shared memory.
-                                  typename Helper_::SharedLoadStreamA,
-                                  // The stream to load B from shared memory.
-                                  typename Helper_::SharedLoadStreamB,
-                                  // The epilogue.
-                                  Epilogue_,
-                                  // The block swizzle to reorganize the grid.
-                                  IdentityBlockSwizzle,
-                                  // The index.
-                                  Index_,
-                                  // The tool used to clear accumulators.
-                                  ClearAccumulators<typename GemmConfig_::Accumulators::Element> > {
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/hgemm_global_tile.h
+++ b/cutlass/gemm/hgemm_global_tile.h
@ -1,90 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Tile traits used to construct global tile iterator for HGEMM. This is intended to
-      partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate
-      memory accesses larger than 16 bits.
-*/
-#pragma once
-
-#include <cutlass/coord.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/matrix_traits.h>
-#include <cutlass/reshape_tile.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GemmOperand::Kind kOperand_,
-          MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename Tile_,
-          typename Threads_,
-          int kAccessSize_>
-struct HgemmCrosswiseGlobalTileTraits : public GemmGlobalTileTraits<
-                                            // Which GEMM operand?
-                                            kOperand_,
-                                            // The layout.
-                                            kLayout_,
-                                            // The scalar.
-                                            Scalar_,
-                                            // The tile.
-                                            Tile_,
-                                            // The threads.
-                                            Threads_,
-                                            // The number of scalars per LDG/STG.
-                                            kAccessSize_> {
-  /// The base class.
-  typedef GemmGlobalTileTraits<kOperand_, kLayout_, Scalar_, Tile_, Threads_, kAccessSize_> Base;
-  /// The threads.
-  typedef typename Base::Threads Threads;
-  /// The threads strides.
-  typedef Shape<1, 2, Base::Tile::kC> ThreadsDelta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<Base::Threads::kH * 2, 1, Base::Threads::kW, Base::kAccessSize> Delta;
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<Base::Tile::kH / Base::Threads::kH / 2,
-                2,
-                Base::Tile::kW / Base::Threads::kW,
-                Base::Tile::kC / Base::kAccessSize>
-      Iterations;
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
-      int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
-
-      return make_Coord(0, thread_offset_h, thread_offset_w, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/hgemm_multiply_add.h
+++ b/cutlass/gemm/hgemm_multiply_add.h
@ -1,104 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Specialization implementing multiply-add operation on half-precision floating point
-   fragments.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-#include <cutlass/gemm/thread_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Template performing matrix multiply-add operation within a thread
-template <typename AccumulatorsPerThread_, typename ThreadsPerWarp_>
-struct ThreadMultiplyAdd<AccumulatorsPerThread_, ThreadsPerWarp_, half, half, half> {
-  /// The shape of the instruction.
-  typedef Shape<1, 1, 2, 1> InstructionShape;
-  /// The number of accumulators per thread.
-  typedef AccumulatorsPerThread_ AccumulatorsPerThread;
-  /// The number of threads per warp.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of accumulators per warp.
-  typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
-  /// The type for A.
-  typedef half ScalarA;
-  /// The fragment for A.
-  typedef Fragment<ScalarA, AccumulatorsPerThread::kW> FragmentA;
-  /// The type for B.
-  typedef half ScalarB;
-  /// The fragment for B.
-  typedef Fragment<ScalarB, AccumulatorsPerThread::kH> FragmentB;
-  /// The type for C and D.
-  typedef half ScalarC;
-  /// The accumulators.
-  typedef Fragment<half, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW> Accumulators;
-
-  /// Make sure there's an even number of elements in both dimensions.
-  static_assert(AccumulatorsPerThread::kH % 2 == 0, "Invalid size");
-  static_assert(AccumulatorsPerThread::kW % 2 == 0, "Invalid size");
-
-  /// Ctor.
-  CUTLASS_DEVICE ThreadMultiplyAdd() {}
-
-  /// Multiply : d = a*b + c.
-  CUTLASS_DEVICE void multiply_add(FragmentA const& a,
-                                   FragmentB const& b,
-                                   Accumulators const& c,
-                                   Accumulators& d) {
-#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
-    // The inputs.
-    __half2 const* a_half2 = reinterpret_cast<__half2 const*>(&a[0]);
-    __half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
-    __half2 const* c_half2 = reinterpret_cast<__half2 const*>(&c[0]);
-
-    // The output.
-    __half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
-
-    for (int j = 0; j < AccumulatorsPerThread::kH / 2; ++j) {
-      for (int i = 0; i < AccumulatorsPerThread::kW / 2; ++i) {
-        // The offsets in the output fragment.
-        int const k0 = (2 * j + 0) * (AccumulatorsPerThread::kW / 2) + i;
-        int const k1 = (2 * j + 1) * (AccumulatorsPerThread::kW / 2) + i;
-
-        // Compute the product a[i] * b[j].H0_H0.
-        d_half2[k0] = __hfma2(a_half2[i], __low2half2(b_half2[j]), c_half2[k0]);
-        // Compute the product a[i] * b[j].H1_H1.
-        d_half2[k1] = __hfma2(a_half2[i], __high2half2(b_half2[j]), c_half2[k1]);
-      }
-    }
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/hgemm_swizzle.h
+++ b/cutlass/gemm/hgemm_swizzle.h
@ -1,94 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Transposes a tile of 16b elements. Used by HGEMM to construct a K-strided layout in
-      shared memory for multiplicands.
-*/
-#pragma once
-
-#include <cuda_fp16.h>
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GlobalIterator_>
-struct HgemmSwizzle {
-  /// The global iterator.
-  typedef GlobalIterator_ GlobalIterator;
-  /// The source fragment.
-  typedef typename GlobalIterator::Fragment Fragment;
-  /// The shape of the source fragment.
-  typedef typename GlobalIterator::FragmentShape FragmentShape;
-
-  /// The input fragment.
-  typedef Fragment InputFragment;
-  /// The output fragment.
-  typedef Fragment OutputFragment;
-
-  /// The src/dst must be half fragments.
-  static_assert((platform::is_same<typename Fragment::Element, half>::value), "Works on half");
-
-  /// Each fragment slice must be 2x2: kH == 2 and kW * kC == 2.
-  static_assert(FragmentShape::kH == 2 && ShapeCount<FragmentShape>::kWc == 2, "Not multiple of 2");
-
-  /// Ctor.
-  CUTLASS_DEVICE HgemmSwizzle() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(Fragment const& src, Fragment& dst) {
-    // Expose src/dst as int arrays.
-    int const* src_int = reinterpret_cast<int const*>(&src[0]);
-    int* dst_int = reinterpret_cast<int*>(&dst[0]);
-
-    // Transpose the data.
-    for (int d = 0; d < FragmentShape::kD; ++d) {
-      // The indices to read two consecutive "rows".
-      int const i0 = 2 * d + 0;
-      int const i1 = 2 * d + 1;
-
-      int a0 = src_int[i0];
-      int a1 = src_int[i1];
-
-      int b0, b1;
-      asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b0) : "r"(a0), "r"(a1));
-      asm volatile("prmt.b32 %0, %1, %2, 0x7632;" : "=r"(b1) : "r"(a0), "r"(a1));
-
-      // The indices to store with "strides".
-      int const j0 = 0 * (ShapeCount<FragmentShape>::kDhw / 2) + d;
-      int const j1 = 1 * (ShapeCount<FragmentShape>::kDhw / 2) + d;
-
-      dst_int[j0] = b0;
-      dst_int[j1] = b1;
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/hgemm_traits.h
+++ b/cutlass/gemm/hgemm_traits.h
@ -1,391 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of half-precision GEMM computation.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/reshape_tile.h>
-
-#include <cutlass/gemm/gemm.h>
-#include <cutlass/gemm/gemm_epilogue.h>
-#include <cutlass/gemm/gemm_epilogue_traits.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/gemm/gemm_shared_tile.h>
-#include <cutlass/gemm/gemm_traits.h>
-#include <cutlass/gemm/hgemm_global_tile.h>
-#include <cutlass/gemm/hgemm_multiply_add.h>
-#include <cutlass/gemm/hgemm_swizzle.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_ = 2,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_ = 2>
-struct HgemmConfig
-    : public GemmConfig<
-          /// The scalar type for A.
-          half,
-          /// The scalar type for B.
-          half,
-          /// The scalar type for C.
-          half,
-          /// The scalar type for D.
-          half,
-          /// The tile size for the GEMM KxNxM.
-          OutputTile_,
-          /// The functor to do the math in the main loop.
-          ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, half, half, half>,
-          /// The number of scalars per LDG for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per STS for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per LDS for A.
-          8,
-          /// The number of scalars per LDG for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per STS for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per LDS for B.
-          8,
-          /// The number of scalars per LDG for C and STG for D.
-          2,
-          /// The number of scalars per STS for D.
-          8,
-          /// The number of scalars per LDS for D.
-          2,
-          /// The number of stages in shared memory.
-          2> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
-struct HgemmTransformerA {};
-
-template <typename Iterator_>
-struct HgemmTransformerA<MatrixLayout::kColumnMajor, Iterator_> {
-  typedef Convert<typename Iterator_::Fragment, typename Iterator_::Fragment> Transformer;
-};
-
-template <typename Iterator_>
-struct HgemmTransformerA<MatrixLayout::kRowMajor, Iterator_> {
-  typedef HgemmSwizzle<Iterator_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
-struct HgemmTransformerB {};
-
-template <typename Iterator_>
-struct HgemmTransformerB<MatrixLayout::kRowMajor, Iterator_> {
-  typedef Convert<typename Iterator_::Fragment, typename Iterator_::Fragment> Transformer;
-};
-
-template <typename Iterator_>
-struct HgemmTransformerB<MatrixLayout::kColumnMajor, Iterator_> {
-  typedef HgemmSwizzle<Iterator_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct HgemmTileTraitsHelperA : public GemmTileTraitsHelperA<kLayout_, GemmConfig_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct HgemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_>
-    : public GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> Base;
-
-  /// The traits class to build the iterator to load data from global memory for A^T.
-  typedef HgemmCrosswiseGlobalTileTraits<
-      GemmOperand::kA,
-      // The layout.
-      MatrixLayout::kRowMajor,
-      // The pointer.
-      half const,
-      // The tile has size MxK in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc)
-      GemmConfig_::kScalarsPerLdgA>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for A^T.
-  typedef GemmSharedStoreWithSkewTileAbTraits<
-      // The pointer.
-      half,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      2,
-      // The skew to avoid bank conflicts added in the tile W dimension.
-      128 / sizeof(half) / GlobalTileTraits::Threads::kW / 2>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for A^T.
-  typedef GemmSharedLoadTileATraits<
-      // The pointer.
-      half const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      8,
-      // The skew.
-      SharedStoreTileTraits::kSkew>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct HgemmTileTraitsHelperB : public GemmTileTraitsHelperB<kLayout_, GemmConfig_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct HgemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_>
-    : public GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> Base;
-
-  /// The traits class to build the iterator to load data from global memory for B^N.
-  typedef HgemmCrosswiseGlobalTileTraits<
-      GemmOperand::kB,
-      // The layout.
-      MatrixLayout::kColumnMajor,
-      // The pointer.
-      half const,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc)
-      GemmConfig_::kScalarsPerLdgB>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for B^N.
-  typedef GemmSharedStoreWithSkewTileAbTraits<
-      // The pointer.
-      half,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<GemmConfig_::kStages,
-            GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
-            GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
-      // The threads are distributed as (threads / K) x K (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      2,
-      // The skew to avoid bank conflicts added in the tile W dimension.
-      128 / sizeof(half) / GlobalTileTraits::Threads::kW / 2>
-      SharedStoreTileTraits;
-
-  /// The traits class to build the iterator to load from shared memory for B^N.
-  typedef GemmSharedLoadTileBTraits<
-      // The pointer.
-      half const,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The number of threads per warp.
-      typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
-      // The shape of the FMA instruction.
-      typename GemmConfig_::InstructionShape,
-      // The number of stages.
-      GemmConfig_::kStages,
-      // The number of scalars per LDS.
-      8,
-      // The skew.
-      SharedStoreTileTraits::kSkew>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
-    /// The number of halfs loaded in one LDG for A.
-    int kScalarsPerLdgA_ = 2,
-    /// The number of halfs loaded in one LDG for B.
-    int kScalarsPerLdgB_ = 2,
-    /// The index.
-    typename Index_ = int>
-struct HgemmTraitsHelper {
-  /// The HGEMM config.
-  typedef HgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>
-      GemmConfig;
-  /// The GEMM config for A.
-  typedef HgemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
-  /// The GEMM config for B.
-  typedef HgemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
-
-  /// The iterator to load A from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
-      GlobalLoadIteratorA;
-  /// The default transformer for A.
-  typedef typename HgemmTransformerA<GemmTileTraitsHelperA::kLayout,
-                                     GlobalLoadIteratorA>::Transformer GlobalTransformerA;
-  /// The iterator to store A to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorA;
-  /// The stream to load A from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
-      GlobalLoadStreamA;
-
-  /// The iterator to load B from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
-      GlobalLoadIteratorB;
-  /// The default transformer for B.
-  typedef typename HgemmTransformerB<GemmTileTraitsHelperB::kLayout,
-                                     GlobalLoadIteratorB>::Transformer GlobalTransformerB;
-  /// The iterator to store B to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorB;
-  /// The stream to load B from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
-      GlobalLoadStreamB;
-
-  /// The iterator to load A from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorA;
-  /// The stream to load A from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
-  /// The iterator to load B from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorB;
-  /// The stream to load B from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
-
-  /// The functor to do the multiply-add in the main loop.
-  typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
-  /// The object to clear accumulators.
-  typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
-
-  /// The traits class for the epilogue.
-  typedef SimplifiedGemmEpilogueTraits<GemmConfig, EpilogueFunctor_, Index_> GemmEpilogueTraits;
-  /// The epilogue.
-  typedef GemmEpilogue<GemmEpilogueTraits> Epilogue;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_ = Shape<8, 128, 128>,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_ = LinearScaling<half>,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<8, 8, 16>,
-    /// The number of halfs loaded in one LDG for A.
-    int kScalarsPerLdgA_ = 2,
-    /// The number of halfs loaded in one LDG for B.
-    int kScalarsPerLdgB_ = 2,
-    /// The index.
-    typename Index_ = int,
-    /// The helper class.
-    typename Helper_ = HgemmTraitsHelper<kLayoutA_,
-                                         kLayoutB_,
-                                         OutputTile_,
-                                         EpilogueFunctor_,
-                                         AccumulatorsPerThread_,
-                                         kScalarsPerLdgA_,
-                                         kScalarsPerLdgB_,
-                                         Index_> >
-struct HgemmTraits : public GemmTraits<
-                         // The config.
-                         typename Helper_::GemmConfig,
-                         // The stream to load A from global memory to shared memory.
-                         typename Helper_::GlobalLoadStreamA,
-                         // The stream to load B from global memory to shared memory.
-                         typename Helper_::GlobalLoadStreamB,
-                         // The stream to load A from shared memory.
-                         typename Helper_::SharedLoadStreamA,
-                         // The stream to load B from shared memory.
-                         typename Helper_::SharedLoadStreamB,
-                         // The epilogue.
-                         typename Helper_::Epilogue,
-                         // The block swizzle to reorganize the grid.
-                         IdentityBlockSwizzle,
-                         // The index.
-                         Index_,
-                         // The tool used to clear accumulators.
-                         typename Helper_::ClearAccumulators> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/identity_block_swizzle.h
+++ b/cutlass/gemm/identity_block_swizzle.h
@ -1,48 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines functors for mapping blockIdx to partitions of the GEMM computation.
-
-    Currently, we only implement an identity mapping.
-*/
-#pragma once
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-struct IdentityBlockSwizzle {
-  /// Ctor.
-  CUTLASS_DEVICE IdentityBlockSwizzle() {}
-
-  /// Swizzle the block index.
-  CUTLASS_DEVICE dim3 swizzle() { return blockIdx; }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/igemm_epilogue.h
+++ b/cutlass/gemm/igemm_epilogue.h
@ -1,320 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and
-      floating-point output matrix formats.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/fragment.h>
-#include <cutlass/gemm/gemm_global_stream.h>
-#include <cutlass/gemm/gemm_shared_stream.h>
-#include <cutlass/gemm/igemm_global_tile.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <int kElements_>
-struct IgemmFloatToInt8Converter {
-  /// The input fragment.
-  typedef Fragment<float, kElements_> InputFragment;
-  /// The output fragment.
-  typedef Fragment<int8_t, kElements_> OutputFragment;
-
-  // We are packing 4 floats into int32 registers so we need kElements to be a multiple of 4.
-  static_assert(kElements_ % 4 == 0, "kElements must be multiple of 4");
-
-  /// Ctor.
-  CUTLASS_DEVICE IgemmFloatToInt8Converter() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
-    transform(src, 0, dst);
-  }
-
-  /// Transform a fragment.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
-    // The inputs.
-    float4 const* src_f4 = reinterpret_cast<float4 const*>(&src[0]);
-    // The outputs.
-    int* dst_int = reinterpret_cast<int*>(&dst[0]);
-
-    // Iterate over the floats and pack them together to produce ints.
-    for (int i = 0; i < kElements_ / 4; ++i) {
-      // Read the float4.
-      float4 f4 = src_f4[i];
-
-      // Clamp the 4 elements of the floats to the [-128, +127] range.
-      float x = fmaxf(-128.f, fminf(127.f, f4.x));
-      float y = fmaxf(-128.f, fminf(127.f, f4.y));
-      float z = fmaxf(-128.f, fminf(127.f, f4.z));
-      float w = fmaxf(-128.f, fminf(127.f, f4.w));
-
-      // Convert to integers.
-      int ix = (int)x;
-      int iy = (int)y;
-      int iz = (int)z;
-      int iw = (int)w;
-
-      // Extract the lower bytes to build an int32 with 4 int8.
-      asm volatile("prmt.b32 %0, %0, %1, 0x1140;" : "+r"(ix) : "r"(iy));
-      asm volatile("prmt.b32 %0, %0, %1, 0x1140;" : "+r"(iz) : "r"(iw));
-      asm volatile("prmt.b32 %0, %0, %1, 0x5410;" : "+r"(ix) : "r"(iz));
-
-      // Store the int.
-      dst_int[i] = ix;
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename InputScalar_, typename OutputFragment_>
-struct IgemmGlobalStoreTransformer {
-  typedef Convert<Fragment<InputScalar_, OutputFragment_::kElements>, OutputFragment_> Transformer;
-};
-
-template <int kElements_>
-struct IgemmGlobalStoreTransformer<float, Fragment<int8_t, kElements_> > {
-  typedef IgemmFloatToInt8Converter<kElements_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <int kElements_>
-struct IgemmInt8ToFloatConverter {
-  /// The input fragment.
-  typedef Fragment<int8_t, kElements_> InputFragment;
-  /// The output fragment.
-  typedef Fragment<float, kElements_> OutputFragment;
-
-  // We are unpacking 4 int8s from int32.
-  static_assert(kElements_ % 4 == 0, "kElements must be multiple of 4");
-
-  /// Ctor.
-  CUTLASS_DEVICE IgemmInt8ToFloatConverter() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
-    transform(src, 0, dst);
-  }
-
-  /// Transform a fragment.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
-    // The inputs.
-    int const* src_int = reinterpret_cast<int const*>(&src[0]);
-    // The outputs.
-    float4* dst_f4 = reinterpret_cast<float4*>(&dst[0]);
-
-    // Iterate over the int8 and unpack them together to produce floats.
-    for (int i = 0; i < kElements_ / 4; ++i) {
-      // Read the packed int into iw; ix, iy and iz are written by the PRMTs below.
-      int ix, iy, iz, iw = src_int[i];
-
-      // Extract the 4 bytes.
-      asm volatile("prmt.b32 %0, 0x0, %1, 0x4440;" : "=r"(ix) : "r"(iw));
-      asm volatile("prmt.b32 %0, 0x0, %1, 0x4441;" : "=r"(iy) : "r"(iw));
-      asm volatile("prmt.b32 %0, 0x0, %1, 0x4442;" : "=r"(iz) : "r"(iw));
-      asm volatile("prmt.b32 %0, 0x0, %1, 0x4443;" : "=r"(iw) : "r"(iw));
-
-      // The floats.
-      float fx, fy, fz, fw;
-
-      // Convert to floats (make sure we generate I2F.F32.S8).
-      asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fx) : "r"(ix));
-      asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fy) : "r"(iy));
-      asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fz) : "r"(iz));
-      asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fw) : "r"(iw));
-
-      // Store the float4.
-      dst_f4[i] = make_float4(fx, fy, fz, fw);
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename InputFragment_, typename OutputScalar_>
-struct IgemmGlobalLoadTransformer {
-  typedef Convert<InputFragment_, Fragment<OutputScalar_, InputFragment_::kElements> > Transformer;
-};
-
-template <int kElements_>
-struct IgemmGlobalLoadTransformer<Fragment<int8_t, kElements_>, float> {
-  typedef IgemmInt8ToFloatConverter<kElements_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename InputScalar_, typename OutputFragment_>
-struct IgemmSharedStoreTransformer {
-  typedef Convert<Fragment<InputScalar_, OutputFragment_::kElements>, OutputFragment_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename IgemmConfig_, typename EpilogueFunctor_, typename Index_>
-struct IgemmEpilogueTraitsHelper
-    : public GemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> {
-  /// The base class.
-  typedef GemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> Base;
-  /// The config.
-  typedef IgemmConfig_ IgemmConfig;
-
-  /// The scalar type of the epilogue.
-  typedef typename Base::Scalar Scalar;
-  /// The iterations.
-  typedef typename Base::Iterations Iterations;
-  /// The strides between iterations.
-  typedef typename Base::Delta Delta;
-
-  /// The traits class for the iterator.
-  typedef typename Base::GlobalLoadTileTraits GlobalLoadTileTraits;
-  /// The iterator to load C from global memory.
-  typedef GemmGlobalIteratorCd<GlobalLoadTileTraits> GlobalLoadIteratorC;
-  /// The fragment that needs to be produced by the load iterator.
-  typedef typename GlobalLoadIteratorC::Fragment GlobalFragmentC;
-  /// The transformer from loaded data to math fragment.
-  typedef
-      typename IgemmGlobalLoadTransformer<GlobalFragmentC, Scalar>::Transformer GlobalTransformerC;
-
-  /// The traits class for the iterator.
-  typedef typename Base::GlobalStoreTileTraits GlobalStoreTileTraits;
-  /// The iterator to store D to global memory.
-  typedef GemmGlobalIteratorCd<GlobalStoreTileTraits> GlobalStoreIteratorD;
-  /// The fragment that needs to be passed to that store iterator.
-  typedef typename GlobalStoreIteratorD::Fragment GlobalFragmentD;
-  /// The transformer from epilogue fragments to the fragments stored to global memory.
-  typedef
-      typename IgemmGlobalStoreTransformer<Scalar, GlobalFragmentD>::Transformer GlobalTransformerD;
-
-  /// The traits class for the shared iterator to store D to shared memory.
-  typedef typename Base::SharedStoreTileTraits SharedStoreTileTraits;
-  /// The shared iterator to store D to shared memory.
-  typedef TileStoreIterator<SharedStoreTileTraits,
-                            typename SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kGlobal>
-      SharedStoreIteratorD;
-  /// The fragment that needs to be passed to that store iterator.
-  typedef typename SharedStoreIteratorD::Fragment SharedStoreFragmentD;
-  /// The transformer from accumulators to shared memory fragments.
-  typedef typename IgemmSharedStoreTransformer<typename IgemmConfig::Accumulators::Element,
-                                               SharedStoreFragmentD>::Transformer
-      SharedStoreTransformerD;
-  /// The traits class for the shared iterator to load D from shared memory.
-  typedef typename Base::SharedLoadTileTraits SharedLoadTileTraits;
-  /// The shared iterator to load D from shared memory.
-  typedef TileLoadIterator<SharedLoadTileTraits,
-                           typename SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorD;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The config.
-    typename IgemmConfig_,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_,
-    /// The index.
-    typename Index_ = int,
-    /// The helper class to assemble the traits.
-    typename Helper_ = IgemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> >
-struct IgemmEpilogueTraits : public GemmEpilogueTraits<
-                                 // The output tile.
-                                 typename IgemmConfig_::OutputTile,
-                                 // The accumulators.
-                                 typename IgemmConfig_::Accumulators,
-                                 // The global iterator for C.
-                                 typename Helper_::GlobalLoadIteratorC,
-                                 // The transformer for C.
-                                 typename Helper_::GlobalTransformerC,
-                                 // The transformer for D.
-                                 typename Helper_::GlobalTransformerD,
-                                 // The global iterator for D.
-                                 typename Helper_::GlobalStoreIteratorD,
-                                 // The iterator to store D to shared memory.
-                                 typename Helper_::SharedStoreIteratorD,
-                                 // The shared store transformer for D.
-                                 typename Helper_::SharedStoreTransformerD,
-                                 // The iterator to load D from shared memory.
-                                 typename Helper_::SharedLoadIteratorD,
-                                 // The iterations.
-                                 typename Helper_::Iterations,
-                                 // The strides between iterations.
-                                 typename Helper_::Delta,
-                                 // The functor to be used in the epilogue.
-                                 EpilogueFunctor_,
-                                 // The index.
-                                 Index_> {
-  /// Do we output in int8?
-  static bool const kInt8Output =
-      platform::is_same<typename IgemmConfig_::ScalarC, int8_t>::value != 0;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmEpilogueTraits_, bool = GemmEpilogueTraits_::kInt8Output>
-struct IgemmEpilogue : public GemmEpilogue<GemmEpilogueTraits_> {
-  /// The base class.
-  typedef GemmEpilogue<GemmEpilogueTraits_> Base;
-
-  /// Ctor.
-  CUTLASS_DEVICE IgemmEpilogue(typename Base::Params const& params_,
-                               typename Base::SharedStorage& shared_storage_,
-                               typename Base::Index m_,
-                               typename Base::Index n_)
-      : Base(params_, shared_storage_, m_, n_) {}
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmEpilogueTraits_>
-struct IgemmEpilogue<GemmEpilogueTraits_, true> : public GemmEpilogue<GemmEpilogueTraits_> {
-  /// The base class.
-  typedef GemmEpilogue<GemmEpilogueTraits_> Base;
-
-  /// Ctor.
-  CUTLASS_DEVICE IgemmEpilogue(typename Base::Params const& params_,
-                               typename Base::SharedStorage& shared_storage_,
-                               typename Base::Index m_,
-                               typename Base::Index n_)
-      : Base(params_, shared_storage_, m_, n_) {}
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
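For readers of the removed `IgemmFloatToInt8Converter` and `IgemmInt8ToFloatConverter`, the host-side C++ sketch below shows the arithmetic their comments describe: clamp to [-128, 127], truncate, and pack four low bytes into one 32-bit word, then sign-extend each byte back to a float. The names `pack_float4_to_int8x4` and `unpack_int8x4_to_float4` are hypothetical and not part of CUTLASS. Plain shifts and masks stand in for the PTX instructions. The byte order assumes the usual default-mode reading of the `prmt.b32` selectors in the pack path, and the unpack path does not try to model the exact selectors at all.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical host-side stand-in for one iteration of IgemmFloatToInt8Converter.
// Each value is clamped to [-128, 127], truncated toward zero like the (int) cast,
// and its low byte is placed in byte lanes 0..3 of the result (x lowest, w highest).
inline std::uint32_t pack_float4_to_int8x4(float x, float y, float z, float w) {
  auto low_byte = [](float v) -> std::uint32_t {
    float clamped = std::max(-128.f, std::min(127.f, v));
    int i = static_cast<int>(clamped);
    // Unsigned conversion is modular, so negative values keep their two's-complement byte.
    return static_cast<std::uint32_t>(i) & 0xFFu;
  };
  return low_byte(x) | (low_byte(y) << 8) | (low_byte(z) << 16) | (low_byte(w) << 24);
}

// Hypothetical inverse: sign-extends each byte lane and converts it to float,
// which is the effect the "extract the 4 bytes, convert s8 to f32" comments describe.
inline void unpack_int8x4_to_float4(std::uint32_t packed, float (&out)[4]) {
  for (int k = 0; k < 4; ++k) {
    int byte = static_cast<int>((packed >> (8 * k)) & 0xFFu);
    int value = (byte ^ 0x80) - 0x80;  // sign-extend an 8-bit value held in an int
    out[k] = static_cast<float>(value);
  }
}
```

Out-of-range inputs saturate rather than wrap: packing 200.0f stores 127 and packing -300.0f stores -128, so a round trip through both helpers returns the clamped values.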
--- a/cutlass/gemm/igemm_global_tile.h
+++ b/cutlass/gemm/igemm_global_tile.h
@ -1,95 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements tile iterators to partition the thread block tile into 2D subtiles and
-      efficiently load each. Applies permute transformation to construct 'interleaved K-strided'
-      data layout in which 4-element dot products from the same K index are arranged in consecutive
-      locations within shared memory.
-
-      Supports efficient loads from shared memory to target the DP4A instruction.
-*/
-#pragma once
-
-#include <cutlass/coord.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/matrix_traits.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GemmOperand::Kind kOperand_,
-          MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename Tile_,
-          typename Threads_,
-          int kAccessSize_>
-struct IgemmContiguousGlobalTileTraits : public GemmGlobalTileTraits<
-                                             // Which GEMM operand?
-                                             kOperand_,
-                                             // The layout.
-                                             kLayout_,
-                                             // The scalar.
-                                             Scalar_,
-                                             // The tile.
-                                             Tile_,
-                                             // The threads.
-                                             Threads_,
-                                             // The number of scalars per LDG/STG.
-                                             kAccessSize_> {
-  /// The base class.
-  typedef GemmGlobalTileTraits<kOperand_, kLayout_, Scalar_, Tile_, Threads_, kAccessSize_> Base;
-  /// The threads.
-  typedef typename Base::Threads Threads;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<Base::Threads::kH * 4, 1, Base::Threads::kW, Base::kAccessSize> Delta;
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<Base::Tile::kH / Base::Threads::kH / 4,
-                4,
-                Base::Tile::kW / Base::Threads::kW,
-                Base::Tile::kC / Base::kAccessSize>
-      Iterations;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
-      int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
-
-      return make_Coord(0, thread_offset_h, thread_offset_w, 0);
-    }
-  };
-
- public:
-  /// The threads strides.
-  typedef Shape<1, 4, Base::Tile::kC> ThreadsDelta;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
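The `ThreadOffset` functor in the removed `IgemmContiguousGlobalTileTraits` maps a linear thread index to an (H, W) offset. The host-side sketch below mirrors that arithmetic under stated assumptions: `thread_offset_hw` is a hypothetical name, `threads_w` stands in for `Threads::kW`, and `delta_h` and `delta_w` stand in for `ThreadsDelta::kH` (4 in the removed code) and `ThreadsDelta::kW` (`Base::Tile::kC`).

```cpp
#include <utility>

// Hypothetical host-side mirror of ThreadOffset::operator().
// Returns {offset_h, offset_w}; the D and C components are 0 in the removed code.
inline std::pair<int, int> thread_offset_hw(int thread_id,
                                            int threads_w,
                                            int delta_h,
                                            int delta_w) {
  // Threads are laid out row-major over (H, W): the quotient selects the row,
  // the remainder selects the column, and each is scaled by its stride.
  int offset_h = thread_id / threads_w * delta_h;
  int offset_w = thread_id % threads_w * delta_w;
  return {offset_h, offset_w};
}
```

With illustrative values `threads_w = 8`, `delta_h = 4` and `delta_w = 4`, thread 19 sits in row 2, column 3, giving an offset of (8, 12).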
--- a/cutlass/gemm/igemm_multiply_add.h
+++ b/cutlass/gemm/igemm_multiply_add.h
@ -1,89 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements matrix multiply accumulate operation of 8-bit integer data using DP4A
-   instruction.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-#include <cutlass/gemm/thread_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Template performing matrix multiply-add operation within a thread
-template <typename AccumulatorsPerThread_, typename ThreadsPerWarp_>
-struct ThreadMultiplyAdd<AccumulatorsPerThread_, ThreadsPerWarp_, int8_t, int8_t, int> {
-  /// The shape of the instruction.
-  typedef Shape<4, 1, 1> InstructionShape;
-  /// The number of accumulators per thread.
-  typedef AccumulatorsPerThread_ AccumulatorsPerThread;
-  /// The number of threads per warp.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of accumulators per warp.
-  typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
-  /// The type for A.
-  typedef int8_t ScalarA;
-  /// The fragment for A.
-  typedef Fragment<ScalarA, AccumulatorsPerThread::kW * 4> FragmentA;
-  /// The type for B.
-  typedef int8_t ScalarB;
-  /// The fragment for B.
-  typedef Fragment<ScalarB, AccumulatorsPerThread::kH * 4> FragmentB;
-  /// The type for C and D.
-  typedef int ScalarC;
-  /// The accumulators.
-  typedef Fragment<ScalarC, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW> Accumulators;
-
-  /// Ctor.
-  CUTLASS_DEVICE ThreadMultiplyAdd() {}
-
-  /// Multiply : d = a*b + c.
-  CUTLASS_DEVICE void multiply_add(FragmentA const& a,
-                                   FragmentB const& b,
-                                   Accumulators const& c,
-                                   Accumulators& d) {
-    // The inputs.
-    int const* a_int = reinterpret_cast<int const*>(&a[0]);
-    int const* b_int = reinterpret_cast<int const*>(&b[0]);
-
-    for (int j = 0; j < AccumulatorsPerThread::kH; ++j) {
-      for (int i = 0; i < AccumulatorsPerThread::kW; ++i) {
-        asm volatile("dp4a.s32.s32 %0, %1, %2, %3;"
-                     : "=r"(d[j * AccumulatorsPerThread::kW + i])
-                     : "r"(a_int[i]), "r"(b_int[j]), "r"(c[j * AccumulatorsPerThread::kW + i]));
-      }
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
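The removed `ThreadMultiplyAdd` specialization issues one `dp4a.s32.s32` per accumulator. As a reading aid, the host-side sketch below models that instruction as a signed 4-way byte dot product plus an accumulator, and repeats the same `j`/`i` loop nest. The names `pack_int8x4`, `dp4a_reference` and `multiply_add_reference` are hypothetical and not CUTLASS APIs, and the model ignores everything about the device instruction other than its arithmetic.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: packs four signed bytes into one 32-bit word.
inline std::uint32_t pack_int8x4(std::int8_t b0, std::int8_t b1, std::int8_t b2, std::int8_t b3) {
  std::int8_t bytes[4] = {b0, b1, b2, b3};
  std::uint32_t word;
  std::memcpy(&word, bytes, sizeof(word));
  return word;
}

// Hypothetical host-side model of dp4a.s32.s32: c + sum over k of a[k] * b[k],
// where a[k] and b[k] are the signed bytes of the two packed operands.
inline int dp4a_reference(std::uint32_t a, std::uint32_t b, int c) {
  std::int8_t a_bytes[4];
  std::int8_t b_bytes[4];
  std::memcpy(a_bytes, &a, sizeof(a));
  std::memcpy(b_bytes, &b, sizeof(b));
  int sum = c;
  for (int k = 0; k < 4; ++k) {
    sum += static_cast<int>(a_bytes[k]) * static_cast<int>(b_bytes[k]);
  }
  return sum;
}

// Same loop nest as the removed multiply_add: kH rows come from B, kW columns from A.
template <std::size_t kH, std::size_t kW>
inline void multiply_add_reference(const std::uint32_t (&a)[kW],
                                   const std::uint32_t (&b)[kH],
                                   const int (&c)[kH * kW],
                                   int (&d)[kH * kW]) {
  for (std::size_t j = 0; j < kH; ++j) {
    for (std::size_t i = 0; i < kW; ++i) {
      d[j * kW + i] = dp4a_reference(a[i], b[j], c[j * kW + i]);
    }
  }
}
```

Because both operands are unpacked with the same byte order, the dot product does not depend on the host's endianness.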
--- a/cutlass/gemm/igemm_swizzle.h
+++ b/cutlass/gemm/igemm_swizzle.h
@ -1,115 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Transposes a fragment of data containing packed 8-bit integer elements.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GlobalIterator_>
-struct IgemmSwizzle {
-  /// The global iterator.
-  typedef GlobalIterator_ GlobalIterator;
-  /// The source fragment.
-  typedef typename GlobalIterator::Fragment Fragment;
-  /// The shape of the source fragment.
-  typedef typename GlobalIterator::FragmentShape FragmentShape;
-
-  /// The source fragment.
-  typedef Fragment InputFragment;
-  /// The destination fragment.
-  typedef Fragment OutputFragment;
-
-  /// The src/dst must be int8 fragments.
-  static_assert((platform::is_same<typename Fragment::Element, int8_t>::value), "Works on int8");
-
-  /// The number of elements must be a multiple of 4.
-  static_assert(FragmentShape::kH % 4 == 0 && ShapeCount<FragmentShape>::kWc % 4 == 0,
-                "Not multiple of 4");
-
-  /// Ctor.
-  CUTLASS_DEVICE IgemmSwizzle() {}
-
-  /// Transform a fragment.
-  CUTLASS_DEVICE void transform(Fragment const& src, Fragment& dst) {
-    // Expose src/dst as int arrays.
-    int const* src_int = reinterpret_cast<int const*>(&src[0]);
-    int* dst_int = reinterpret_cast<int*>(&dst[0]);
-
-    // Transpose the data.
-    for (int d = 0; d < FragmentShape::kD; ++d) {
-      for (int h = 0; h < FragmentShape::kH / 4; ++h) {
-        for (int w = 0; w < ShapeCount<FragmentShape>::kWc / 4; ++w) {
-          int const i0 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
-                         (4 * h + 0) * (ShapeCount<FragmentShape>::kWc / 4) + w;
-          int const i1 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
-                         (4 * h + 1) * (ShapeCount<FragmentShape>::kWc / 4) + w;
-          int const i2 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
-                         (4 * h + 2) * (ShapeCount<FragmentShape>::kWc / 4) + w;
-          int const i3 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
-                         (4 * h + 3) * (ShapeCount<FragmentShape>::kWc / 4) + w;
-
-          int a0 = src_int[i0];
-          int a1 = src_int[i1];
-          int a2 = src_int[i2];
-          int a3 = src_int[i3];
-
-          int b0, b1, b2, b3, c0;
-          asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(b0) : "r"(a0), "r"(a1));
-          asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(c0) : "r"(a2), "r"(a3));
-          asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b0) : "r"(b0), "r"(c0));
-
-          asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(b1) : "r"(a0), "r"(a1));
-          asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(c0) : "r"(a2), "r"(a3));
-          asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b1) : "r"(b1), "r"(c0));
-
-          asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(b2) : "r"(a0), "r"(a1));
-          asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(c0) : "r"(a2), "r"(a3));
-          asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b2) : "r"(b2), "r"(c0));
-
-          asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(b3) : "r"(a0), "r"(a1));
-          asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(c0) : "r"(a2), "r"(a3));
-          asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b3) : "r"(b3), "r"(c0));
-
-          dst_int[i0] = b0;
-          dst_int[i1] = b1;
-          dst_int[i2] = b2;
-          dst_int[i3] = b3;
-        }
-      }
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
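The `prmt.b32` sequence in `IgemmSwizzle::transform` amounts to a 4x4 byte transpose across four 32-bit words. The sketch below models it on the host, assuming the default (generic) mode of `prmt` and ignoring the sign-replication bit, which none of the selectors above set. `prmt_reference` and `transpose4x4_reference` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Host model of prmt.b32 d, a, b, selector in default mode. The eight source
// bytes are numbered 0..3 from a and 4..7 from b; each 4-bit selector field
// picks the source byte for one destination byte. Bit 3 of each field (sign
// replication) is ignored in this sketch.
inline uint32_t prmt_reference(uint32_t a, uint32_t b, uint32_t selector) {
  uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
  uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t index = (selector >> (4 * i)) & 0x7u;
    uint32_t byte = static_cast<uint32_t>((pool >> (8 * index)) & 0xffu);
    result |= byte << (8 * i);
  }
  return result;
}

// Same three-step pattern as the removed code: gather byte k of words 0/1,
// gather byte k of words 2/3, then merge the two low halves with 0x5410.
inline void transpose4x4_reference(const uint32_t in[4], uint32_t out[4]) {
  static const uint32_t kSelectors[4] = {0x0040u, 0x0051u, 0x0062u, 0x0073u};
  for (int k = 0; k < 4; ++k) {
    uint32_t lo = prmt_reference(in[0], in[1], kSelectors[k]);
    uint32_t hi = prmt_reference(in[2], in[3], kSelectors[k]);
    out[k] = prmt_reference(lo, hi, 0x5410u);
  }
}
```

After the call, byte `j` of `out[k]` equals byte `k` of `in[j]`, which is the layout the packed 8-bit multiply-add expects.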
--- a/cutlass/gemm/igemm_traits.h
+++ b/cutlass/gemm/igemm_traits.h
@ -1,393 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed
-      to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output
-      formats vary.
-*/
-#pragma once
-
-#include <cutlass/convert.h>
-#include <cutlass/gemm/gemm.h>
-#include <cutlass/gemm/gemm_epilogue.h>
-#include <cutlass/gemm/gemm_epilogue_traits.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/gemm/gemm_shared_tile.h>
-#include <cutlass/gemm/gemm_traits.h>
-#include <cutlass/gemm/igemm_epilogue.h>
-#include <cutlass/gemm/igemm_global_tile.h>
-#include <cutlass/gemm/igemm_multiply_add.h>
-#include <cutlass/gemm/igemm_swizzle.h>
-#include <cutlass/reshape_tile.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The output type.
-    typename ScalarD_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_>
-struct IgemmConfig
-    : public GemmConfig<
-          /// The scalar type for A.
-          int8_t,
-          /// The scalar type for B.
-          int8_t,
-          /// The scalar type for C.
-          ScalarD_,
-          /// The scalar type for D.
-          ScalarD_,
-          /// The tile size for the GEMM KxNxM.
-          OutputTile_,
-          /// The functor to do the math in the main loop.
-          ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, int8_t, int8_t, int>,
-          /// The number of scalars per LDG for A.
-          4,
-          /// The number of scalars per STS for A.
-          4,
-          /// The number of scalars per LDS for A.
-          16,
-          /// The number of scalars per LDG for B.
-          4,
-          /// The number of scalars per STS for B.
-          4,
-          /// The number of scalars per LDS for B.
-          16,
-          /// The number of scalars per LDG for C and STG for D.
-          1,
-          /// The number of scalars per STS for D.
-          4,
-          /// The number of scalars per LDS for D.
-          1,
-          /// The number of stages in shared memory.
-          2> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename OutputTile_, typename AccumulatorsPerThread_>
-struct IgemmConfig<OutputTile_, int8_t, AccumulatorsPerThread_>
-    : public GemmConfig<
-          /// The scalar type for A.
-          int8_t,
-          /// The scalar type for B.
-          int8_t,
-          /// The scalar type for C.
-          int8_t,
-          /// The scalar type for D.
-          int8_t,
-          /// The tile size for the GEMM KxNxM.
-          OutputTile_,
-          /// The functor to do the math in the main loop.
-          ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, int8_t, int8_t, int>,
-          /// The number of scalars per LDG for A.
-          4,
-          /// The number of scalars per STS for A.
-          4,
-          /// The number of scalars per LDS for A.
-          16,
-          /// The number of scalars per LDG for B.
-          4,
-          /// The number of scalars per STS for B.
-          4,
-          /// The number of scalars per LDS for B.
-          16,
-          /// The number of scalars per LDG for C and STG for D.
-          4,
-          /// The number of scalars per STS for D.
-          4,
-          /// The number of scalars per LDS for D.
-          4,
-          /// The number of stages in shared memory.
-          2> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct IgemmTileTraitsHelperA : public GemmTileTraitsHelperA<kLayout_, GemmConfig_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct IgemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_>
-    : public GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> Base;
-
-  /// The number of scalars per LDG/STS/LDS for A.
-  static int const kScalarsPerStsA = 16;
-
-  /// The traits class to build the iterator to load data from global memory for A^N.
-  typedef IgemmContiguousGlobalTileTraits<
-      GemmOperand::kA,
-      // The layout.
-      MatrixLayout::kColumnMajor,
-      // The pointer is int8_t const.
-      int8_t const,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      4>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for A^N.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer is int8_t.
-      int8_t,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<GemmConfig_::kStages, GemmConfig_::OutputTile::kD / 4, GemmConfig_::OutputTile::kW * 4>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      kScalarsPerStsA>
-      SharedStoreTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct IgemmTileTraitsHelperB : public GemmTileTraitsHelperB<kLayout_, GemmConfig_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct IgemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_>
-    : public GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> Base;
-
-  /// The number of scalars per LDG/STS/LDS for B.
-  static int const kScalarsPerStsB = 16;
-
-  /// The traits class to build the iterator to load data from global memory for B^T.
-  typedef IgemmContiguousGlobalTileTraits<
-      GemmOperand::kB,
-      // The layout.
-      MatrixLayout::kRowMajor,
-      // The pointer is int8_t const.
-      int8_t const,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kH>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      4>
-      GlobalTileTraits;
-
-  /// The traits class to build the iterator to store data to shared memory for B^T.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer is int8_t.
-      int8_t,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<GemmConfig_::kStages, GemmConfig_::OutputTile::kD / 4, GemmConfig_::OutputTile::kH * 4>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      kScalarsPerStsB>
-      SharedStoreTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
-struct IgemmTransformerA {};
-
-template <typename Iterator_>
-struct IgemmTransformerA<MatrixLayout::kRowMajor, Iterator_> {
-  typedef Copy<typename Iterator_::Fragment> Transformer;
-};
-
-template <typename Iterator_>
-struct IgemmTransformerA<MatrixLayout::kColumnMajor, Iterator_> {
-  typedef IgemmSwizzle<Iterator_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
-struct IgemmTransformerB {};
-
-template <typename Iterator_>
-struct IgemmTransformerB<MatrixLayout::kColumnMajor, Iterator_> {
-  typedef Copy<typename Iterator_::Fragment> Transformer;
-};
-
-template <typename Iterator_>
-struct IgemmTransformerB<MatrixLayout::kRowMajor, Iterator_> {
-  typedef IgemmSwizzle<Iterator_> Transformer;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_,
-    /// The output type.
-    typename ScalarD_,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
-    /// The index.
-    typename Index_ = int>
-struct IgemmTraitsHelper {
-  /// The IGEMM config.
-  typedef IgemmConfig<OutputTile_, ScalarD_, AccumulatorsPerThread_> GemmConfig;
-  /// The GEMM config for A.
-  typedef IgemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
-  /// The GEMM config for B.
-  typedef IgemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
-
-  /// The iterator to load A from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
-      GlobalLoadIteratorA;
-  /// The default transformer for A.
-  typedef typename IgemmTransformerA<GemmTileTraitsHelperA::kLayout,
-                                     GlobalLoadIteratorA>::Transformer GlobalTransformerA;
-  /// The iterator to store A to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorA;
-  /// The stream to load A from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
-      GlobalLoadStreamA;
-
-  /// The iterator to load B from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
-      GlobalLoadIteratorB;
-  // The default transformer for B.
-  typedef typename IgemmTransformerB<GemmTileTraitsHelperB::kLayout,
-                                     GlobalLoadIteratorB>::Transformer GlobalTransformerB;
-  /// The iterator to store B to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorB;
-  /// The stream to load B from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
-      GlobalLoadStreamB;
-
-  /// The iterator to load A from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorA;
-  /// The stream to load A from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorA, Copy<typename SharedLoadIteratorA::Fragment> >
-      SharedLoadStreamA;
-  /// The iterator to load B from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorB;
-  /// The stream to load B from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorB, Copy<typename SharedLoadIteratorB::Fragment> >
-      SharedLoadStreamB;
-
-  /// The multiply-add functor.
-  typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
-  /// The object to clear accumulators.
-  typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
-
-  /// The epilogue.
-  typedef IgemmEpilogue<IgemmEpilogueTraits<GemmConfig, EpilogueFunctor_> > Epilogue;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename ScalarD_>
-struct IgemmEpilogueScalar {
-  typedef float Scalar;
-};
-
-template <>
-struct IgemmEpilogueScalar<int> {
-  typedef int Scalar;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_ = Shape<32, 128, 128>,
-    /// The output type.
-    typename ScalarD_ = int,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_ = LinearScaling<typename IgemmEpilogueScalar<ScalarD_>::Scalar>,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
-    /// The index.
-    typename Index_ = int,
-    /// The helper class.
-    typename Helper_ = IgemmTraitsHelper<kLayoutA_,
-                                         kLayoutB_,
-                                         OutputTile_,
-                                         ScalarD_,
-                                         EpilogueFunctor_,
-                                         AccumulatorsPerThread_,
-                                         Index_> >
-struct IgemmTraits : public GemmTraits<
-                         // The config.
-                         typename Helper_::GemmConfig,
-                         // The stream to load A from global memory to shared memory.
-                         typename Helper_::GlobalLoadStreamA,
-                         // The stream to load B from global memory to shared memory.
-                         typename Helper_::GlobalLoadStreamB,
-                         // The stream to load A from shared memory.
-                         typename Helper_::SharedLoadStreamA,
-                         // The stream to load B from shared memory.
-                         typename Helper_::SharedLoadStreamB,
-                         // The epilogue.
-                         typename Helper_::Epilogue,
-                         // The block swizzle to reorganize the grid.
-                         IdentityBlockSwizzle,
-                         // The index.
-                         Index_,
-                         // The tool used to clear accumulators.
-                         typename Helper_::ClearAccumulators> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
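The removed `igemm_traits.h` picks its transformer and epilogue scalar through template specialization. The sketch below shows that dispatch pattern in isolation. `Layout`, `CopyTransformer`, `SwizzleTransformer`, `TransformerA`, and `EpilogueScalar` are stand-ins assumed for illustration, not the CUTLASS types themselves.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Stand-ins for MatrixLayout::Kind, Copy<...> and IgemmSwizzle<...>.
enum class Layout { kRowMajor, kColumnMajor };
struct CopyTransformer {};
struct SwizzleTransformer {};

// The primary template is intentionally empty, so asking for the Transformer
// of an unsupported layout is a compile-time error.
template <Layout kLayout>
struct TransformerA {};

// Row-major A is passed through unchanged in this sketch.
template <>
struct TransformerA<Layout::kRowMajor> {
  typedef CopyTransformer Transformer;
};

// Column-major A goes through the byte swizzle.
template <>
struct TransformerA<Layout::kColumnMajor> {
  typedef SwizzleTransformer Transformer;
};

// Epilogue scalar selection: float by default, int when the output is int.
template <typename ScalarD>
struct EpilogueScalar {
  typedef float Scalar;
};

template <>
struct EpilogueScalar<int> {
  typedef int Scalar;
};
```

Leaving the primary template empty moves a bad layout choice from a run-time failure to a compile-time one, at the cost of a less readable error message.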
--- a/cutlass/gemm/linear_scaling.h
+++ b/cutlass/gemm/linear_scaling.h
@ -1,86 +0,0 @@
-
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements the BLAS linear scaling function alpha*AB + beta*C
-*/
-#pragma once
-
-#include <cutlass/fragment_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Functor to compute linear combination of fragments
-template <typename Scalar_, typename FragmentMultiplyAdd_ = FragmentMultiplyAdd<Scalar_> >
-struct LinearScaling {
-  // The scalar.
-  typedef Scalar_ Scalar;
-  // The adapter.
-  typedef FragmentMultiplyAdd_ FragmentMultiplyAdd;
-
-  /// The parameters.
-  struct Params {
-    /// The alpha/beta scaling params.
-    Scalar alpha, beta;
-
-    /// Initialize the parameters.
-    template <typename GemmDesc_>
-    CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
-      alpha = desc.alpha;
-      beta = desc.beta;
-      return 0;
-    }
-  };
-
-  /// Ctor.
-  CUTLASS_DEVICE LinearScaling(Params const& params) : alpha(params.alpha), beta(params.beta) {}
-
-  /// Evaluate the functor.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void evaluate(Fragment_ const& accum, Fragment_& output) {
-    FragmentMultiplyAdd mad;
-    mad.multiply(alpha, accum, output);
-  }
-
-  /// Evaluate the functor.
-  template <typename Fragment_>
-  CUTLASS_DEVICE void evaluate(Fragment_ const& accum, Fragment_ const& old, Fragment_& output) {
-    FragmentMultiplyAdd mad;
-    Fragment_ tmp;
-    mad.multiply(beta, old, tmp);
-    mad.multiply_add(alpha, accum, tmp, output);
-  }
-
-  /// The alpha/beta scaling factors.
-  Scalar alpha, beta;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
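The two `evaluate` overloads of the removed `LinearScaling` functor compute `alpha * accum` and `alpha * accum + beta * old` element by element. A minimal host-side sketch, using plain arrays in place of CUTLASS fragments, is given below; `linear_scaling_reference` is a hypothetical name used only here.

```cpp
#include <cassert>
#include <cstddef>

// output = alpha * accum, matching the two-argument evaluate().
template <typename Scalar, std::size_t N>
void linear_scaling_reference(Scalar alpha, const Scalar (&accum)[N],
                              Scalar (&output)[N]) {
  for (std::size_t i = 0; i < N; ++i) {
    output[i] = alpha * accum[i];
  }
}

// output = alpha * accum + beta * old, matching the three-argument
// evaluate(): beta * old goes into a temporary first, then the multiply-add.
template <typename Scalar, std::size_t N>
void linear_scaling_reference(Scalar alpha, const Scalar (&accum)[N],
                              Scalar beta, const Scalar (&old)[N],
                              Scalar (&output)[N]) {
  for (std::size_t i = 0; i < N; ++i) {
    Scalar tmp = beta * old[i];
    output[i] = alpha * accum[i] + tmp;
  }
}
```

With `alpha = 1` and `beta = 0`, the second overload reduces to a copy of the accumulators, the usual GEMM case with no source matrix.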
--- a/cutlass/gemm/sgemm_traits.h
+++ b/cutlass/gemm/sgemm_traits.h
@ -1,127 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of single-precision GEMM.
-*/
-#pragma once
-
-#include <cutlass/gemm/gemm.h>
-#include <cutlass/gemm/gemm_epilogue.h>
-#include <cutlass/gemm/gemm_epilogue_traits.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/gemm/gemm_shared_tile.h>
-#include <cutlass/gemm/gemm_traits.h>
-#include <cutlass/gemm/thread_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_ = 1,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_ = 1>
-struct SgemmConfig
-    : public GemmConfig<
-          /// The scalar type for A.
-          float,
-          /// The scalar type for B.
-          float,
-          /// The scalar type for C.
-          float,
-          /// The scalar type for D.
-          float,
-          /// The tile size for the GEMM KxNxM.
-          OutputTile_,
-          /// The functor to do the math in the main loop.
-          ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, float, float, float>,
-          /// The number of scalars per LDG for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per STS for A.
-          kScalarsPerLdgA_,
-          /// The number of scalars per LDS for A.
-          4,
-          /// The number of scalars per LDG for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per STS for B.
-          kScalarsPerLdgB_,
-          /// The number of scalars per LDS for B.
-          4,
-          /// The number of scalars per LDG for C and STG for D.
-          1,
-          /// The number of scalars per STS for D.
-          4,
-          /// The number of scalars per LDS for D.
-          1,
-          /// The number of stages in shared memory.
-          2> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_ = Shape<8, 128, 128>,
-    /// The functor to use in the epilogue.
-    typename EpilogueFunctor_ = LinearScaling<float>,
-    /// The number of accumulators per thread.
-    typename AccumulatorsPerThread_ = Shape<8, 8, 8>,
-    /// The number of floats loaded in one LDG for A.
-    int kScalarsPerLdgA_ = 1,
-    /// The number of floats loaded in one LDG for B.
-    int kScalarsPerLdgB_ = 1,
-    /// The index.
-    typename Index_ = int,
-    /// The SGEMM config.
-    typename GemmConfig_ =
-        SgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>,
-    /// The traits class for the epilogue.
-    typename GemmEpilogueTraits_ =
-        SimplifiedGemmEpilogueTraits<GemmConfig_, EpilogueFunctor_, Index_> >
-struct SgemmTraits : public SimplifiedGemmTraits<
-                         // The layout for A.
-                         kLayoutA_,
-                         // The layout for B.
-                         kLayoutB_,
-                         // The config.
-                         GemmConfig_,
-                         // The epilogue.
-                         GemmEpilogue<GemmEpilogueTraits_>,
-                         // The index.
-                         Index_> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/thread_multiply_add.h
+++ b/cutlass/gemm/thread_multiply_add.h
@ -1,84 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Template implementing matrix multiply-add operations on fragments.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Template performing matrix multiply-add operation within a thread
-template <typename AccumulatorsPerThread_,
-          typename ThreadsPerWarp_,
-          typename ScalarA_,
-          typename ScalarB_,
-          typename ScalarC_>
-struct ThreadMultiplyAdd {
-  /// The shape of the instruction.
-  typedef Shape<1, 1, 1, 1> InstructionShape;
-  /// The number of accumulators per thread.
-  typedef AccumulatorsPerThread_ AccumulatorsPerThread;
-  /// The number of threads per warp.
-  typedef ThreadsPerWarp_ ThreadsPerWarp;
-  /// The number of accumulators per warp.
-  typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
-  /// The type for A.
-  typedef ScalarA_ ScalarA;
-  /// The fragment for A.
-  typedef Fragment<ScalarA, AccumulatorsPerThread::kW> FragmentA;
-  /// The type for B.
-  typedef ScalarB_ ScalarB;
-  /// The fragment for B.
-  typedef Fragment<ScalarB, AccumulatorsPerThread::kH> FragmentB;
-  /// The type for C and D.
-  typedef ScalarC_ ScalarC;
-  /// The accumulators.
-  typedef Fragment<ScalarC, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW, 16> Accumulators;
-
-  /// Ctor.
-  CUTLASS_DEVICE ThreadMultiplyAdd() {}
-
-  /// Multiply : d = a*b + c.
-  CUTLASS_DEVICE void multiply_add(FragmentA const& a,
-                                   FragmentB const& b,
-                                   Accumulators const& c,
-                                   Accumulators& d) {
-    for (int j = 0; j < AccumulatorsPerThread::kH; ++j) {
-      for (int i = 0; i < AccumulatorsPerThread::kW; ++i) {
-        d[j * AccumulatorsPerThread::kW + i] = a[i] * b[j] + c[j * AccumulatorsPerThread::kW + i];
-      }
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/wmma_gemm_epilogue_traits.h
+++ b/cutlass/gemm/wmma_gemm_epilogue_traits.h
@ -1,161 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of WMMA GEMM's epilogue phase.
-*/
-#pragma once
-
-#include <cutlass/wmma_matrix.h>
-#ifdef CUTLASS_USE_WMMA_API
-
-#include <cutlass/convert.h>
-#include <cutlass/coord.h>
-#include <cutlass/gemm/gemm_global_stream.h>
-#include <cutlass/gemm/gemm_shared_stream.h>
-#include <cutlass/gemm/linear_scaling.h>
-#include <cutlass/gemm/wmma_gemm_global_tile.h>
-#include <cutlass/gemm/wmma_gemm_shared_tile.h>
-#include <cutlass/reshape_tile.h>
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_, typename EpilogueFunctor_, typename Index_ = int>
-struct WmmaGemmEpilogueTraitsHelper {
-  /// The scalar.
-  typedef typename EpilogueFunctor_::Scalar Scalar;
-  /// The output tile.
-  typedef typename GemmConfig_::OutputTile OutputTile;
-
-  /// The number of WMMAs in the H dimension.
-  static int const kWmmasPerH =
-      GemmConfig_::AccumulatorsPerWarp::kH / GemmConfig_::InstructionShape::kH;
-  /// The number of iterations in the epilogue. That's the number of "horizontal" WMMAs.
-  typedef Shape<1, 1, kWmmasPerH> Iterations;
-  // The iteration strides in the H/W dimension.
-  typedef Shape<0, 0, 0> Delta;
-  /// The functor to do the math in the epilogue.
-  typedef EpilogueFunctor_ Functor;
-
-  /// The traits class to build the iterator to store to shared memory for D.
-  typedef WmmaGemmSharedStoreTileDTraits<
-      // The output layout.
-      MatrixLayout::kColumnMajor,
-      // The pointer is float.
-      typename Functor::Scalar,
-      // The output tile size.
-      typename GemmConfig_::OutputTile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The shape of the instruction.
-      typename GemmConfig_::InstructionShape>
-      SharedStoreTileTraits;
-
-  typedef WmmaMatrix<GemmOperand::kC,
-                     MatrixLayout::kColumnMajor,
-                     Scalar,
-                     typename GemmConfig_::InstructionShape>
-      WmmaMatrix;
-
-  /// The iterator to store D to shared memory.
-  typedef TileStoreIterator<SharedStoreTileTraits,
-                            typename SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared,
-                            Index_,
-                            WmmaMatrix,
-                            IteratorFragment::kWmmaMatrix>
-      SharedStoreIteratorD;
-
-  /// The shared store transformer for D.
-  typedef Copy<typename SharedStoreIteratorD::Fragment> SharedStoreTransformerD;
-
-  /// The traits class to build the iterator to load from shared memory for D.
-  typedef WmmaGemmSharedLoadTileDTraits<
-      // The pointer.
-      typename Functor::Scalar,
-      // The tile size.
-      typename SharedStoreIteratorD::Tile,
-      // The number of threads.
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDS.
-      GemmConfig_::kScalarsPerLdsD>
-      SharedLoadTileTraits;
-
-  /// The iterator to load D from shared memory.
-  typedef TileLoadIterator<SharedLoadTileTraits,
-                           typename SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared>
-      SharedLoadIteratorD;
-
-  /// The traits class to build the iterator to load data from global memory for C^N.
-  typedef WmmaGemmGlobalIteratorCdTraits<
-      // The pointer is float const.
-      typename GemmConfig_::ScalarC const,
-      // The tile has size (N / Iterations)xM in GEMM's terminology.
-      Shape<1,
-            GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
-            GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgC>
-      GlobalLoadTileTraits;
-
-  /// The iterator to load C.
-  typedef WmmaGemmGlobalIteratorCd<GlobalLoadTileTraits, Index_> GlobalLoadIteratorC;
-  /// The transformer for C.
-  typedef Copy<typename GlobalLoadIteratorC::Fragment> GlobalTransformerC;
-
-  /// The traits class to build the iterator to store data to global memory for D^N.
-  typedef WmmaGemmGlobalIteratorCdTraits<
-      // The pointer is float.
-      typename GemmConfig_::ScalarD,
-      // The tile has size (N / Iterations)xM in GEMM's terminology.
-      Shape<1,
-            GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
-            GemmConfig_::OutputTile::kW>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
-      // The number of scalars per STG (STG.32 or STG.128, etc).
-      GemmConfig_::kScalarsPerStgD>
-      GlobalStoreTileTraits;
-
-  /// The iterator to store D.
-  typedef WmmaGemmGlobalIteratorCd<GlobalStoreTileTraits, Index_> GlobalStoreIteratorD;
-  /// The transformer for D.
-  typedef Copy<typename GlobalStoreIteratorD::Fragment> GlobalTransformerD;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
-
-#endif  // defined CUTLASS_USE_WMMA_API
--- a/cutlass/gemm/wmma_gemm_global_tile.h
+++ b/cutlass/gemm/wmma_gemm_global_tile.h
@ -1,203 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines tile iterator traits for loading thread block-level tile from global memory.
-*/
-#pragma once
-
-#include <cutlass/gemm/gemm_global_tile.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Tile_, typename Threads_, int kAccessSize_>
-struct WmmaGemmGlobalIteratorCdTraits : public GemmGlobalTileTraits<GemmOperand::kC,
-                                                                    MatrixLayout::kColumnMajor,
-                                                                    Scalar_,
-                                                                    Tile_,
-                                                                    Threads_,
-                                                                    kAccessSize_> {
-  /// The base class.
-  typedef GemmGlobalTileTraits<GemmOperand::kC,
-                               MatrixLayout::kColumnMajor,
-                               Scalar_,
-                               Tile_,
-                               Threads_,
-                               kAccessSize_>
-      Base;
-
-  /// Override the strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> Delta;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int thread_offset_h = threadIdx.x / Base::Threads::kW;
-      int thread_offset_w = threadIdx.x % Base::Threads::kW * Base::ThreadsDelta::kW;
-
-      return make_Coord(0, thread_offset_h, thread_offset_w, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename TileTraits_, typename Index_ = int>
-struct WmmaGemmGlobalIteratorCd : public TileIteratorBase<TileTraits_,
-                                                          typename TileTraits_::Scalar,
-                                                          IteratorAdvance::kH,
-                                                          MemorySpace::kGlobal,
-                                                          Index_> {
-  /// This class.
-  typedef WmmaGemmGlobalIteratorCd<TileTraits_, Index_> This_;
-  /// The traits.
-  typedef TileTraits_ Traits;
-  /// The base class.
-  typedef TileIteratorBase<Traits,
-                           typename TileTraits_::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kGlobal,
-                           Index_>
-      Base;
-  /// Override the strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> ImmediateOffsetStrides;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
-
-  /// The scalar.
-  typedef typename TileTraits_::Scalar Scalar;
-  /// The pointer.
-  typedef typename TileTraits_::Pointer Pointer;
-  /// The threads.
-  typedef typename TileTraits_::Threads Threads;
-  /// The index.
-  typedef Index_ Index;
-  /// The thread offset functor.
-  typedef typename TileTraits_::ThreadOffset ThreadOffset;
-
-  /// The params.
-  struct Params {
-    /// The pointer.
-    Pointer pointer;
-    /// The stride in the H dimension used to set up the thread in the block.
-    Index stride_h;
-    /// The strides to increment the pointer.
-    Index inc_h, inc_advance;
-    /// The column offset to compute the predicate for the columns.
-    Index predicate_offset;
-    /// The strides to increment the predicate offset.
-    Index predicate_inc_h, predicate_inc_advance;
-
-    /// Setup the params.
-    CUTLASS_HOST_DEVICE int initialize(
-        Pointer pointer, Index ld, Index n, Index epilogue_stride_w, Index epilogue_delta_w) {
-      // The pointer.
-      this->pointer = pointer;
-      // Setup the base stride. One "group of threads" per column.
-      stride_h = ld;
-      // Each thread outputs 1 column per iteration.
-      inc_h = ld * TileTraits_::Threads::kH;
-      inc_advance = inc_h + epilogue_stride_w;
-
-      predicate_offset = n;
-      predicate_inc_h = TileTraits_::Threads::kH;
-      predicate_inc_advance = predicate_inc_h + epilogue_delta_w;
-
-      // It worked.
-      return 0;
-    }
-  };
-
-  Params params;
-
-  Coord<4> thread_offset;
-
-  /// Ctor.
-  CUTLASS_DEVICE WmmaGemmGlobalIteratorCd() {}
-
-  /// Ctor.
-  CUTLASS_DEVICE WmmaGemmGlobalIteratorCd(Params const& params,
-                                          const Coord<3>& bounds,
-                                          const Coord<3>& block,
-                                          int const pointer_offset = 0,
-                                          int const pred_offset = 0,
-                                          ThreadOffset thread_offset_func = ThreadOffset())
-
-      : params(params) {
-    thread_offset = thread_offset_func();
-    // Each warp works on a different column of the tile.
-    int const h = thread_offset[1] + block[1];
-    // Each lane writes a different element.
-    int const w = thread_offset[2] + block[2];
-    // Setup the pointer.
-    this->params.pointer += ((h * params.stride_h + w) + pointer_offset);
-
-    // Prepare the vector of predicates.
-    for (int i = 0; i < Base::Iterations::kW; ++i) {
-      predicates.set(i, w + i * Base::Delta::kW < bounds[2]);
-    }
-    this->params.predicate_offset -= (h + pred_offset);
-  }
-
-  /// Increment the pointer in the C dimension.
-  CUTLASS_DEVICE void inc_c() {}
-  /// Increment the pointer in the W dimension.
-  CUTLASS_DEVICE void inc_w() {}
-  /// Increment the pointer in the H dimension.
-  CUTLASS_DEVICE void inc_h() {
-    params.pointer += params.inc_h;
-    params.predicate_offset -= params.predicate_inc_h;
-  }
-  /// Increment the pointer in the D dimension.
-  CUTLASS_DEVICE void inc_d() {}
-  /// Increment the pointer to move to the next iteration.
-  CUTLASS_DEVICE void inc_advance() {
-    params.pointer += params.inc_advance;
-    params.predicate_offset -= params.predicate_inc_advance;
-  }
-
-  /// Test the predicate.
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
-    return predicates.at(w) && params.predicate_offset > 0;
-  }
-
-  /// Returns the raw pointer
-  CUTLASS_HOST_DEVICE
-  Pointer data() { return params.pointer; }
-
-  CUTLASS_HOST_DEVICE
-  Pointer const data() const { return params.pointer; }
-
-  /// The predicates for the row.
-  cutlass::PredicateVector<Base::Iterations::kW> predicates;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
--- a/cutlass/gemm/wmma_gemm_multiply_add.h
+++ b/cutlass/gemm/wmma_gemm_multiply_add.h
@ -1,108 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Implements warp-level matrix multiply-accumulate operation using CUDA WMMA API.
-*/
-#pragma once
-
-#include <cutlass/wmma_matrix.h>
-#ifdef CUTLASS_USE_WMMA_API
-#include <cutlass/fragment.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MatrixLayout::Kind kLayoutA_,
-          typename ScalarA_,
-          MatrixLayout::Kind kLayoutB_,
-          typename ScalarB_,
-          MatrixLayout::Kind kLayoutC_,
-          typename ScalarC_,
-          typename AccumulatorsPerWarp_,
-          typename InstructionShape_>
-struct WmmaGemmMultiplyAdd {
-  /// The shape of the instruction.
-  typedef InstructionShape_ InstructionShape;
-  /// The number of threads per warp. That's a dummy configuration.
-  typedef Shape<1, InstructionShape_::kH, InstructionShape_::kW> ThreadsPerWarp;
-  /// The dimensions.
-  typedef AccumulatorsPerWarp_ AccumulatorsPerWarp;
-  /// The type for A.
-  typedef ScalarA_ ScalarA;
-  /// The type for B.
-  typedef ScalarB_ ScalarB;
-  /// The type for C and D.
-  typedef ScalarC_ ScalarC;
-  /// The number of iterations.
-  typedef typename ShapeDiv<AccumulatorsPerWarp, InstructionShape>::Shape Iterations;
-
-  /// The element for A.
-  typedef WmmaMatrix<GemmOperand::kA, kLayoutA_, ScalarA, InstructionShape> ElementA;
-  /// The fragment for A.
-  typedef Fragment<ElementA, Iterations::kW> FragmentA;
-
-  /// The element for B.
-  typedef WmmaMatrix<GemmOperand::kB, kLayoutB_, ScalarB, InstructionShape> ElementB;
-  /// The fragment for B.
-  typedef Fragment<ElementB, Iterations::kH> FragmentB;
-
-  /// The element for C.
-  typedef WmmaMatrix<GemmOperand::kC, kLayoutC_, ScalarC, InstructionShape> ElementC;
-  /// The fragment for C.
-  typedef Fragment<ElementC, Iterations::kH * Iterations::kW> Accumulators;
-
-  /// Ctor.
-  CUTLASS_DEVICE WmmaGemmMultiplyAdd() {}
-
-  /// Multiply-add : d = a*b + c.
-  CUTLASS_DEVICE void multiply_add(FragmentA const& a,
-                                   FragmentB const& b,
-                                   Accumulators const& c,
-                                   Accumulators& d) {
-    for (int j = 0; j < Iterations::kH; ++j) {
-      for (int i = 0; i < Iterations::kW; ++i) {
-        // The input elements.
-        ElementA const& elt_a = a[i];
-        ElementB const& elt_b = b[j];
-        ElementC const& elt_c = c[j * Iterations::kW + i];
-
-        // The output element.
-        ElementC& elt_d = d[j * Iterations::kW + i];
-
-        // The wmma instruction.
-        nvcuda::wmma::mma_sync(elt_d, elt_a, elt_b, elt_c);
-      }
-    }
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
-
-#endif  // defined CUTLASS_USE_WMMA_API
--- a/cutlass/gemm/wmma_gemm_shared_tile.h
+++ b/cutlass/gemm/wmma_gemm_shared_tile.h
@ -1,240 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines iterator traits for efficiently loading and storing fragments to and from shared
-      memory, specialized for WMMA GEMM.
-*/
-#pragma once
-
-#include <cutlass/wmma_matrix.h>
-#ifdef CUTLASS_USE_WMMA_API
-
-#include <cutlass/gemm/gemm_operand.h>
-#include <cutlass/reshape_tile.h>
-
-namespace cutlass {
-namespace gemm {
-
-template <class>
-struct Debug {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename Tile_,
-          typename Warps_,
-          int kWarpStride_,
-          typename Iterations_,
-          typename Delta_,
-          typename WmmaShape_>
-struct WmmaGemmSharedLoadTileATraits {
-  /// The operand.
-  static GemmOperand::Kind const kOperand = GemmOperand::kA;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = kLayout_;
-  /// The scalar.
-  typedef Scalar_ Scalar;
-  /// The pointer.
-  typedef Scalar const* Pointer;
-  /// The access size
-  static int const kAccessSize = 1;
-  /// The tile with skew.
-  typedef Tile_ Tile;
-  /// The number of warps.
-  typedef Warps_ Warps;
-  /// The warp stride.
-  static int const kWarpStride = kWarpStride_;
-  /// The number of iterations.
-  typedef Iterations_ Iterations;
-  /// The strides between iterations.
-  typedef Delta_ Delta;
-  /// The strides between iterations.
-  typedef Delta_ ImmediateOffsetStrides;
-  /// The shape of the WMMA instruction.
-  typedef WmmaShape_ WmmaShape;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-  /// ThreadOffset
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // The warp id.
-      int const warp = threadIdx.x / kWarpSize;
-      // The offset.
-      int const offset = warp % Warps::kW * kWarpStride;
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename Tile_,
-          typename Warps_,
-          int kWarpStride_,
-          typename Iterations_,
-          typename Delta_,
-          typename WmmaShape_>
-struct WmmaGemmSharedLoadTileBTraits {
-  /// The operand.
-  static GemmOperand::Kind const kOperand = GemmOperand::kB;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = kLayout_;
-  /// The scalar.
-  typedef Scalar_ Scalar;
-  /// The pointer.
-  typedef Scalar const* Pointer;
-  /// The access size
-  static int const kAccessSize = 1;
-  /// The tile with skew.
-  typedef Tile_ Tile;
-  /// The number of warps.
-  typedef Warps_ Warps;
-  /// The warp stride.
-  static int const kWarpStride = kWarpStride_;
-  /// The number of iterations.
-  typedef Iterations_ Iterations;
-  /// The strides between iterations.
-  typedef Delta_ Delta;
-  /// The strides between iterations.
-  typedef Delta_ ImmediateOffsetStrides;
-  /// The shape of the WMMA instruction.
-  typedef WmmaShape_ WmmaShape;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-  /// ThreadOffset
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // The warp id.
-      int const warp = threadIdx.x / kWarpSize;
-      // The offset.
-      int const offset = warp / Warps::kW * kWarpStride;
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename OutputTile_,
-          typename Warps_,
-          typename WmmaShape_,
-          int kSkew_ = 0>
-struct WmmaGemmSharedStoreTileDTraits {
-  /// The operand.
-  static GemmOperand::Kind const kOperand = GemmOperand::kC;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = kLayout_;
-  /// The scalar.
-  typedef Scalar_ Scalar;
-  // The access size
-  static int const kAccessSize = 1;
-  /// The pointer.
-  typedef Scalar* Pointer;
-  /// The number of warps.
-  typedef Warps_ Warps;
-  /// The shape of the WMMA instruction.
-  typedef WmmaShape_ WmmaShape;
-  /// The skew.
-  static int const kSkew = kSkew_;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-  /// The tile with skew.
-  typedef Shape<1, Warps_::kH * WmmaShape_::kH, OutputTile_::kW + kSkew_> Tile;
-  /// The number of iterations needed to store the tile.
-  typedef Shape<1, 1, OutputTile_::kW / Warps::kW / WmmaShape_::kW> Iterations;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Warps::kW * WmmaShape_::kW, 0> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, 0, Warps::kW * WmmaShape_::kW, 0> ImmediateOffsetStrides;
-
-  /// ThreadOffset
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // The warp id.
-      int const warp = threadIdx.x / kWarpSize;
-      // The starting row (h).
-      int const h = warp / Warps::kW * WmmaShape::kH;
-      // The w.
-      int const w = warp % Warps::kW * WmmaShape::kW;
-      // The offset.
-      int const offset = h * Tile::kW + w;
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerLds_>
-struct WmmaGemmSharedLoadTileDTraits {
-  /// The scalar.
-  typedef Scalar_ Scalar;
-  /// The pointer.
-  typedef Scalar const* Pointer;
-  /// The access size
-  static int const kAccessSize = kScalarsPerLds_;
-  /// The tile.
-  typedef typename ReshapeTile<Tile_, kScalarsPerLds_>::Tile Tile;
-  /// The threads.
-  typedef typename ReshapeThreads<Tile, Threads_>::Threads Threads;
-  /// The threads strides.
-  typedef Shape<1, Tile::kW * Tile::kC, Tile::kC> ThreadsStrides;
-  /// The memory space.
-  static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
-
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kScalarsPerLds_> Delta;
-  /// The strides in each dimension between different loads/stores.
-  typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kScalarsPerLds_>
-      ImmediateOffsetStrides;
-  /// The number of iterations needed to load/store the tile.
-  typedef Shape<1, Tile::kH / Threads::kH, Tile::kW / Threads::kW, Tile::kC / kScalarsPerLds_>
-      Iterations;
-
-  /// ThreadOffset
-  struct ThreadOffset {
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      // The offset.
-      int const offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
-      return make_Coord(0, 0, offset, 0);
-    }
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
-
-#endif  // defined CUTLASS_USE_WMMA_API
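Reviewer note: the `ThreadOffset` functor of the removed `WmmaGemmSharedStoreTileDTraits` maps a thread to the start of its warp's WMMA sub-tile inside the skewed shared-memory tile. The sketch below restates that arithmetic as a host-side function so it can be checked by hand. `StoreDParams`, `kWarpSizeSketch`, and `store_d_thread_offset` are hypothetical names introduced for illustration only, and the warp size of 32 is an assumption; none of them are part of CUTLASS.

```cpp
#include <cassert>

// Hypothetical plain-integer stand-ins for the compile-time values used by
// WmmaGemmSharedStoreTileDTraits::ThreadOffset in the removed header.
struct StoreDParams {
  int warps_w;  // Warps::kW: number of warps along the W dimension
  int wmma_h;   // WmmaShape::kH
  int wmma_w;   // WmmaShape::kW
  int tile_w;   // Tile::kW, i.e. OutputTile::kW + kSkew
};

// Assumed warp size of 32 threads (kWarpSize in the removed code).
const int kWarpSizeSketch = 32;

// Mirrors the arithmetic of ThreadOffset::operator()(): every thread of a warp
// gets the same offset, which points at that warp's WMMA sub-tile.
inline int store_d_thread_offset(int thread_idx, StoreDParams const &p) {
  int const warp = thread_idx / kWarpSizeSketch;
  int const h = warp / p.warps_w * p.wmma_h;  // starting row of the sub-tile
  int const w = warp % p.warps_w * p.wmma_w;  // starting column of the sub-tile
  return h * p.tile_w + w;                    // linear offset into the skewed tile
}
```

With two warps along W, a 16x16 WMMA shape, and a tile width of 136 (128 plus a skew of 8), warp 1 starts at offset 16 and warp 2 starts at offset 16 * 136 = 2176.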
--- a/cutlass/gemm/wmma_gemm_traits.h
+++ b/cutlass/gemm/wmma_gemm_traits.h
@ -1,574 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines structural properties of GEMM targeting WMMA API in CUDA.
-*/
-#pragma once
-
-#include <cutlass/wmma_matrix.h>
-#ifdef CUTLASS_USE_WMMA_API
-
-#include <cutlass/convert.h>
-#include <cutlass/gemm/gemm.h>
-#include <cutlass/gemm/gemm_epilogue.h>
-#include <cutlass/gemm/gemm_epilogue_traits.h>
-#include <cutlass/gemm/gemm_global_tile.h>
-#include <cutlass/gemm/gemm_shared_tile.h>
-#include <cutlass/gemm/gemm_traits.h>
-#include <cutlass/gemm/wmma_gemm_epilogue_traits.h>
-#include <cutlass/gemm/wmma_gemm_global_tile.h>
-#include <cutlass/gemm/wmma_gemm_multiply_add.h>
-
-namespace cutlass {
-namespace gemm {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_,
-    /// The output type.
-    typename ScalarC_,
-    /// The accumulator type.
-    typename Accumulator_,
-    /// The number of accumulators per warp.
-    typename AccumulatorsPerWarp_,
-    /// The shape of the WMMA instruction.
-    typename InstructionShape_,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_>
-struct WmmaGemmConfig : public GemmConfig<
-                            /// The scalar type for A.
-                            half,
-                            /// The scalar type for B.
-                            half,
-                            /// The scalar type for C.
-                            ScalarC_,
-                            /// The scalar type for D.
-                            ScalarC_,
-                            /// The tile size for the GEMM KxNxM.
-                            OutputTile_,
-                            /// The functor to do the math in the main loop.
-                            WmmaGemmMultiplyAdd<kLayoutA_,
-                                                half,
-                                                kLayoutB_,
-                                                half,
-                                                MatrixLayout::kColumnMajor,
-                                                Accumulator_,
-                                                AccumulatorsPerWarp_,
-                                                InstructionShape_>,
-                            /// The number of scalars per LDG for A.
-                            kScalarsPerLdgA_,
-                            /// The number of scalars per STS for A.
-                            kScalarsPerLdgA_,
-                            /// The number of scalars per LDS for A.
-                            8,
-                            /// The number of scalars per LDG for B.
-                            kScalarsPerLdgB_,
-                            /// The number of scalars per STS for B.
-                            kScalarsPerLdgB_,
-                            /// The number of scalars per LDS for B.
-                            8,
-                            /// The number of scalars per LDG for C and STG for D.
-                            16 / sizeof(ScalarC_),
-                            /// The number of scalars per STS for D.
-                            16 / sizeof(ScalarC_),
-                            /// The number of scalars per LDS for D.
-                            16 / sizeof(ScalarC_),
-                            /// The number of stages in shared memory.
-                            1> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperA {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_>
-    : public GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> Base;
-
-  /// The skew.
-  static int const kSkew = 16 / sizeof(typename Base::MultiplyAddScalar);
-  /// The shared tile size.
-  typedef Shape<GemmConfig_::kStages,
-                GemmConfig_::OutputTile::kD,
-                GemmConfig_::OutputTile::kW + kSkew>
-      Tile;
-
-  /// WMMA matrix
-  typedef WmmaMatrix<GemmOperand::kA,
-                     MatrixLayout::kColumnMajor,
-                     typename Base::MultiplyAddScalar,
-                     typename GemmConfig_::InstructionShape>
-      WmmaMatrix;
-
-  /// The traits class to build the iterator to store data to shared memory for A^N.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer.
-      typename Base::MultiplyAddScalar,
-      // The tile has size KxM in GEMM's terminology.
-      Tile,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename Base::GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsA>
-      SharedStoreTileTraits;
-
-  /// The number of elements loaded in one LDG.
-  static int const kScalarsPerW = GemmConfig_::InstructionShape::kW * GemmConfig_::Warps::kW;
-  /// The number of scalars loaded per iteration.
-  static int const kScalarsPerIteration = Tile::kW * GemmConfig_::InstructionShape::kD;
-  /// The traits class to build the iterator to load from shared memory for A.
-  typedef WmmaGemmSharedLoadTileATraits<
-      // The layout of the matrix.
-      MatrixLayout::kColumnMajor,
-      // The pointer.
-      typename Base::MultiplyAddScalar,
-      // The output tile size.
-      Tile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The strides between warps.
-      GemmConfig_::InstructionShape::kW,
-      // The number of iterations to load the data.
-      Shape<1, 1, GemmConfig_::OutputTile::kW / kScalarsPerW>,
-      // The stride between iterations.
-      Shape<kScalarsPerIteration, 0, kScalarsPerW, 0>,
-      // The shape of the instruction.
-      typename GemmConfig_::InstructionShape>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarA Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
-
-  /// WMMA matrix
-  typedef WmmaMatrix<GemmOperand::kA,
-                     MatrixLayout::kRowMajor,
-                     MultiplyAddScalar,
-                     typename GemmConfig_::InstructionShape>
-      WmmaMatrix;
-
-  /// The traits class to build the iterator to load data from global memory for A^T.
-  typedef GemmGlobalTileTraits<
-      // That's A.
-      GemmOperand::kA,
-      // A is row-major.
-      MatrixLayout::kRowMajor,
-      // The pointee type is Scalar const.
-      Scalar const,
-      // The tile has size KxM in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgA>
-      GlobalTileTraits;
-
-  /// The skew.
-  static int const kSkew = 16 / sizeof(MultiplyAddScalar);
-  /// The tile.
-  typedef Shape<GemmConfig_::kStages,
-                GemmConfig_::OutputTile::kW,
-                GemmConfig_::OutputTile::kD + kSkew>
-      Tile;
-
-  /// The traits class to build the iterator to store data to shared memory for A^T.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer.
-      MultiplyAddScalar,
-      // The tile has size KxM in GEMM's terminology.
-      Tile,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsA>
-      SharedStoreTileTraits;
-
-  /// The number of elements loaded in one LDG.
-  static int const kScalarsPerW = GemmConfig_::InstructionShape::kW * GemmConfig_::Warps::kW;
-  /// The traits class to build the iterator to load from shared memory for A.
-  typedef WmmaGemmSharedLoadTileATraits<
-      // The layout of the matrix.
-      MatrixLayout::kRowMajor,
-      // The pointer.
-      MultiplyAddScalar,
-      // The tile in shared memory.
-      Tile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The strides between warps.
-      GemmConfig_::InstructionShape::kW * Tile::kW,
-      // The number of iterations to load the data.
-      Shape<1, 1, GemmConfig_::OutputTile::kW / kScalarsPerW>,
-      // The stride between iterations.
-      Shape<GemmConfig_::InstructionShape::kD, 0, kScalarsPerW * Tile::kW>,
-      // The shape of the instruction.
-      typename GemmConfig_::InstructionShape>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperB {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_>
-    : public GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
-  /// The base config.
-  typedef GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> Base;
-
-  /// The skew.
-  static int const kSkew = 16 / sizeof(typename Base::MultiplyAddScalar);
-  /// The shared tile size.
-  typedef Shape<GemmConfig_::kStages,
-                GemmConfig_::OutputTile::kD,
-                GemmConfig_::OutputTile::kH + kSkew>
-      Tile;
-
-  /// WMMA matrix
-  typedef WmmaMatrix<GemmOperand::kB,
-                     MatrixLayout::kRowMajor,
-                     typename Base::MultiplyAddScalar,
-                     typename GemmConfig_::InstructionShape>
-      WmmaMatrix;
-
-  /// The traits class to build the iterator to store data to shared memory for B^T.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer.
-      typename Base::MultiplyAddScalar,
-      // The tile has size KxN in GEMM's terminology.
-      Tile,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename Base::GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsB>
-      SharedStoreTileTraits;
-
-  /// The number of elements loaded in one LDG.
-  static int const kScalarsPerW = GemmConfig_::InstructionShape::kH * GemmConfig_::Warps::kH;
-  /// The number of scalars loaded per iteration.
-  static int const kScalarsPerIteration = Tile::kW * GemmConfig_::InstructionShape::kD;
-  /// The traits class to build the iterator to load from shared memory for B.
-  typedef WmmaGemmSharedLoadTileBTraits<
-      // The layout of the matrix.
-      MatrixLayout::kRowMajor,
-      // The pointer.
-      typename Base::MultiplyAddScalar,
-      // The output tile size.
-      Tile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The strides between warps.
-      GemmConfig_::InstructionShape::kH,
-      // The number of iterations to load the data.
-      Shape<1, 1, GemmConfig_::OutputTile::kH / kScalarsPerW>,
-      // The stride between iterations.
-      Shape<kScalarsPerIteration, 0, kScalarsPerW, 0>,
-      // The shape of the instruction.
-      typename GemmConfig_::InstructionShape>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename GemmConfig_>
-struct WmmaGemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
-
-  /// The input scalar.
-  typedef typename GemmConfig_::ScalarB Scalar;
-  /// The scalar stored in shared memory.
-  typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
-
-  /// WMMA matrix
-  typedef WmmaMatrix<GemmOperand::kB,
-                     MatrixLayout::kColumnMajor,
-                     MultiplyAddScalar,
-                     typename GemmConfig_::InstructionShape>
-      WmmaMatrix;
-
-  /// The traits class to build the iterator to load data from global memory for B^N.
-  typedef GemmGlobalTileTraits<
-      // That's B.
-      GemmOperand::kB,
-      // B is column-major.
-      MatrixLayout::kColumnMajor,
-      // The pointee type is Scalar const.
-      Scalar const,
-      // The tile has size KxN in GEMM's terminology.
-      Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
-      // The number of scalars per LDG (LDG.32 or LDG.128, etc).
-      GemmConfig_::kScalarsPerLdgB>
-      GlobalTileTraits;
-
-  /// The skew.
-  static int const kSkew = 16 / sizeof(MultiplyAddScalar);
-  /// The tile.
-  typedef Shape<GemmConfig_::kStages,
-                GemmConfig_::OutputTile::kH,
-                GemmConfig_::OutputTile::kD + kSkew>
-      Tile;
-
-  /// The traits class to build the iterator to store data to shared memory for B^N.
-  typedef GemmSharedStoreTileAbTraits<
-      // The pointer.
-      MultiplyAddScalar,
-      // The tile has size KxN in GEMM's terminology.
-      Tile,
-      // The threads are distributed as warps x 32 (the traits may reorganize).
-      typename GlobalTileTraits::Threads,
-      // The number of scalars per STS (STS.32 or STS.128, etc).
-      GemmConfig_::kScalarsPerStsB>
-      SharedStoreTileTraits;
-
-  /// The number of elements loaded in one LDG.
-  static int const kScalarsPerW = GemmConfig_::InstructionShape::kH * GemmConfig_::Warps::kH;
-  /// The traits class to build the iterator to load from shared memory for B.
-  typedef WmmaGemmSharedLoadTileBTraits<
-      // The layout of the matrix.
-      MatrixLayout::kColumnMajor,
-      // The pointer.
-      MultiplyAddScalar,
-      // The tile in shared memory.
-      Tile,
-      // The number of warps.
-      typename GemmConfig_::Warps,
-      // The strides between warps.
-      GemmConfig_::InstructionShape::kH * Tile::kW,
-      // The number of iterations to load the data.
-      Shape<1, 1, GemmConfig_::OutputTile::kH / kScalarsPerW>,
-      // The stride between iterations.
-      Shape<GemmConfig_::InstructionShape::kD, 0, kScalarsPerW * Tile::kW>,
-      // The shape of the instruction.
-      typename GemmConfig_::InstructionShape>
-      SharedLoadTileTraits;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The output tile.
-    typename OutputTile_,
-    /// The output type.
-    typename ScalarC_,
-    /// The accumulator type.
-    typename Accumulator_,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_,
-    /// The number of accumulators per warp.
-    typename AccumulatorsPerWarp_,
-    /// The shape of the WMMA instruction.
-    typename InstructionShape_,
-    /// The number of halfs loaded in one LDG for A.
-    int kScalarsPerLdgA_,
-    /// The number of halfs loaded in one LDG for B.
-    int kScalarsPerLdgB_,
-    /// The index.
-    typename Index_>
-struct WmmaGemmTraitsHelper {
-  /// The WMMA GEMM config.
-  typedef WmmaGemmConfig<kLayoutA_,
-                         kLayoutB_,
-                         OutputTile_,
-                         ScalarC_,
-                         Accumulator_,
-                         AccumulatorsPerWarp_,
-                         InstructionShape_,
-                         kScalarsPerLdgA_,
-                         kScalarsPerLdgB_>
-      GemmConfig;
-
-  /// The GEMM config for A.
-  typedef WmmaGemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
-  /// The GEMM config for B.
-  typedef WmmaGemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
-
-  /// The iterator to load A from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
-      GlobalLoadIteratorA;
-  /// The default transformer for A.
-  typedef Copy<typename GlobalLoadIteratorA::Fragment> GlobalTransformerA;
-  /// The iterator to store A to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorA;
-  /// The stream to load A from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
-      GlobalLoadStreamA;
-
-  /// The iterator to load B from global memory.
-  typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
-      GlobalLoadIteratorB;
-  // The default transformer for B.
-  typedef Copy<typename GlobalLoadIteratorB::Fragment> GlobalTransformerB;
-  /// The iterator to store B to shared memory.
-  typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
-                            typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
-                            IteratorAdvance::kH,
-                            MemorySpace::kShared>
-      SharedStoreIteratorB;
-  /// The stream to load B from global memory to shared memory.
-  typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
-      GlobalLoadStreamB;
-
-  /// The iterator to load A from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared,
-                           Index_,
-                           typename GemmTileTraitsHelperA::WmmaMatrix,
-                           IteratorFragment::kWmmaMatrix>
-      SharedLoadIteratorA;
-  /// The stream to load A from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
-  /// The iterator to load B from shared memory.
-  typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
-                           typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
-                           IteratorAdvance::kH,
-                           MemorySpace::kShared,
-                           Index_,
-                           typename GemmTileTraitsHelperB::WmmaMatrix,
-                           IteratorFragment::kWmmaMatrix>
-      SharedLoadIteratorB;
-  /// The stream to load B from shared memory.
-  typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
-
-  /// The functor to do the multiply-add in the main loop.
-  typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
-  /// The object to clear accumulators.
-  typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
-
-  /// The helper to create the epilogue traits.
-  typedef WmmaGemmEpilogueTraitsHelper<GemmConfig, EpilogueFunctor_, Index_> EpilogueTraitsHelper;
-  /// The traits class for the epilogue.
-  typedef SimplifiedGemmEpilogueTraits<GemmConfig, EpilogueFunctor_, Index_, EpilogueTraitsHelper>
-      GemmEpilogueTraits;
-  /// The epilogue.
-  typedef GemmEpilogue<GemmEpilogueTraits> Epilogue;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename OutputTile_, typename DefaultShape_ = Shape<64, 32, 64> >
-struct WmmaGemmAccumulatorsPerWarp {
-  typedef typename ShapeMin<OutputTile_, DefaultShape_>::Shape Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <
-    /// The layout for A.
-    MatrixLayout::Kind kLayoutA_,
-    /// The layout for B.
-    MatrixLayout::Kind kLayoutB_,
-    /// The tile size for the GEMM KxNxM.
-    typename OutputTile_ = Shape<64, 128, 128>,
-    /// The output type.
-    typename ScalarC_ = float,
-    /// The functor to do the math in the epilogue.
-    typename EpilogueFunctor_ = LinearScaling<ScalarC_>,
-    /// The accumulator type.
-    typename Accumulator_ = ScalarC_,
-    /// The number of accumulators per warp.
-    typename AccumulatorsPerWarp_ = typename WmmaGemmAccumulatorsPerWarp<OutputTile_>::Shape,
-    /// The shape of the WMMA instruction.
-    typename InstructionShape_ = Shape<16, 16, 16>,
-    /// The number of scalars per LDG for A.
-    int kScalarsPerLdgA_ = 8,
-    /// The number of scalars per LDG for B.
-    int kScalarsPerLdgB_ = 8,
-    /// The index.
-    typename Index_ = int,
-    /// The helper class.
-    typename Helper_ = WmmaGemmTraitsHelper<kLayoutA_,
-                                            kLayoutB_,
-                                            OutputTile_,
-                                            ScalarC_,
-                                            Accumulator_,
-                                            EpilogueFunctor_,
-                                            AccumulatorsPerWarp_,
-                                            InstructionShape_,
-                                            kScalarsPerLdgA_,
-                                            kScalarsPerLdgB_,
-                                            Index_> >
-struct WmmaGemmTraits : public GemmTraits<
-                            // The config.
-                            typename Helper_::GemmConfig,
-                            // The stream to load A from global memory to shared memory.
-                            typename Helper_::GlobalLoadStreamA,
-                            // The stream to load B from global memory to shared memory.
-                            typename Helper_::GlobalLoadStreamB,
-                            // The stream to load A from shared memory.
-                            typename Helper_::SharedLoadStreamA,
-                            // The stream to load B from shared memory.
-                            typename Helper_::SharedLoadStreamB,
-                            // The epilogue.
-                            typename Helper_::Epilogue,
-                            // The block swizzle to reorganize the grid.
-                            IdentityBlockSwizzle,
-                            // The index.
-                            Index_,
-                            // The tool used to clear accumulators.
-                            typename Helper_::ClearAccumulators> {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace gemm
-}  // namespace cutlass
-
-#endif  // defined CUTLASS_USE_WMMA_API
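Reviewer note: the removed tile-traits helpers pad each shared-memory tile with a skew of `16 / sizeof(MultiplyAddScalar)` elements, that is, one 16-byte access worth of scalars. The sketch below works through the resulting tile size for the column-major A case, `Shape<kStages, OutputTile::kD, OutputTile::kW + kSkew>`. `HalfStandIn`, `skew_elements`, and `shared_tile_a_elements` are hypothetical names for illustration, and a 2-byte integer stands in for CUDA's `half` so the code compiles on the host.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 2-byte stand-in for CUDA's half, so the sketch builds without CUDA.
typedef std::uint16_t HalfStandIn;

// Skew in elements, following `kSkew = 16 / sizeof(MultiplyAddScalar)`.
template <typename Scalar>
constexpr int skew_elements() {
  return static_cast<int>(16 / sizeof(Scalar));
}

// Element count of the padded shared tile for column-major A:
// stages x kD x (kW + skew).
template <typename Scalar>
constexpr int shared_tile_a_elements(int stages, int tile_d, int tile_w) {
  return stages * tile_d * (tile_w + skew_elements<Scalar>());
}

// A 2-byte scalar gives a skew of 8 elements; a 4-byte scalar gives 4.
static_assert(skew_elements<HalfStandIn>() == 8, "16 bytes / 2 bytes per element");
static_assert(skew_elements<float>() == 4, "16 bytes / 4 bytes per element");
```

For the default `OutputTile_ = Shape<64, 128, 128>` with one stage, this gives 64 * (128 + 8) = 8704 elements, or 17408 bytes at 2 bytes per element. The same `16 / sizeof(...)` pattern sets the scalars per access for C and D in `WmmaGemmConfig`.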
--- a/cutlass/iterator_access.h
+++ b/cutlass/iterator_access.h
@ -1,325 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Free functions for loading and storing to implementations of tile iterator concepts.
-*/
-#pragma once
-
-#include <cutlass/fragment_load_store.h>
-#include <cutlass/load_store.h>
-#include <cutlass/predicate_vector.h>
-#include <cutlass/shape.h>
-
-namespace cutlass {
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Loads a fragment from an input iterator
-template <typename InputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_load(InputIterator &iterator, Fragment &fragment) {
-  typename InputIterator::FragmentIterator frag_iterator(fragment);
-  for (int d = 0; d < InputIterator::Iterations::kD; ++d) {
-    for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
-      for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
-        for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
-          if (iterator.valid(d, h, w, c)) {
-            int const offset =
-                ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
-                    0, 0, w, c);
-            Load<typename Fragment::Element, InputIterator::Tile::kC, InputIterator::kMemorySpace>::
-                load(reinterpret_cast<typename InputIterator::AccessType &>(
-                         frag_iterator.at(d, h, w, c)),
-                     iterator.data(),
-                     offset);
-          }
-        }
-        if (w < InputIterator::Iterations::kW - 1) {
-          iterator.inc_w();
-        }
-      }
-      if (h < InputIterator::Iterations::kH - 1) {
-        iterator.inc_h();
-      }
-    }
-    if (d < InputIterator::Iterations::kD - 1) {
-      iterator.inc_d();
-    }
-  }
-  iterator.inc_advance();
-}
-
-/// Loads a fragment from a shared memory input iterator
-template <typename InputIterator, typename Fragment>
-CUTLASS_DEVICE void shared_iterator_load(InputIterator &iterator, Fragment &fragment) {
-  typename InputIterator::FragmentIterator frag_iterator(fragment);
-  for (int d = 0; d < InputIterator::Iterations::kD; ++d) {
-    for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
-      for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
-        for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
-          int const offset =
-              ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
-                  d, h, w, c);
-
-          FragmentLoad<InputIterator::kIteratorFragment,
-                       InputIterator::Tile::kC,
-                       typename InputIterator::Scalar,
-                       InputIterator::kMemorySpace,
-                       typename InputIterator::FragmentElement,
-                       InputIterator::Tile::kW>::load(frag_iterator.at(d, h, w, c),
-                                                      iterator.data(),
-                                                      offset);
-        }
-      }
-    }
-  }
-}
-
-/// Loads one depth slice (index d) of a fragment from a shared memory input iterator
-template <typename InputIterator, typename Fragment>
-CUTLASS_DEVICE void shared_iterator_load(InputIterator &iterator, Fragment &fragment, int d) {
-  typename InputIterator::FragmentIterator frag_iterator(fragment);
-  for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
-    for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
-      for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
-        int const offset =
-            ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
-                d, h, w, c);
-
-        FragmentLoad<InputIterator::kIteratorFragment,
-                     InputIterator::Tile::kC,
-                     typename InputIterator::Scalar,
-                     InputIterator::kMemorySpace,
-                     typename InputIterator::FragmentElement,
-                     InputIterator::Tile::kW>::load(frag_iterator.at(0, h, w, c),
-                                                    iterator.data(),
-                                                    offset);
-      }
-    }
-  }
-}
-
-/// Loads a fragment from an input iterator, masked by a predicate iterator
-template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
-                                                      Fragment &fragment,
-                                                      typename InputIterator::Index offset,
-                                                      ConstPredicateAdapter predicate_adapter) {
-  for (int d = 0; d < InputIterator::Iterations::kD; ++d, iterator.inc_d()) {
-    for (int h = 0; h < InputIterator::Iterations::kH; ++h, iterator.inc_h()) {
-      for (int w = 0; w < InputIterator::Iterations::kW; ++w, iterator.inc_w()) {
-        if (predicate_adapter.at(d, h, w, 0)) {
-          int idx = InputIterator::Tile::kC *
-                    (w + InputIterator::Iterations::kW * (h + InputIterator::Iterations::kH * d));
-
-          Load<typename Fragment::Element, InputIterator::Tile::kC, InputIterator::kMemorySpace>::
-              load(reinterpret_cast<typename InputIterator::AccessType &>(fragment[idx]),
-                   iterator.data(),
-                   offset);
-        }
-      }
-    }
-  }
-}
-
-/// Loads a fragment from an input iterator
-template <typename InputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
-                                                      Fragment &fragment,
-                                                      typename InputIterator::Index offset = 0) {
-  TrivialPredicateTileAdapter pred;
-  iterator_load_post_increment(iterator, fragment, offset, pred);
-}
-
-/// Loads a fragment from an input iterator
-template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
-                                                      Fragment &fragment,
-                                                      ConstPredicateAdapter pred_it) {
-  iterator_load_post_increment(iterator, fragment, 0, pred_it);
-}
-
-template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &_iterator,
-                                       Fragment &fragment,
-                                       typename InputIterator::Index offset,
-                                       ConstPredicateAdapter predicate_adapter) {
-  InputIterator iterator(_iterator);
-  iterator_load_post_increment(iterator, fragment, offset, predicate_adapter);
-}
-
-/// Loads a fragment from an input iterator
-template <typename InputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &iterator,
-                                       Fragment &fragment,
-                                       typename InputIterator::Index offset = 0) {
-  TrivialPredicateTileAdapter pred;
-  iterator_load(iterator, fragment, offset, pred);
-}
-
-/// Loads a fragment from an input iterator
-template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &iterator,
-                                       Fragment &fragment,
-                                       ConstPredicateAdapter pred_it) {
-  iterator_load(iterator, fragment, 0, pred_it);
-}
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Stores a fragment to an output iterator
-template <typename OutputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_store(OutputIterator &iterator, Fragment &fragment) {
-  typename OutputIterator::FragmentIterator frag_iterator(fragment);
-  for (int d = 0; d < OutputIterator::Iterations::kD; ++d) {
-    for (int h = 0; h < OutputIterator::Iterations::kH; ++h) {
-      for (int w = 0; w < OutputIterator::Iterations::kW; ++w) {
-        if (iterator.valid(d, h, w, 0)) {
-          int const offset =
-              ComputeOffsetFromStrides<typename OutputIterator::ImmediateOffsetStrides>::get(
-                  d, h, w, 0);
-
-          Store<typename Fragment::Element,
-                OutputIterator::Tile::kC,
-                OutputIterator::kMemorySpace>::
-              store(reinterpret_cast<typename OutputIterator::AccessType &>(
-                        frag_iterator.at(d, h, w, 0)),
-                    iterator.data(),
-                    offset);
-        }
-        if (w < OutputIterator::Iterations::kW - 1) {
-          iterator.inc_w();
-        }
-      }
-      if (h < OutputIterator::Iterations::kH - 1) {
-        iterator.inc_h();
-      }
-    }
-    if (d < OutputIterator::Iterations::kD - 1) {
-      iterator.inc_d();
-    }
-  }
-  iterator.inc_advance();
-}
-
-/// Stores a fragment to a shared memory output iterator
-template <typename OutputIterator, typename Fragment>
-CUTLASS_DEVICE void shared_iterator_store(OutputIterator &iterator, Fragment const &fragment) {
-  typename OutputIterator::FragmentConstIterator frag_iterator(fragment);
-  for (int d = 0; d < OutputIterator::Iterations::kD; ++d) {
-    for (int h = 0; h < OutputIterator::Iterations::kH; ++h) {
-      for (int w = 0; w < OutputIterator::Iterations::kW; ++w) {
-        for (int c = 0; c < OutputIterator::Iterations::kC; ++c) {
-          int const offset =
-              ComputeOffsetFromStrides<typename OutputIterator::ImmediateOffsetStrides>::get(
-                  d, h, w, c);
-
-          FragmentStore<OutputIterator::kIteratorFragment,
-                        OutputIterator::Tile::kC,
-                        typename OutputIterator::Scalar,
-                        OutputIterator::kMemorySpace,
-                        typename OutputIterator::FragmentElement,
-                        OutputIterator::Tile::kW>::store(frag_iterator.at(d, h, w, c),
-                                                         iterator.data(),
-                                                         offset);
-        }
-      }
-    }
-  }
-}
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Stores a fragment to an output iterator, masked by a predicate iterator
-template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
-                                                       Fragment const &fragment,
-                                                       typename OutputIterator::Index offset,
-                                                       ConstPredicateAdapter predicate_adapter) {
-  for (int d = 0; d < OutputIterator::Iterations::kD; ++d, iterator.inc_d()) {
-    for (int h = 0; h < OutputIterator::Iterations::kH; ++h, iterator.inc_h()) {
-      for (int w = 0; w < OutputIterator::Iterations::kW; ++w, iterator.inc_w()) {
-        if (predicate_adapter.at(d, h, w, 0)) {
-          int idx = OutputIterator::Tile::kC *
-                    (w + OutputIterator::Iterations::kW * (h + OutputIterator::Iterations::kH * d));
-
-          Store<typename Fragment::Element,
-                OutputIterator::Tile::kC,
-                OutputIterator::kMemorySpace>::
-              store(reinterpret_cast<typename OutputIterator::AccessType const &>(fragment[idx]),
-                    iterator.data(),
-                    offset);
-        }
-      }
-    }
-  }
-}
-
-/// Stores a fragment to an output iterator
-template <typename OutputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
-                                                       Fragment const &fragment,
-                                                       typename OutputIterator::Index offset = 0) {
-  TrivialPredicateTileAdapter pred;
-  iterator_store_post_increment(iterator, fragment, offset, pred);
-}
-
-/// Stores a fragment to an output iterator
-template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
-                                                       Fragment const &fragment,
-                                                       ConstPredicateAdapter pred_it) {
-  iterator_store_post_increment(iterator, fragment, 0, pred_it);
-}
-
-/// Stores a fragment to an output iterator, masked by a predicate iterator
-template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &_iterator,
-                                        Fragment const &fragment,
-                                        typename OutputIterator::Index offset,
-                                        ConstPredicateAdapter predicate_adapter) {
-  OutputIterator iterator(_iterator);
-  iterator_store_post_increment(iterator, fragment, offset, predicate_adapter);
-}
-
-/// Stores a fragment to an output iterator
-template <typename OutputIterator, typename Fragment>
-CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &iterator,
-                                        Fragment const &fragment,
-                                        typename OutputIterator::Index offset = 0) {
-  TrivialPredicateTileAdapter pred;
-  iterator_store(iterator, fragment, offset, pred);
-}
-
-/// Stores a fragment to an output iterator
-template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
-CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &iterator,
-                                        Fragment const &fragment,
-                                        ConstPredicateAdapter pred_it) {
-  iterator_store(iterator, fragment, 0, pred_it);
-}
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
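The predicated `iterator_load_post_increment` and `iterator_store_post_increment` helpers removed above address the fragment with the linear index `Tile::kC * (w + Iterations::kW * (h + Iterations::kH * d))`. The following is a minimal host-side sketch of that arithmetic only. `IterationShape` and `fragment_index` are hypothetical names introduced for illustration and are not part of CUTLASS.

```cpp
#include <cassert>

// Hypothetical stand-in for InputIterator::Iterations plus Tile::kC.
struct IterationShape {
  int kD;  // iterations along D
  int kH;  // iterations along H
  int kW;  // iterations along W
  int kC;  // elements moved per access (Tile::kC in the removed code)
};

// Index of the first fragment element touched by access (d, h, w).
// W varies fastest, then H, then D; each access covers kC elements.
inline int fragment_index(IterationShape const& s, int d, int h, int w) {
  return s.kC * (w + s.kW * (h + s.kH * d));
}
```

Under these assumptions, consecutive `w` steps are `kC` elements apart, and the last access starts exactly `kC` elements before the end of a fragment holding `kD * kH * kW * kC` elements.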
--- a/cutlass/load_store.h
+++ b/cutlass/load_store.h
@ -1,199 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines abstractions for efficiently loading and storing vectors to memory.
-*/
-#pragma once
-
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Enum to specify which memory space data resides in.
-*/
-struct MemorySpace {
-  enum Kind {
-    kGeneric,  // Data accessed through pointer dereferencing
-    kShared,   // Data resides in shared memory
-    kGlobal    // Data resides in global memory
-  };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          int Lanes_,
-          MemorySpace::Kind Memory_,
-          bool = (Lanes_ > 1),
-          size_t = (sizeof(Scalar_) * Lanes_)>
-struct Load {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
-    dst = reinterpret_cast<AccessType const*>(&pointer[offset])[0];
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Load<Scalar_, Lanes_, Memory_, true, 4> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
-    dst.registers[0] = reinterpret_cast<uint32_t const*>(&pointer[offset])[0];
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Load<Scalar_, Lanes_, Memory_, true, 8> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
-    uint2 tmp = reinterpret_cast<uint2 const*>(&pointer[offset])[0];
-    dst.registers[0] = tmp.x;
-    dst.registers[1] = tmp.y;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MemorySpace::Kind Memory_>
-struct Load<double, 2, Memory_, true, 16> {
-  /// The output type.
-  typedef typename Vectorize<double, 2>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& dst, double const* pointer, int offset) {
-    double2 tmp = reinterpret_cast<double2 const*>(&pointer[offset])[0];
-    dst[0] = tmp.x;
-    dst[1] = tmp.y;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Load<Scalar_, Lanes_, Memory_, true, 16> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The load function.
-  static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
-    uint4 tmp = reinterpret_cast<uint4 const*>(&pointer[offset])[0];
-    dst.registers[0] = tmp.x;
-    dst.registers[1] = tmp.y;
-    dst.registers[2] = tmp.z;
-    dst.registers[3] = tmp.w;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_,
-          int Lanes_,
-          MemorySpace::Kind Memory_,
-          bool = (Lanes_ > 1),
-          size_t = (sizeof(Scalar_) * Lanes_)>
-struct Store {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
-    pointer[offset] = src;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Store<Scalar_, Lanes_, Memory_, true, 4> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
-    uint32_t* addr = reinterpret_cast<uint32_t*>(&pointer[offset]);
-    addr[0] = src.registers[0];
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Store<Scalar_, Lanes_, Memory_, true, 8> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
-    uint2* addr = reinterpret_cast<uint2*>(&pointer[offset]);
-    addr[0] = make_uint2(src.registers[0], src.registers[1]);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <MemorySpace::Kind Memory_>
-struct Store<double, 2, Memory_, true, 16> {
-  /// The output type.
-  typedef typename Vectorize<double, 2>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& src, double* pointer, int offset) {
-    double2* addr = reinterpret_cast<double2*>(&pointer[offset]);
-    addr[0] = make_double2(src[0], src[1]);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
-struct Store<Scalar_, Lanes_, Memory_, true, 16> {
-  /// The output type.
-  typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
-
-  /// The store function.
-  static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
-    uint4* addr = reinterpret_cast<uint4*>(&pointer[offset]);
-    addr[0] = make_uint4(src.registers[0], src.registers[1], src.registers[2], src.registers[3]);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
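The `Load` and `Store` templates removed above pick a specialization from a defaulted template parameter equal to the access size in bytes (`sizeof(Scalar_) * Lanes_`). The following host-only sketch shows the same dispatch idea in plain C++. `HostLoad` and `kWords` are hypothetical names, the `bool = (Lanes_ > 1)` parameter and the memory-space parameter are omitted for brevity, and `std::memcpy` stands in for the `reinterpret_cast` to `uint4` used on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical host-side analogue of Load<>: the last parameter defaults to
// the access size in bytes, so partial specializations can match on it.
template <typename Scalar, int Lanes, std::size_t Bytes = sizeof(Scalar) * Lanes>
struct HostLoad {
  enum { kWords = 0 };  // generic path: no word-wide access
  static void load(Scalar* dst, Scalar const* pointer, int offset) {
    for (int i = 0; i < Lanes; ++i) {
      dst[i] = pointer[offset + i];  // element-by-element copy
    }
  }
};

// 16-byte accesses: move the data as four 32-bit words.
template <typename Scalar, int Lanes>
struct HostLoad<Scalar, Lanes, 16> {
  enum { kWords = 4 };
  static void load(Scalar* dst, Scalar const* pointer, int offset) {
    std::uint32_t words[4];
    std::memcpy(words, pointer + offset, sizeof(words));
    std::memcpy(dst, words, sizeof(words));
  }
};
```

With 4-byte `float`, `HostLoad<float, 4>` resolves to the 16-byte specialization while `HostLoad<float, 3>` falls back to the generic copy. The caller never names the byte count explicitly.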
--- a/cutlass/matrix_traits.h
+++ b/cutlass/matrix_traits.h
@ -1,48 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines properties of matrices used to denote layout and operands to GEMM kernels.
-*/
-#pragma once
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Describes layouts of matrices
-struct MatrixLayout {
-  enum Kind { kRowMajor, kColumnMajor };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Gemm operand - D = A * B + C
-struct GemmOperand {
-  enum Kind { kA, kB, kC, kD };
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/predicate_vector.h
+++ b/cutlass/predicate_vector.h
@ -1,493 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines container classes and iterators for managing a statically sized vector
-      of boolean predicates.
-*/
-#pragma once
-
-#include <stdint.h>
-
-#include <cutlass/cutlass.h>
-#include <cutlass/shape.h>
-
-#include <cutlass/util/platform.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup predicate_vector_concept Predicate Vector Concept
-@{
-
-Implementations of \ref predicate_vector_concept contain an ordered set of boolean predicates which
-may be used as conditionals in other device-side operations. Both random access and iterators
-offering sequential access are provided.
-
-@par Predicate Vector
-   A \ref predicate_vector_concept satisfies the following expressions
-  - <b>at(int idx)</b> - returns the value of the indexed predicate
-  - <b>set(int idx, bool value)</b> - sets the value of the indexed predicate
-  - <b>begin()</b> - returns a \ref predicate_iterator_concept pointing to the first predicate
-
-@}
-*/
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup predicate_iterator_concept Predicate Iterator Concept
-@{
-
-Implementations of \ref predicate_iterator_concept enable accessing and traversing elements of a
-bit vector.
-
-@par Const Predicate Iterator
-  A const \ref predicate_iterator_concept satisfies the following expressions
- - <b>++it</b> increments the iterator to the next predicate
- - <b>*it</b> returns the value of the currently pointed-to predicate
-
-@par Mutable Predicate Iterator
- A \ref predicate_iterator_concept that is non-const <b>also</b> satisfies the following expressions
- - <b>it.set(bool value)</b> sets the value of the currently pointed-to predicate
-
-@}
-*/
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup predicate_tile_adapter Predicate Tile Adapter Concept
-@{
-
-Implementations of \ref predicate_tile_adapter provide a mapping between the elements of a \ref
-tile_traits_concept and a \ref predicate_vector_concept.
-
-@par Predicate Tile Adapter
-  A \ref predicate_tile_adapter satisfies the following expressions
- - <b>at(int d, int h, int w, int c)</b> - returns the value of a predicate corresponding to the
-   access (d, h, w, c) within the tile.
-
-@}
-*/
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Statically sized array of bits implementing @concept{predicate_vector_concept}.
-template <
-    /// Number of predicates contained in predicate vector
-    int kPredicates_,
-    /// Number of predicates contained in each byte of internal storage
-    int kPredicatesPerByte_ = 4,
-    /// Location of first predicate within byte of internal storage
-    int kPredicateStart_ = 0>
-struct PredicateVector {
-  /// Number of bits stored by the PredicateVector
-  static int const kPredicates = kPredicates_;
-
-  /// Number of bits stored within each byte of the predicate bit vector
-  static int const kPredicatesPerByte = kPredicatesPerByte_;
-
-  /// First bit within each byte containing predicates
-  static int const kPredicateStart = kPredicateStart_;
-
-  // Make sure no one tries to put more than 8 bits in a byte :)
-  static_assert(kPredicatesPerByte <= 8, "kPredicatesPerByte must fit within an actual byte");
-  // Make sure the offset bits fit in one byte.
-  static_assert(kPredicateStart + kPredicatesPerByte <= 8,
-                "The offset predicates must fit within an actual byte.");
-
-  /// Storage type of individual elements
-  typedef uint32_t Storage;
-
-  /// Number of bytes needed
-  static int const kBytes = (kPredicates + kPredicatesPerByte - 1) / kPredicatesPerByte;
-
-  /// Number of storage elements needed
-  static int const kWordCount = (kBytes + sizeof(Storage) - 1) / sizeof(Storage);
-
- private:
-  //
-  // Data members
-  //
-
-  /// Words of bit vector
-  Storage storageData[kWordCount];
-
-  //
-  // Methods
-  //
-
-  /// Computes the word and bit corresponding to a logical predicate index
-  CUTLASS_HOST_DEVICE void computeStorageOffset(int &word, int &bit, int idx) const {
-    CUTLASS_ASSERT(idx < kPredicates);
-
-    int byte = (idx / kPredicatesPerByte);
-    int bit_offset = (idx % kPredicatesPerByte);
-
-    word = byte / sizeof(Storage);
-    int byte_offset = (byte % sizeof(Storage));
-
-    bit = byte_offset * 8 + bit_offset + kPredicateStart;
-  }
-
-  /// Accesses a given word with optional assertions
-  CUTLASS_HOST_DEVICE Storage &storage(int word) {
-    CUTLASS_ASSERT(word < kWordCount);
-    return storageData[word];
-  }
-
-  /// Accesses a given word with optional assertions
-  CUTLASS_HOST_DEVICE Storage const &storage(int word) const {
-    CUTLASS_ASSERT(word < kWordCount);
-    return storageData[word];
-  }
-
- public:
-  //
-  // Iterator
-  //
-
-  /**
-  * @brief A const iterator implementing \ref predicate_iterator_concept enabling sequential
-  * read-only access to predicates.
-  * @concept{predicate_iterator_concept}
-  */
-  class ConstIterator {
-    /// Reference to PredicateVector instance
-    PredicateVector const &vec_;
-
-    /// Index into PredicateVector
-    int bit_;
-
-   public:
-    /// Copy constructor
-    CUTLASS_HOST_DEVICE
-    ConstIterator(ConstIterator const &it) : vec_(it.vec_), bit_(it.bit_) {}
-
-    /// Constructs an iterator from a PredicateVector
-    CUTLASS_HOST_DEVICE
-    ConstIterator(PredicateVector const &_vec, int _start = 0) : vec_(_vec), bit_(_start) {}
-
-    /// Pre-increment
-    CUTLASS_HOST_DEVICE
-    ConstIterator &operator++() {
-      ++bit_;
-      return *this;
-    }
-
-    /// Pre-decrement
-    CUTLASS_HOST_DEVICE
-    ConstIterator &operator--() {
-      --bit_;
-      return *this;
-    }
-
-    /// Post-increment
-    CUTLASS_HOST_DEVICE
-    ConstIterator operator++(int) {
-      ConstIterator ret(*this);
-      ++bit_;
-      return ret;
-    }
-
-    /// Post-decrement
-    CUTLASS_HOST_DEVICE
-    ConstIterator operator--(int) {
-      ConstIterator ret(*this);
-      --bit_;
-      return ret;
-    }
-
-    /// Returns true if iterators point to the same bit
-    CUTLASS_HOST_DEVICE
-    bool operator==(ConstIterator const &it) const { return bit_ == it.bit_; }
-
-    /// Returns true if iterators point to different bits
-    CUTLASS_HOST_DEVICE
-    bool operator!=(ConstIterator const &it) const { return bit_ != it.bit_; }
-
-    /// Dereferences iterator
-    CUTLASS_HOST_DEVICE
-    bool operator*() const { return vec_[bit_]; }
-  };
-
-  /**
-  * @brief An iterator implementing \ref predicate_iterator_concept enabling sequential
-  * read and write access to predicates.
-  * @concept{predicate_iterator_concept}
-  */
-  class Iterator {
-    /// Reference to PredicateVector instance
-    PredicateVector &vec_;
-
-    /// Index into PredicateVector
-    int bit_;
-
-   public:
-    /// Copy constructor
-    CUTLASS_HOST_DEVICE
-    Iterator(Iterator const &it) : vec_(it.vec_), bit_(it.bit_) {}
-
-    /// Constructs an iterator from a PredicateVector
-    CUTLASS_HOST_DEVICE
-    Iterator(PredicateVector &_vec, int _start = 0) : vec_(_vec), bit_(_start) {}
-
-    /// Pre-increment
-    CUTLASS_HOST_DEVICE
-    Iterator &operator++() {
-      ++bit_;
-      return *this;
-    }
-
-    /// Pre-decrement
-    CUTLASS_HOST_DEVICE
-    Iterator &operator--() {
-      --bit_;
-      return *this;
-    }
-
-    /// Post-increment
-    CUTLASS_HOST_DEVICE
-    Iterator operator++(int) {
-      Iterator ret(*this);
-      ret.bit_++;
-      return ret;
-    }
-
-    /// Post-decrement
-    CUTLASS_HOST_DEVICE
-    Iterator operator--(int) {
-      Iterator ret(*this);
-      ret.bit_--;
-      return ret;
-    }
-
-    /// Returns true if iterators point to the same bit
-    CUTLASS_HOST_DEVICE
-    bool operator==(Iterator const &it) const { return bit_ == it.bit_; }
-
-    /// Returns false if iterators point to the same bit
-    CUTLASS_HOST_DEVICE
-    bool operator!=(Iterator const &it) const { return bit_ != it.bit_; }
-
-    /// Gets the bit at the pointed to location
-    CUTLASS_HOST_DEVICE
-    bool get() { return vec_[bit_]; }
-
-    /// Dereferences iterator
-    CUTLASS_HOST_DEVICE
-    bool operator*() const { return vec_[bit_]; }
-
-    /// Sets the bit at the pointed to location
-    CUTLASS_HOST_DEVICE
-    void set(bool value = true) { vec_.set(bit_, value); }
-  };
-
-  /// Iterator that always returns true
-  struct TrivialIterator {
-    /// Constructor
-    CUTLASS_HOST_DEVICE
-    TrivialIterator() {}
-
-    /// Copy constructor
-    CUTLASS_HOST_DEVICE
-    TrivialIterator(Iterator const &it) {}
-
-    /// Constructs an iterator from a PredicateVector
-    CUTLASS_HOST_DEVICE
-    TrivialIterator(PredicateVector const &_vec) {}
-
-    /// Pre-increment
-    CUTLASS_HOST_DEVICE
-    TrivialIterator &operator++() { return *this; }
-
-    /// Post-increment
-    CUTLASS_HOST_DEVICE
-    TrivialIterator operator++(int) { return *this; }
-
-    /// Dereferences iterator
-    CUTLASS_HOST_DEVICE
-    bool operator*() const { return true; }
-  };
-
- public:
-  //
-  // Methods
-  //
-
-  /// Initialize the predicate vector
-  CUTLASS_HOST_DEVICE PredicateVector(bool value = true) { fill(value); }
-
-  /// Fills all predicates with a given value
-  CUTLASS_HOST_DEVICE void fill(bool value = true) {
-    Storage item = (value ? ~Storage(0) : Storage(0));
-
-    CUTLASS_PRAGMA_UNROLL
-    for (int i = 0; i < kWordCount; ++i) {
-      storage(i) = item;
-    }
-  }
-
-  /// Accesses a bit within the predicate vector.
-  CUTLASS_HOST_DEVICE bool operator[](int idx) const { return at(idx); }
-
-  /// Accesses a bit within the predicate vector.
-  CUTLASS_HOST_DEVICE bool at(int idx) const {
-    int bit, word;
-    computeStorageOffset(word, bit, idx);
-
-    return ((storage(word) >> bit) & 1);
-  }
-
-  /// Set a bit within the predicate vector.
-  CUTLASS_HOST_DEVICE void set(int idx, bool value = true) {
-    int bit, word;
-    computeStorageOffset(word, bit, idx);
-
-    Storage disable_mask = (~(Storage(1) << bit));
-    Storage enable_mask = (Storage(value) << bit);
-
-    storage(word) = ((storage(word) & disable_mask) | enable_mask);
-  }
-
-  /// Computes the intersection of two identical predicate vectors.
-  CUTLASS_HOST_DEVICE PredicateVector &operator&=(PredicateVector const &predicates) {
-    CUTLASS_PRAGMA_UNROLL
-    for (int i = 0; i < kWordCount; ++i) {
-      storage(i) = (storage(i) & predicates.storage(i));
-    }
-    return *this;
-  }
-
-  /// Computes the union of two identical predicate vectors.
-  CUTLASS_HOST_DEVICE PredicateVector &operator|=(PredicateVector const &predicates) {
-    CUTLASS_PRAGMA_UNROLL
-    for (int i = 0; i < kWordCount; ++i) {
-      storage(i) = (storage(i) | predicates.storage(i));
-    }
-    return *this;
-  }
-
-  /// Returns true if entire predicate array is zero.
-  CUTLASS_HOST_DEVICE bool is_zero() const {
-    Storage mask(0);
-    for (int byte = 0; byte < sizeof(Storage); ++byte) {
-      Storage byte_mask = (((1 << kPredicatesPerByte) - 1) << kPredicateStart);
-      mask |= (byte_mask << (byte * 8));
-    }
-    uint32_t result = 0;
-    for (int word = 0; word < kWordCount; ++word) {
-      result |= storage(word);
-    }
-    return result == 0;
-  }
-
-  /// Returns an iterator to the start of the bit vector
-  CUTLASS_DEVICE
-  Iterator begin() { return Iterator(*this); }
-
-  /// Returns an iterator to the end of the bit vector
-  CUTLASS_DEVICE
-  Iterator end() { return Iterator(*this, kPredicates); }
-
-  /// Returns a ConstIterator to the start of the bit vector
-  CUTLASS_DEVICE
-  ConstIterator const_begin() const { return ConstIterator(*this); }
-
-  /// Returns a ConstIterator to the end of the bit vector
-  CUTLASS_DEVICE
-  ConstIterator const_end() const { return ConstIterator(*this, kPredicates); }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Always returns true predicate.
-struct TrivialPredicateTileAdapter {
-  /// Ctor.
-  CUTLASS_HOST_DEVICE TrivialPredicateTileAdapter() {}
-
-  /// The value at location (d, h, w, c).
-  CUTLASS_HOST_DEVICE bool at(int, int, int, int) const { return true; }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to enable random access to predicates via logical coordinate within a tile.
-template <typename PredicateVector_, typename Iterations_>
-struct PredicateTileAdapter {
-  /// The vector of predicates.
-  typedef PredicateVector_ PredicateVector;
-  /// The iterations.
-  typedef Iterations_ Iterations;
-
- private:
-  /// The predicates.
-  PredicateVector &predicates;
-
- public:
-  /// Ctor.
-  CUTLASS_DEVICE PredicateTileAdapter(PredicateVector &predicates_) : predicates(predicates_) {}
-
-  /// Get the value at location (d, h, w, c).
-  CUTLASS_DEVICE bool at(int d, int h, int w, int c) const {
-    int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
-    return predicates.at(bit);
-  }
-
-  /// Set the value at location (d, h, w, c).
-  CUTLASS_DEVICE void set(int d, int h, int w, int c, bool value) {
-    int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
-    predicates.set(bit, value);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to enable random access to predicates via logical coordinate within a tile.
-template <typename PredicateVector_, typename Iterations_>
-struct ConstPredicateTileAdapter {
-  /// The vector of predicates.
-  typedef PredicateVector_ PredicateVector;
-  /// The iterations.
-  typedef Iterations_ Iterations;
-
- private:
-  /// The predicates.
-  PredicateVector const &predicates;
-
- public:
-  /// Ctor.
-  CUTLASS_DEVICE ConstPredicateTileAdapter(PredicateVector const &predicates_)
-      : predicates(predicates_) {}
-
-  /// Get the value at location (d, h, w, c).
-  CUTLASS_DEVICE bool at(int d, int h, int w, int c) const {
-    int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
-    return predicates.at(bit);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/reshape_tile.h
+++ b/cutlass/reshape_tile.h
@ -1,58 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines a type for restructuring a tile.
-*/
-#pragma once
-
-#include <cutlass/shape.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// The following functor reshapes a tile of data. The goal is to have at least kAccessSize in
-// the inner-most dimension. If the user respects that constraint, there is nothing to be done. If
-// that's not the case, this functor will correct that and "extract" the right number of elements
-// from the next dimension.
-
-template <typename Tile_, int kAccessSize_, bool = (Tile_::kC < kAccessSize_)>
-struct ReshapeTile {
-  typedef Tile_ Tile;
-};
-
-template <typename Tile_, int kAccessSize_>
-struct ReshapeTile<Tile_, kAccessSize_, true> {
-  // Make sure the W dimension of the tile is large enough.
-  static_assert(Tile_::kW >= kAccessSize_, "The W dimension is too small");
-  // Make sure the dimension can be divided by the number of scalars.
-  static_assert(Tile_::kW % kAccessSize_ == 0, "Not supported");
-  // Collapse the W dimension.
-  typedef Shape<Tile_::kD, Tile_::kH, Tile_::kW / kAccessSize_, kAccessSize_> Tile;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/shape.h
+++ b/cutlass/shape.h
@ -1,301 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines Shape implementing the Layout concept for representing a 4D hypercube of objects.
-*/
-#pragma once
-
-#include <cutlass/cutlass.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup layout_concept Layout Concept
-* @{
-* @par Implementations of \ref layout_concept are used to describe a cube with DxHxW elements and C
-scalars per element.
- A HxW slice of a cube is called an image and a cube consists of D images.
-*
-* @par Notations
-*   Let Layout be an implementation of the \ref layout_concept.
-*
-* @par Valid Expressions
-* - <b>Layout::D</b> specifies the depth of a cube
-* - <b>Layout::H</b> specifies the height of a cube
-* - <b>Layout::W</b> specifies the width of a cube
-* - <b>Layout::C</b> specifies the number of channels of each element in a cube
-* - <b>Layout::W_c</b> specifies the number of scalars of each row in one image of a cube.
-* - <b>Layout::H_w</b> specifies the number of elements in an image slice.
-* - <b>Layout::H_w_c</b> specifies the number of scalars in an image slice.
-* - <b>Layout::D_h_w</b> specifies the number of elements in a cube.
-* - <b>Layout::D_h_w_c</b> specifies the number of scalars in a cube.
-* - <b>Layout::Strides</b> is a \ref layout_concept specifying the strides.
-* @}
-*/
-
-/**
-* @brief A Shape implementing \ref layout_concept describing the dimensions of a cube.
-* @concept{layout_concept}
-*/
-template <int kD_ = 1, int kH_ = 1, int kW_ = 1, int kC_ = 1>
-struct Shape {
-  /// The depth of the cube.
-  static int const kD = kD_;
-  /// The height of the cube.
-  static int const kH = kH_;
-  /// The width of the cube.
-  static int const kW = kW_;
-  /// The number of scalars per element.
-  static int const kC = kC_;
-};
-
-/**
-* @brief Compute derived counts of a \ref layout_concept based class
-*/
-template <typename Shape>
-struct ShapeCount {
-  /// The number of elements per row.
-  static int const kWc = Shape::kW * Shape::kC;
-  /// The number of pixels per image.
-  static int const kHw = Shape::kH * Shape::kW;
-  /// The number of elements per image.
-  static int const kHwc = Shape::kH * kWc;
-  /// The number of pixels per cube.
-  static int const kDhw = Shape::kD * kHw;
-  /// The number of elements in the 4D space.
-  static int const kDhwc = Shape::kD * kHwc;
-  /// The number of elements in the 4D space.
-  static int const kCount = kDhwc;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, int kScale_>
-struct ShapeScale {
-  typedef Shape<A_::kD * kScale_, A_::kH * kScale_, A_::kW * kScale_, A_::kC * kScale_> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeAdd {
-  typedef Shape<A_::kD + B_::kD, A_::kH + B_::kH, A_::kW + B_::kW, A_::kC + B_::kC> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeSub {
-  typedef Shape<A_::kD - B_::kD, A_::kH - B_::kH, A_::kW - B_::kW, A_::kC - B_::kC> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeMul {
-  typedef Shape<A_::kD * B_::kD, A_::kH * B_::kH, A_::kW * B_::kW, A_::kC * B_::kC> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeDiv {
-  typedef Shape<A_::kD / B_::kD, A_::kH / B_::kH, A_::kW / B_::kW, A_::kC / B_::kC> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeMax {
-  typedef Shape<(A_::kD > B_::kD ? A_::kD : B_::kD),
-                (A_::kH > B_::kH ? A_::kH : B_::kH),
-                (A_::kW > B_::kW ? A_::kW : B_::kW),
-                (A_::kC > B_::kC ? A_::kC : B_::kC)>
-      Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename A_, typename B_>
-struct ShapeMin {
-  typedef Shape<(A_::kD < B_::kD ? A_::kD : B_::kD),
-                (A_::kH < B_::kH ? A_::kH : B_::kH),
-                (A_::kW < B_::kW ? A_::kW : B_::kW),
-                (A_::kC < B_::kC ? A_::kC : B_::kC)>
-      Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Shape_>
-struct ShapeStrides {
-  typedef Shape<Shape_::kH * Shape_::kW * Shape_::kC, Shape_::kW * Shape_::kC, Shape_::kC, 1> Shape;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube
-* @tparam Shape_ \ref layout_concept specifying the dimensions of the cube.
-*/
-template <typename Shape_>
-struct ComputeOffsetFromShape {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
-    // clang-format off
-    return d * Shape_::kH * Shape_::kW * Shape_::kC +
-           h * Shape_::kW * Shape_::kC +
-           w * Shape_::kC +
-           c;
-    // clang-format on
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube with a depth of 1
-* @tparam kSh Elements in the H dimension
-* @tparam kSw Elements in the W dimension
-* @tparam kSc Separation between two elements in "elements"
-*/
-template <int kSh_, int kSw_, int kSc_>
-struct ComputeOffsetFromShape<Shape<1, kSh_, kSw_, kSc_> > {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
-    return h * kSw_ * kSc_ + w * kSc_ + c;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube with one channel and a depth of 1
-* @tparam kSh Elements in the H dimension
-* @tparam kSw Elements in the W dimension
-*/
-template <int kSh_, int kSw_>
-struct ComputeOffsetFromShape<Shape<1, kSh_, kSw_, 1> > {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) { return h * kSw_ + w; }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube
-* @tparam Strides_ \ref layout_concept where each dimension of the cube specifies the corresponding stride.
-*/
-template <typename Strides_>
-struct ComputeOffsetFromStrides {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
-    return d * Strides_::kD + h * Strides_::kH + w * Strides_::kW + c * Strides_::kC;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube with a depth of 1
-* @tparam S_h Stride in the H dimension in scalars
-* @tparam S_w Stride in the W dimension in scalars
-* @tparam S_c Stride between two scalars.
-*/
-template <int S_h_, int S_w_, int S_c_>
-struct ComputeOffsetFromStrides<Shape<1, S_h_, S_w_, S_c_> > {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
-    return h * S_h_ + w * S_w_ + c * S_c_;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Compute the offset for the given coordinates in a cube with one channel and a depth of 1
-* @tparam S_h Stride in the H dimension in scalars
-* @tparam S_w Stride in the W dimension in scalars
-*/
-template <int S_h_, int S_w_>
-struct ComputeOffsetFromStrides<Shape<1, S_h_, S_w_, 1> > {
-  static CUTLASS_DEVICE int get(int d, int h, int w, int c) { return h * S_h_ + w * S_w_; }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief Decompose threadIdx.x into coordinates of a cube whose dimensions are specified by Threads_.
-* Afterwards compute the offset of those coordinates using Strides_
-* @tparam Threads_ The dimension of the cube the threadIdx.x value is mapped on
-* @tparam Strides_ The strides to use when computing the offsets based on the coordinates of the cube.
-*/
-template <typename Threads_, typename Strides_>
-struct ComputeThreadOffsetFromStrides {
-  static CUTLASS_DEVICE int get() {
-    // Decompose the thread index.
-    int c = threadIdx.x % Threads_::kC;
-    int w = threadIdx.x / Threads_::kC % Threads_::kW;
-    int h = threadIdx.x / Threads_::kC / Threads_::kW % Threads_::kH;
-    int d = threadIdx.x / Threads_::kC / Threads_::kW / Threads_::kH;
-
-    // Compute the offset.
-    return d * Strides_::kD + h * Strides_::kH + w * Strides_::kW + c * Strides_::kC;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-/**
-*@brief Specialization for D=1
-*/
-template <int T_h_, int T_w_, int T_c_, int S_h_, int S_w_, int S_c_>
-struct ComputeThreadOffsetFromStrides<Shape<1, T_h_, T_w_, T_c_>, Shape<1, S_h_, S_w_, S_c_> > {
-  static CUTLASS_DEVICE int get() {
-    // Decompose the thread index.
-    int c = threadIdx.x % T_c_;
-    int w = threadIdx.x / T_c_ % T_w_;
-    int h = threadIdx.x / T_c_ / T_w_ % T_h_;
-
-    // Compute the offset.
-    return h * S_h_ + w * S_w_ + c * S_c_;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-*@brief Specialization for D=1 and C=1
-*/
-template <int T_h_, int T_w_, int S_h_, int S_w_>
-struct ComputeThreadOffsetFromStrides<Shape<1, T_h_, T_w_, 1>, Shape<1, S_h_, S_w_, 1> > {
-  static CUTLASS_DEVICE int get() {
-    // Decompose the thread index.
-    int w = threadIdx.x % T_w_;
-    int h = threadIdx.x / T_w_;
-
-    // Compute the offset.
-    return h * S_h_ + w * S_w_;
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/tensor_ref.h
+++ b/cutlass/tensor_ref.h
@ -1,151 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines a structure containing strides, bounds, and a pointer to tensor data.
-*/
-#pragma once
-
-#include <typeinfo>
-
-#include <cutlass/coord.h>
-#include <cutlass/cutlass.h>
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Structure modeling a pointer and stride into a tensor
-template <typename Storage_, int Rank_>
-class TensorRef {
- public:
-  /// Data type of individual access
-  typedef Storage_ Storage;
-
-  /// Rank of tensor
-  static int const Rank = Rank_;
-
- private:
-  //
-  // Data members
-  //
-
-  /// Pointer to storage element
-  Storage* ptr_;
-
-  /// Stride information
-  Coord<Rank> stride_;
-
- public:
-  //
-  // Methods
-  //
-
-  /// Default ctor
-  CUTLASS_HOST_DEVICE
-  TensorRef() : ptr_(nullptr) {}
-
-  /// Constructs from a pointer and stride
-  CUTLASS_HOST_DEVICE
-  TensorRef(Storage* ptr, Coord<Rank> stride) : ptr_(ptr), stride_(stride) {}
-
-  /// Updates the pointer and stride of a TensorRef
-  CUTLASS_HOST_DEVICE
-  void reset(Storage* ptr = nullptr, Coord<Rank> stride = Coord<Rank>(0)) {
-    ptr_ = ptr;
-    stride_ = stride;
-  }
-
-  /// Conversion function
-  template <typename T>
-  TensorRef<T, Rank> convert() {
-    Coord<Rank> converted_stride;
-    for (int i = 0; i < Rank - 1; ++i) {
-      converted_stride[i] = stride_[i] * Extent<Storage>::kValue / Extent<T>::kValue;
-    }
-    converted_stride[Rank - 1] = stride_[Rank - 1];
-
-    return TensorRef<T, Rank>(reinterpret_cast<T*>(ptr_), converted_stride);
-  }
-
-  /// Returns true if the TensorRef may be safely accessed
-  CUTLASS_HOST_DEVICE
-  bool good() const { return ptr_ != nullptr; }
-
-  /// Returns the pointer to referenced data
-  CUTLASS_HOST_DEVICE
-  Storage* data() const { return ptr_; }
-
-  /// Returns the stride of the tensor
-  CUTLASS_HOST_DEVICE
-  Coord<Rank> const& stride() const { return stride_; }
-
-  /// Returns the stride of the tensor in the given dimension
-  CUTLASS_HOST_DEVICE
-  int const& stride(int dim) const { return stride_.at(dim); }
-
-  /// Returns the maximum stride element as the 'leading dimension'
-  CUTLASS_HOST_DEVICE
-  int leading_dim() const { return __NV_STD_MAX(stride_[1], stride_[2]); }
-
-  /// Computes the offset of an index from the origin of the tensor
-  CUTLASS_HOST_DEVICE
-  long long offset(Coord<Rank> const& coord) const {
-    return stride_.template dot<long long>(coord);
-  }
-
-  /// Returns a reference to the element at a given Coord
-  CUTLASS_HOST_DEVICE
-  Storage& at(Coord<Rank> const& coord) const { return ptr_[offset(coord)]; }
-
-  /// Element-wise accessor
-  Storage& operator[](Coord<Rank> const& coord) const { return at(coord); }
-
-  /// Returns a reference to the element at a given Coord
-  CUTLASS_HOST_DEVICE
-  Storage& at(int idx) const { return ptr_[idx]; }
-
-  /// Element-wise accessor
-  Storage& operator[](int idx) const { return at(idx); }
-
-  /// Adds an offset to the pointer
-  CUTLASS_HOST_DEVICE
-  TensorRef& advance(Coord<Rank> const& b) {
-    ptr_ += offset(b);
-    return *this;
-  }
-
-  /// Returns a TensorRef offset by a given amount
-  CUTLASS_HOST_DEVICE
-  TensorRef operator+(Coord<Rank> const& b) const { return TensorRef(ptr_ + offset(b), stride_); }
-
-  /// Returns a TensorRef offset by a given amount
-  CUTLASS_HOST_DEVICE
-  TensorRef operator-(Coord<Rank> const& b) const { return TensorRef(ptr_ - offset(b), stride_); }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/tensor_view.h
+++ b/cutlass/tensor_view.h
@ -1,172 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines a structure containing strides and a pointer to tensor data.
-*/
-
-#pragma once
-
-#include <cmath>
-
-#include <cutlass/cutlass.h>
-#include <cutlass/tensor_ref.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Host-side reference implementation of tensor operations
-template <typename T>
-class TensorView : public TensorRef<T, 4> {
- public:
-  /// Reference and stride
-  typedef TensorRef<T, 4> Base;
-
-  /// Reference and stride
-  typedef Base TensorRef_t;
-
-  /// Reference to constant type
-  typedef TensorRef<T const, 4> ConstTensorRef_t;
-
-  /// Rank of tensor
-  static int const Rank = TensorRef_t::Rank;
-
-  /// Type used to compute the offset of an element to the base of a tensor
-  typedef int Offset_t;
-
-  /// Coordinate into tensor
-  typedef Coord<Rank> Coord_t;
-
- private:
-  //
-  // Data members
-  //
-
-  /// Pointer to pitch-linear memory
-  TensorRef_t ref_;
-
-  /// Dimensions of coordinate (independent of stride)
-  Coord_t size_;
-
- public:
-  //
-  // Device and Host Methods
-  //
-
-  /// Default constructor
-  CUTLASS_HOST_DEVICE
-  TensorView() {}
-
-  /// Constructs a Tensor_view from a TensorRef and size
-  CUTLASS_HOST_DEVICE
-  TensorView(TensorRef_t const& _ref, Coord_t const& _size) : Base(_ref), size_(_size) {}
-
-  /// Returns true if the Tensor_view is bound to some memory
-  CUTLASS_HOST_DEVICE
-  bool good() const { return ref().good(); }
-
-  /// Returns a pointer to data
-  CUTLASS_HOST_DEVICE
-  T* data() const { return ref().data(); }
-
-  /// Updates the reference and size of a Tensor_view object
-  CUTLASS_HOST_DEVICE
-  void reset(TensorRef_t const& _ref = TensorRef_t(0), Coord_t const& _size = Coord_t()) {
-    Base::operator=(_ref);
-    size_ = _size;
-  }
-
-  /// Accesses the tensor reference pointing to data
-  CUTLASS_HOST_DEVICE
-  TensorRef_t& ref() { return *this; }
-
-  /// Returns a tensor reference to constant data
-  CUTLASS_HOST_DEVICE
-  ConstTensorRef_t const_ref() { return ConstTensorRef_t(data(), stride()); }
-
-  /// Accesses the tensor reference pointing to data
-  CUTLASS_HOST_DEVICE
-  TensorRef_t const& ref() const { return *this; }
-
-  /// Accesses the size
-  CUTLASS_HOST_DEVICE
-  Coord_t const& size() const { return size_; }
-
-  /// Accesses the size
-  CUTLASS_HOST_DEVICE
-  int size(int dim) const { return size_.at(dim); }
-
-  /// Accesses the stride
-  CUTLASS_HOST_DEVICE
-  Coord_t const& stride() const { return ref().stride(); }
-
-  /// Accesses the stride
-  CUTLASS_HOST_DEVICE
-  int const& stride(int dim) const { return ref().stride(dim); }
-
-  /// Assigns the Tensor_view
-  CUTLASS_HOST_DEVICE
-  TensorView& operator=(TensorView const& _tensor) {
-    Base::operator=(_tensor._ref);
-    size_ = _tensor.size_;
-    return *this;
-  }
-
-  /// Returns the index of an element
-  CUTLASS_HOST_DEVICE
-  Offset_t offset(Coord_t const& coord) const { return ref().offset(coord); }
-
-  /// Determines whether a location is within a tensor
-  CUTLASS_HOST_DEVICE
-  bool contains(Coord_t const& coord) const {
-    for (int dim = 0; dim < Rank; ++dim) {
-      if (coord.at(dim) >= size_.at(dim)) {
-        return false;
-      }
-    }
-    return true;
-  }
-
-  /// Element-wise accessor
-  CUTLASS_HOST_DEVICE
-  T& at(Coord_t const& coord) const { return ref().at(coord); }
-
-  /// Element-wise accessor
-  T& operator[](Coord<Rank> const& coord) const { return at(coord); }
-
-  /// Element-wise accessor
-  CUTLASS_HOST_DEVICE
-  T& at(Offset_t idx) const { return ref().at(idx); }
-
-  /// Returns a Tensor_view given location and size quantities
-  CUTLASS_HOST_DEVICE
-  TensorView<T> subview(Coord_t const& location, Coord_t size) const {
-    return TensorView<T>(ref() + location, size.clamp(size_ - location));
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
--- a/cutlass/tile_iterator.h
+++ b/cutlass/tile_iterator.h
@ -1,881 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines the Tile Traits concept and iterators for loading and storing to tiles
-   efficiently.
-*/
-#pragma once
-
-#include <cutlass/fragment.h>
-#include <cutlass/load_store.h>
-#include <cutlass/predicate_vector.h>
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup tile_traits_concept Tile Traits Concept
-@{
-
-\ref tile_traits_concept is a type defining the shape of a tile and the distribution of accesses
-by individual entities, either threads or other.
-
-@par Tile Traits Concept
-   Types satisfying \ref tile_traits_concept define the following members
-  - <b>Tile</b> - a type satisfying \ref layout_concept describing the dimensions of the tile
-  - <b>Delta</b> - a type satisfying \ref layout_concept describing the increments between accesses
-along each dimension
-  - <b>Iterations</b> - a type satisfying \ref layout_concept describing the number of accesses
-along each dimension
-  - <b>Offset</b> - the type of a <i>functor</i> computing the offset of each participating entity
-as a Coord<4>.
-@}
-*/
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Specifies dimension in which post-increment accesses advance
-struct IteratorAdvance {
-  enum Kind { kD, kH, kW };
-};
-
-/// Specifies whether iterator storage fragment consists of Scalar values or WMMA matrix
-struct IteratorFragment {
-  enum Kind { kScalar, kWmmaMatrix };
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief A template defining \ref tile_traits_concept
-* @concept{tile_traits_concept}
-*/
-template <typename Tile_, typename Delta_, typename Iterations_, typename ThreadOffset_>
-struct TileTraits {
-  /// Shape of the tile
-  typedef Tile_ Tile;
-
-  /// Number of steps between accesses along each dimension
-  typedef Delta_ Delta;
-
-  /// Number of accesses performed
-  typedef Iterations_ Iterations;
-
-  /// Functor that returns the logical coordinate of each entity's initial offset in the tile
-  typedef ThreadOffset_ ThreadOffset;
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Iterator for accessing a stripmined tile in memory
-template <typename Traits_,
-          typename Scalar_,
-          IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
-          MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
-          typename Index_ = int,
-          typename FragmentElement_ = Scalar_,
-          IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
-          typename Skew_ = Shape<0, 0, 0, 0> >
-struct TileIteratorBase {
-  /// concept TileTraits
-  typedef Traits_ Traits;
-
-  /// Scalar element
-  typedef Scalar_ Scalar;
-
-  /// Fragment element
-  typedef FragmentElement_ FragmentElement;
-
-  /// Specifies dimension in which post-increment accesses advance.
-  static IteratorAdvance::Kind const kAdvance = Advance_;
-
-  /// Specifies iterator storage fragment type (Scalar or WmmaMatrix)
-  static IteratorFragment::Kind const kIteratorFragment = IteratorFragment_;
-
-  /// Source or destination memory space
-  static MemorySpace::Kind const kMemorySpace = MemorySpace;
-
-  /// Index type
-  typedef Index_ Index;
-
-  /// Skew quantity
-  typedef Skew_ Skew;
-
-  /// Tile shape
-  typedef typename Traits::Tile Tile;
-
-  /// Distance along each dimension
-  typedef typename Traits::Delta Delta;
-
-  /// The strides in each dimension between different loads/stores.
-  typedef typename Traits::ImmediateOffsetStrides ImmediateOffsetStrides;
-
-  /// Iterations
-  typedef typename Traits::Iterations Iterations;
-
-  /// Thread offset
-  typedef typename Traits::ThreadOffset ThreadOffset;
-
-  /// The number of scalars accessed per load/store.
-  static int const kAccessSize = Tile::kC;
-
-  /// The elements loaded/stored by one instruction.
-  typedef typename Vectorize<FragmentElement, kAccessSize>::Type AccessType;
-
-  /// The size of storage needed per fragment
-  static int const kFragmentSize =
-      (kIteratorFragment == IteratorFragment::kWmmaMatrix ? 16 : sizeof(AccessType));
-  /// The storage.
-  typedef Fragment<Scalar, ShapeCount<Tile>::kCount, kFragmentSize> Storage;
-  /// The fragment.
-  typedef Fragment<FragmentElement, ShapeCount<Iterations>::kCount * kAccessSize> Fragment;
-  /// The fragment iterator.
-  typedef FragmentIterator<Fragment, Iterations, AccessType> FragmentIterator;
-  /// The fragment const iterator.
-  typedef FragmentConstIterator<Fragment, Iterations, AccessType> FragmentConstIterator;
-  /// The shape of the fragment.
-  typedef typename FragmentIterator::FragmentShape FragmentShape;
-
-  /// Default predicate mask type
-  typedef PredicateVector<ShapeCount<Iterations>::kCount> PredicateVector;
-
-  //
-  // Params struct
-  //
-
-  /// Parameters to the iterator
-  struct Params {
-    Index stride_d;
-    Index stride_h;
-    Index stride_w;
-
-    Index inc_d;
-    Index inc_h;
-    Index inc_w;
-
-    Index inc_advance;
-
-    /// Initializes params
-    CUTLASS_HOST_DEVICE
-    int initialize(Index _stride_d,
-                   Index _stride_h,
-                   Index _stride_w,
-                   Index _inc_d,
-                   Index _inc_h,
-                   Index _inc_w,
-                   Index _inc_advance) {
-      stride_d = _stride_d;
-      stride_h = _stride_h;
-      stride_w = _stride_w;
-
-      inc_d = _inc_d;
-      inc_h = _inc_h;
-      inc_w = _inc_w;
-      inc_advance = _inc_advance;
-
-      return 0;
-    }
-
-    CUTLASS_HOST_DEVICE
-    int initialize(Index _stride_d, Index _stride_h, Index _stride_w) {
-      stride_d = _stride_d;
-      stride_h = _stride_h;
-      stride_w = _stride_w;
-
-      inc_w = stride_w * Delta::kW;
-      inc_h = stride_h * Delta::kH - stride_w * Delta::kW * (Iterations::kW - 1);
-
-      if (kAdvance == IteratorAdvance::kH) {
-        // Advance in the H dimension.
-        inc_d = 0;
-      } else if (kAdvance == IteratorAdvance::kW) {
-        // Advance in the W dimension.
-        inc_d = stride_w * Tile::kW - stride_h * Tile::kH;
-      } else {
-        // Advance in the D dimension.
-        inc_d = stride_d;
-      }
-
-      inc_advance = 0;
-
-      return 0;
-    }
-
-    CUTLASS_HOST_DEVICE int initialize() {
-      stride_d = 0;
-      stride_h = 0;
-      stride_w = 1;
-
-      inc_d = inc_h = inc_w = inc_advance = 0;
-
-      return 0;
-    }
-  };
-
-  /// Is the iterator valid?
-  CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
-
-  //
-  // Static function members
-  //
-
-  /// Initializes a predicate vector
-  template <typename PredicateIterator>
-  CUTLASS_DEVICE static void initialize_predicates(PredicateIterator predicate_it,
-                                                   Coord<3> const &bounds,
-                                                   Coord<3> const &offset = make_Coord(0, 0, 0)) {
-    for (int d = 0; d < Iterations::kD; ++d) {
-      bool enable_d = (d * Delta::kD + offset[0] < bounds[0]);
-      for (int h = 0; h < Iterations::kH; ++h) {
-        bool enable_h = (h * Delta::kH + offset[1] < bounds[1]);
-        for (int w = 0; w < Iterations::kW; ++w) {
-          bool enable_w = (w * Tile::kC * Delta::kW + offset[2] < bounds[2]);
-          predicate_it.set(d, h, w, 0, enable_d && enable_h && enable_w);
-        }
-      }
-    }
-  }
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup tile_load_iterator_concept Tile Load Iterator Concept
-@{
-
-\ref tile_load_iterator_concept enables loading a tile from addressable memory into a fragment
-
-@par Tile Load Iterator Concept
-  Types satisfying \ref tile_load_iterator_concept define the following members
-  - <b>PredicateVector</b> - a \ref predicate_vector_concept with sufficient predicate storage for
-each access implied by the tile traits
-  - <b>Fragment</b> - the destination fragment type satisfying \ref fragment_concept
-  - <b>initialize_predicates(pred_it, bounds, block_offset)</b> - function initializing a predicate
-vector according to externally specified bounds
-  - <b>load_post_increment(fragment, pred_it)</b> - a method that loads a fragment and increments
-the iterator to the next tile, guarded by a \ref predicate_iterator_concept
-  - <b>load_post_increment(fragment)</b> - a method that loads a fragment and increments the
-iterator to the next tile
-  - <b>load(fragment, pred_it)</b> - a const method that loads a fragment, guarded by a \ref
-predicate_iterator_concept
-  - <b>load(fragment)</b> - a method that loads a fragment
-
-@}
-*/
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief An iterator implementing \ref tile_load_iterator_concept for loading a tile from memory
-* @concept{tile_load_iterator_concept}
-*/
-template <typename Traits_,
-          typename Scalar_,
-          IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
-          MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
-          typename Index_ = int,
-          typename FragmentElement_ = Scalar_,
-          IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
-          typename Skew_ = Shape<0, 0, 0, 0> >
-struct TileLoadIterator : public TileIteratorBase<Traits_,
-                                                  Scalar_,
-                                                  Advance_,
-                                                  MemorySpace,
-                                                  Index_,
-                                                  FragmentElement_,
-                                                  IteratorFragment_,
-                                                  Skew_> {
-  /// Base class
-  typedef TileIteratorBase<Traits_,
-                           Scalar_,
-                           Advance_,
-                           MemorySpace,
-                           Index_,
-                           FragmentElement_,
-                           IteratorFragment_,
-                           Skew_>
-      Base;
-
-  /// concept TileTraits
-  typedef typename Base::Traits Traits;
-
-  /// Scalar element
-  typedef typename Base::Scalar Scalar;
-
-  /// Fragment element
-  typedef typename Base::FragmentElement FragmentElement;
-
-  /// Specifies in which dimension post-increment accesses advance.
-  static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
-
-  /// Specifies type of iterator fragment storage (Scalar or WmmaMatrix)
-  static IteratorFragment::Kind const kIteratorFragment = Base::kIteratorFragment;
-
-  /// Source or destination memory space
-  static MemorySpace::Kind const kMemorySpace = Base::kMemorySpace;
-
-  /// Index type
-  typedef typename Base::Index Index;
-
-  /// Skew quantity
-  typedef typename Base::Skew Skew;
-
-  /// Tile shape
-  typedef typename Base::Tile Tile;
-
-  /// Delta
-  typedef typename Base::Delta Delta;
-
-  /// Iterations
-  typedef typename Base::Iterations Iterations;
-
-  /// ThreadOffset functor
-  typedef typename Base::ThreadOffset ThreadOffset;
-
-  /// Fragment type
-  typedef typename Base::FragmentShape FragmentShape;
-
-  /// Memory access type
-  typedef typename Base::AccessType AccessType;
-
-  /// Fragment definition
-  typedef typename Base::Fragment Fragment;
-
-  /// Fragment iterator definition
-  typedef typename Base::FragmentIterator FragmentIterator;
-
-  /// Fragment const iterator definition
-  typedef typename Base::FragmentConstIterator FragmentConstIterator;
-
-  /// Default predicate mask type
-  typedef typename Base::PredicateVector PredicateVector;
-
-  /// Storage object that may be loaded from
-  typedef typename Base::Storage SharedStorage;
-
-  /// IteratorBase parameters
-  typedef typename Base::Params BaseParams;
-
-  /// Do we require a fence?
-  enum { kRequiresLoadFence = Tile::kD == 1 };
-
-  /// The pointer type
-  typedef Scalar const *Pointer;
-
-  /// Parameters
-  struct Params : public BaseParams {
-    /// Pointer to memory
-    Scalar const *pointer;
-
-    /// Initialize params to access storage object
-    CUTLASS_HOST_DEVICE
-    int initialize(SharedStorage const &storage) {
-      pointer = &storage[0];
-      return 0;
-    }
-
-    /// Initializes params to access a raw pointer
-    CUTLASS_HOST_DEVICE
-    int initialize(Scalar const *ptr, Index stride_d, Index stride_h, Index stride_w) {
-      Base::Params::initialize(stride_d, stride_h, stride_w);
-      pointer = ptr;
-      return 0;
-    }
-
-    /// Initializes params
-    CUTLASS_HOST_DEVICE
-    int initialize(Scalar const *ptr,
-                   Index _stride_d,
-                   Index _stride_h,
-                   Index _stride_w,
-                   Index _inc_d,
-                   Index _inc_h,
-                   Index _inc_w,
-                   Index _inc_advance) {
-      pointer = ptr;
-      Base::Params::initialize(
-          _stride_d, _stride_h, _stride_w, _inc_d, _inc_h, _inc_w, _inc_advance);
-      return 0;
-    }
-
-    // Initializes params to default values
-    CUTLASS_HOST_DEVICE
-    int initialize() { return Base::Params::initialize(); }
-  };
-
-  //
-  // Data members
-  //
-
-  /// Parameters structure
-  Params params;
-
-  /// Offset of an individual lane from the start of the tile
-  Coord<4> thread_offset;
-
-  /// Stage argument enables wrapping after some number of tiles have been loaded.
-  int stage;
-
-  //
-  // Static member functions
-  //
-
-  /// Initializes a predicate vector
-  template <typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void initialize_predicates(PredicateIterator predicate_it,
-                                                 Coord<3> const &bounds,
-                                                 Coord<3> const &block_offset = make_Coord(0,
-                                                                                           0,
-                                                                                           0)) {
-    Base::initialize_predicates(
-        predicate_it,
-        bounds,
-        block_offset + make_Coord(0, thread_offset[1], thread_offset[2] * Tile::kC));
-  }
-
-  //
-  // Methods
-  //
-
-  /// Default constructor
-  CUTLASS_HOST_DEVICE
-  TileLoadIterator() {}
-
-  /// Constructs a tile load iterator
-  CUTLASS_HOST_DEVICE
-  TileLoadIterator(Params const &_params,
-                   Coord<3> const &block_offset = make_Coord(0, 0, 0),
-                   ThreadOffset thread_offset_func = ThreadOffset())
-      : params(_params), stage(0) {
-    thread_offset = thread_offset_func();
-
-    Index block_offset_h = 0;
-    Index block_offset_w = 0;
-    if (kAdvance == IteratorAdvance::kH) {
-      block_offset_h = block_offset[1];
-      block_offset_w = block_offset[2];
-    } else {
-      block_offset_h = block_offset[2];
-      block_offset_w = block_offset[1];
-    }
-
-    params.pointer += block_offset[0] * params.stride_d +
-                      (block_offset_h + thread_offset[1]) * params.stride_h +
-                      (block_offset_w + thread_offset[2] * Tile::kC) / Tile::kC * params.stride_w;
-  }
-
-  /// Constructs a tile load iterator
-  CUTLASS_HOST_DEVICE
-  TileLoadIterator(Params const &,
-                   SharedStorage &shared_storage,
-                   Coord<3> const &block_offset = make_Coord(0, 0, 0),
-                   ThreadOffset thread_offset_func = ThreadOffset())
-      : stage(0) {
-    int const offset = thread_offset_func()[2];
-    params.pointer = &shared_storage[offset];
-  }
-
-  /// Returns the current pointer
-  CUTLASS_HOST_DEVICE
-  Scalar const *data() const { return params.pointer; }
-
-  /// Increment in the D dimension
-  CUTLASS_HOST_DEVICE void inc_d() { params.pointer += params.inc_d; }
-
-  /// Increment in the H dimension
-  CUTLASS_HOST_DEVICE void inc_h() { params.pointer += params.inc_h; }
-
-  /// Increment in the W dimension
-  CUTLASS_HOST_DEVICE void inc_w() { params.pointer += params.inc_w; }
-
-  /// Increment in the next dimension
-  CUTLASS_HOST_DEVICE void inc_advance() { params.pointer += params.inc_advance; }
-
-  /// Increment the stage.
-  CUTLASS_DEVICE void inc_stage() {
-    if (Tile::kD > 1) {
-      int const kStageSize = Tile::kH * Tile::kW * Tile::kC;
-      if (stage == Tile::kD - 1) {
-        params.pointer -= (Tile::kD - 1) * kStageSize;
-        stage = 0;
-      } else {
-        params.pointer += kStageSize;
-        stage = stage + 1;
-      }
-    }
-  }
-
- public:
-  /// Loads a fragment and advances the iterator to the next tile.
-  template <typename Fragment, typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void load_post_increment(Fragment &fragment, PredicateIterator pred_it) {
-    FragmentIterator frag_iterator(fragment);
-
-    for (int d = 0; d < Iterations::kD; ++d) {
-      for (int h = 0; h < Iterations::kH; ++h) {
-        for (int w = 0; w < Iterations::kW; ++w, ++pred_it) {
-          if (*pred_it) {
-            Load<typename Fragment::Element, Tile::kC, kMemorySpace>::load(
-                reinterpret_cast<AccessType &>(frag_iterator.at(d, h, w, 0)), data(), 0);
-          }
-
-          if (w < Iterations::kW - 1) {
-            inc_w();
-          }
-        }
-        if (h < Iterations::kH - 1) {
-          inc_h();
-        }
-      }
-      if (d < Iterations::kD - 1) {
-        inc_d();
-      }
-    }
-    inc_advance();
-  }
-
-  /// Loads a fragment and advances the iterator to the next tile.
-  template <typename Fragment>
-  CUTLASS_HOST_DEVICE void load_post_increment(Fragment &fragment) {
-    typename PredicateVector::TrivialIterator pred_it;
-    load_post_increment(fragment, pred_it);
-  }
-
-  /// Loads a fragment without advancing the iterator.
-  template <typename Fragment, typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void load(Fragment &fragment, PredicateIterator pred_it) const {
-    TileLoadIterator _load_it(*this);
-    _load_it.load_post_increment(fragment, pred_it);
-  }
-
-  /// Loads a fragment without advancing the iterator.
-  template <typename Fragment>
-  CUTLASS_HOST_DEVICE void load(Fragment &fragment) const {
-    typename PredicateVector::TrivialIterator pred_it;
-    load(fragment, pred_it);
-  }
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/*!@defgroup tile_store_iterator_concept Tile Store Iterator Concept
-@{
-
-\ref tile_store_iterator_concept enables storing a tile to addressable memory
-
-@par Tile Store Iterator Concept
-  Types satisfying \ref tile_store_iterator_concept define the following members
-  - <b>PredicateVector</b> - a \ref predicate_vector_concept with sufficient predicate storage for
-each access implied by the tile traits
-  - <b>Fragment</b> - the source fragment type satisfying \ref fragment_concept
-  - <b>initialize_predicates(pred_it, bounds, block_offset)</b> - function initializing a predicate
-vector according to externally specified bounds
-  - <b>store_post_increment(fragment, pred_it)</b> - a method that stores a fragment and increments
-the iterator to the next tile, guarded by a \ref predicate_iterator_concept
-  - <b>store_post_increment(fragment)</b> - a method that stores a fragment and increments the
-iterator to the next tile
-  - <b>store(fragment, pred_it)</b> - a const method that stores a fragment, guarded by a \ref
-predicate_iterator_concept
-  - <b>store(fragment)</b> - a method that stores a fragment
-
-@}
-*/
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/**
-* @brief An iterator implementing \ref tile_store_iterator_concept for storing a tile to memory
-* @concept{tile_store_iterator_concept}
-*/
-template <typename Traits_,
-          typename Scalar_,
-          IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
-          MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
-          typename Index_ = int,
-          typename FragmentElement_ = Scalar_,
-          IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
-          typename Skew_ = Shape<0, 0, 0, 0> >
-struct TileStoreIterator : public TileIteratorBase<Traits_,
-                                                   Scalar_,
-                                                   Advance_,
-                                                   MemorySpace,
-                                                   Index_,
-                                                   FragmentElement_,
-                                                   IteratorFragment_,
-                                                   Skew_> {
-  /// Base class
-  typedef TileIteratorBase<Traits_,
-                           Scalar_,
-                           Advance_,
-                           MemorySpace,
-                           Index_,
-                           FragmentElement_,
-                           IteratorFragment_,
-                           Skew_>
-      Base;
-
-  /// concept TileTraits
-  typedef typename Base::Traits Traits;
-
-  /// Scalar element
-  typedef typename Base::Scalar Scalar;
-
-  /// Fragment element
-  typedef typename Base::FragmentElement FragmentElement;
-
-  /// Specifies in which dimension post-increment accesses advance.
-  static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
-
-  /// Specifies type of iterator fragment storage (Scalar or WmmaMatrix)
-  static IteratorFragment::Kind const kIteratorFragment = Base::kIteratorFragment;
-
-  /// Source or destination memory space
-  static MemorySpace::Kind const kMemorySpace = Base::kMemorySpace;
-
-  /// Index type
-  typedef typename Base::Index Index;
-
-  /// Skew quantity
-  typedef typename Base::Skew Skew;
-
-  /// Tile shape
-  typedef typename Base::Tile Tile;
-
-  /// Delta
-  typedef typename Base::Delta Delta;
-
-  /// Iterations
-  typedef typename Base::Iterations Iterations;
-
-  /// ThreadOffset functor
-  typedef typename Base::ThreadOffset ThreadOffset;
-
-  /// Fragment type
-  typedef typename Base::FragmentShape FragmentShape;
-
-  /// Memory access type
-  typedef typename Base::AccessType AccessType;
-
-  /// Fragment definition
-  typedef typename Base::Fragment Fragment;
-
-  /// Fragment iterator definition
-  typedef typename Base::FragmentIterator FragmentIterator;
-
-  /// Fragment const iterator definition
-  typedef typename Base::FragmentConstIterator FragmentConstIterator;
-
-  /// Default predicate mask type
-  typedef typename Base::PredicateVector PredicateVector;
-
-  /// Storage object which may be stored to
-  typedef typename Base::Storage SharedStorage;
-
-  /// IteratorBase parameters
-  typedef typename Base::Params BaseParams;
-
-  /// Parameters
-  struct Params : public BaseParams {
-    /// Pointer to memory
-    Scalar *pointer;
-
-    /// Initialize params to access storage object
-    CUTLASS_HOST_DEVICE
-    int initialize(SharedStorage &storage) {
-      pointer = &storage[0];
-      return 0;
-    }
-
-    /// Initializes params to access a raw pointer
-    CUTLASS_HOST_DEVICE
-    int initialize(Scalar *ptr, Index stride_d, Index stride_h, Index stride_w) {
-      Base::Params::initialize(stride_d, stride_h, stride_w);
-      pointer = ptr;
-      return 0;
-    }
-
-    /// Initializes params
-    CUTLASS_HOST_DEVICE
-    int initialize(Scalar *ptr,
-                   Index _stride_d,
-                   Index _stride_h,
-                   Index _stride_w,
-                   Index _inc_d,
-                   Index _inc_h,
-                   Index _inc_w,
-                   Index _inc_advance) {
-      pointer = ptr;
-      Base::Params::initialize(
-          _stride_d, _stride_h, _stride_w, _inc_d, _inc_h, _inc_w, _inc_advance);
-      return 0;
-    }
-
-    /// Initializes params to default values
-    CUTLASS_HOST_DEVICE
-    int initialize() { return Base::Params::initialize(); }
-  };
-
-  //
-  // Data members
-  //
-
-  /// Parameters structure
-  Params params;
-
-  /// Offset of an individual lane from the start of the tile
-  Coord<4> thread_offset;
-
-  /// The stage.
-  int stage;
-
-  //
-  // Static member functions
-  //
-
-  /// Initializes a predicate vector
-  template <typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void initialize_predicates(PredicateIterator predicate_it,
-                                                 Coord<3> const &bounds,
-                                                 Coord<3> const &block_offset = make_Coord(0,
-                                                                                           0,
-                                                                                           0)) {
-    Base::initialize_predicates(
-        predicate_it,
-        bounds,
-        block_offset + make_Coord(0, thread_offset[1], thread_offset[2] * Tile::kC));
-  }
-
-  //
-  // Methods
-  //
-
-  /// Default constructor
-  CUTLASS_HOST_DEVICE
-  TileStoreIterator() {}
-
-  /// Constructs a tile store iterator
-  CUTLASS_HOST_DEVICE
-  TileStoreIterator(Params const &_params,
-                    Coord<3> const &block_offset = make_Coord(0, 0, 0),
-                    ThreadOffset thread_offset_func = ThreadOffset())
-      : params(_params), stage(0) {
-    thread_offset = thread_offset_func();
-
-    params.pointer += block_offset[0] * params.stride_d +
-                      (block_offset[1] + thread_offset[1]) * params.stride_h +
-                      (block_offset[2] + thread_offset[2] * Tile::kC) / Tile::kC * params.stride_w;
-  }
-
-  /// Constructs a tile store iterator
-  CUTLASS_HOST_DEVICE
-  TileStoreIterator(Params const &,
-                    SharedStorage &shared_storage,
-                    Coord<3> const &block_offset = make_Coord(0, 0, 0),
-                    ThreadOffset thread_offset_func = ThreadOffset())
-      : stage(0) {
-    int const offset = thread_offset_func()[2];
-    params.pointer = &shared_storage[offset];
-  }
-
-  /// Returns the current pointer
-  CUTLASS_HOST_DEVICE
-  Scalar *data() const { return params.pointer; }
-
-  /// Increment in the D dimension
-  CUTLASS_HOST_DEVICE void inc_d() { params.pointer += params.inc_d; }
-
-  /// Increment in the H dimension
-  CUTLASS_HOST_DEVICE void inc_h() { params.pointer += params.inc_h; }
-
-  /// Increment in the W dimension
-  CUTLASS_HOST_DEVICE void inc_w() { params.pointer += params.inc_w; }
-
-  /// Increment in the next dimension
-  CUTLASS_HOST_DEVICE void inc_advance() {}
-
-  /// Increment the stage.
-  CUTLASS_DEVICE void inc_stage() {
-    if (Tile::kD > 1) {
-      int const kStageSize = Tile::kH * Tile::kW * Tile::kC;
-      if (stage == Tile::kD - 1) {
-        params.pointer -= (Tile::kD - 1) * kStageSize;
-        stage = 0;
-      } else {
-        params.pointer += kStageSize;
-        stage = stage + 1;
-      }
-    }
-  }
-
- public:
-  /// Stores a fragment and advances to the next tile.
-  template <typename Fragment, typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void store_post_increment(Fragment &fragment, PredicateIterator pred_it) {
-    FragmentIterator frag_iterator(fragment);
-
-    for (int d = 0; d < Iterations::kD; ++d) {
-      for (int h = 0; h < Iterations::kH; ++h) {
-        for (int w = 0; w < Iterations::kW; ++w, ++pred_it) {
-          if (*pred_it) {
-            Store<typename Fragment::Element, Tile::kC, kMemorySpace>::store(
-                reinterpret_cast<AccessType &>(frag_iterator.at(d, h, w, 0)), data(), 0);
-          }
-          if (w < Iterations::kW - 1) {
-            inc_w();
-          }
-        }
-        if (h < Iterations::kH - 1) {
-          inc_h();
-        }
-      }
-      if (d < Iterations::kD - 1) {
-        inc_d();
-      }
-    }
-    inc_advance();
-  }
-
-  /// Stores a fragment and advances to the next tile.
-  template <typename Fragment>
-  CUTLASS_HOST_DEVICE void store_post_increment(Fragment &fragment) {
-    typename PredicateVector::TrivialIterator pred_it;
-    store_post_increment(fragment, pred_it);
-  }
-
-  /// Stores a fragment without advancing the iterator.
-  template <typename Fragment, typename PredicateIterator>
-  CUTLASS_HOST_DEVICE void store(Fragment &fragment, PredicateIterator pred_it) const {
-    TileStoreIterator _store_it(*this);
-    _store_it.store_post_increment(fragment, pred_it);
-  }
-
-  /// Stores a fragment without advancing the iterator.
-  template <typename Fragment>
-  CUTLASS_HOST_DEVICE void store(Fragment &fragment) const {
-    typename PredicateVector::TrivialIterator pred_it;
-    store(fragment, pred_it);
-  }
-};
-}
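The wrap-around in the removed `inc_stage()` above treats the D dimension of the tile as a circular buffer of stages. A minimal host-side sketch of that bookkeeping follows. It is an illustration under stated assumptions, not CUTLASS code: `StageStateSketch` and `advance_stage` are hypothetical names, and the pointer is modeled as a plain element offset.

```cpp
// Hypothetical model of the iterator state: an element offset plus a stage index.
struct StageStateSketch {
  long offset;
  int stage;
};

// Advances to the next stage. With a single stage nothing changes; on the last
// stage the offset rewinds to the first stage, otherwise it moves forward by
// one stage worth of elements.
constexpr StageStateSketch advance_stage(StageStateSketch s, int stages, long stage_size) {
  return stages <= 1
             ? s
             : (s.stage == stages - 1)
                   ? StageStateSketch{s.offset - (stages - 1) * stage_size, 0}
                   : StageStateSketch{s.offset + stage_size, s.stage + 1};
}
```

With three stages, three consecutive calls return the state to its starting offset and stage.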
--- a/cutlass/tile_traits_standard.h
+++ b/cutlass/tile_traits_standard.h
@ -1,238 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines tile traits for several tile partitioning arrangements of threads expected to
-      achieve efficient streaming performance.
-*/
-#pragma once
-
-#include <cutlass/tile_iterator.h>
-
-namespace cutlass {
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Basic thread offset function computed from a thread shape
-template <typename ThreadShape>
-struct TiledThreadOffset {
-  /// Computes the logical coordinate from thread shape
-  CUTLASS_HOST_DEVICE
-  Coord<4> operator()() const {
-    Coord<4> thread_offset;
-
-    int index = threadIdx.x;
-
-    thread_offset[3] = (index % ThreadShape::kC);
-    index = (index / ThreadShape::kC);
-
-    thread_offset[2] = (index % ThreadShape::kW);
-    index = (index / ThreadShape::kW);
-
-    thread_offset[1] = (index % ThreadShape::kH);
-    index = (index / ThreadShape::kH);
-
-    thread_offset[0] = index;
-
-    return thread_offset;
-  }
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Tiling in which the number of threads is greater than the
-/// contiguous dimension of the tile.
-template <typename Tile_, int Threads>
-struct TileTraitsStrideMajor {
-  /// Shape of tile
-  typedef Tile_ Tile;
-
-  /// Number of participating threads
-  static int const kThreads = Threads;
-
-  // Static assertions
-  static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
-                "Tiling undefined if elements not divisible by threads.");
-
-  static_assert(Tile::kW <= kThreads,
-                "This specialization assumes there are more threads than the contiguous dimension "
-                "of the tile.");
-
-  /// Shape of threads
-  typedef Shape<1, kThreads / Tile::kW, Tile::kW, 1> ThreadShape;
-
-  /// Delta along each dimension
-  typedef Shape<1, ThreadShape::kH, 1, 1> Delta;
-
-  /// Number of iterations
-  typedef Shape<1, Tile::kH / ThreadShape::kH, 1, 1> Iterations;
-
-  /// Computes the initial offset
-  typedef TiledThreadOffset<ThreadShape> ThreadOffset;
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Tiling in which the number of threads is fewer than the tile size
-/// in the contiguous dimension.
-template <typename Tile_, int Threads>
-struct TileTraitsContiguousMajor {
-  /// Shape of tile
-  typedef Tile_ Tile;
-
-  /// Number of participating threads
-  static int const kThreads = Threads;
-
-  // Static assertions
-  static_assert(Tile::kW >= kThreads,
-                "This specialization assumes the contiguous dimension of the tile is at least as "
-                "large as the number of threads.");
-
-  static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
-                "Tiling undefined if elements not divisible by threads.");
-
-  static_assert(!(Tile::kW % kThreads),
-                "The contiguous size of the tile must be divisible by the number of threads.");
-
-  /// Thread shape
-  typedef Shape<1, 1, kThreads> ThreadShape;
-
-  /// Delta between each thread's access
-  typedef Shape<1, 1, kThreads> Delta;
-
-  /// Number of iterations
-  typedef Shape<1, Tile::kH, Tile::kW / kThreads> Iterations;
-
-  /// Computes the initial offset
-  typedef TiledThreadOffset<ThreadShape> ThreadOffset;
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Tiling in which warps rake across the contiguous dimension
-template <typename Tile_, int Threads>
-struct TileTraitsWarpRake {
-  /// Shape of tile
-  typedef Tile_ Tile;
-
-  /// Number of participating threads
-  static int const kThreads = Threads;
-
-  /// Hard-coded warp size
-  static int const kWarpSize = 32;
-
-  /// Number of participating warps
-  static int const kWarpCount = kThreads / kWarpSize;
-
-  // Static assertions
-  static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
-                "Tiling undefined if elements not divisible by threads.");
-
-  static_assert(!(kThreads % kWarpSize), "Number of threads must be divisible by the warp size.");
-
-  static_assert(!(Tile::kW % kWarpSize), "Contiguous dimension must be divisible by the warp size");
-
-  /// Warps strip-mined across strided dimension
-  static int const kWarpsStrided = __NV_STD_MIN(kWarpCount, Tile::kH);
-
-  /// Warps strip-mined across contiguous dimension
-  static int const kWarpsContiguous = kWarpCount / kWarpsStrided;
-
-  /// Arrangement of threads
-  typedef Shape<1, kWarpsStrided, kWarpsContiguous * kWarpSize> ThreadShape;
-
-  /// The same warp rakes along the contiguous dimension
-  typedef Shape<1, kWarpsStrided, kWarpSize> Delta;
-
-  /// Number of iterations
-  typedef Shape<1, Tile::kH / Delta::kH, Tile::kW / ThreadShape::kW> Iterations;
-
-  /// Computes the thread offset in (H, W) based on thread ID
-  struct ThreadOffset {
-    /// Basic thread offset function computed from a thread shape
-    CUTLASS_HOST_DEVICE
-    Coord<4> operator()() const {
-      int tid = threadIdx.x;
-      int warp = (tid / kWarpSize);
-      int lane = (tid % kWarpSize);
-
-      static int const kWarpSpanContiguous = kWarpSize * Iterations::kW;
-
-      int warp_w = (warp % kWarpsContiguous);
-      int warp_h = (warp / kWarpsContiguous);
-
-      return make_Coord(0, warp_h, lane + kWarpSpanContiguous * warp_w, 0);
-    }
-  };
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Chooses 'best' shape to enable warp raking along contiguous dimension if possible.
-template <typename Tile_, int Threads>
-struct TileTraitsStandard {
-  /// Shape of tile
-  typedef Tile_ Tile;
-
-  /// Number of participating threads
-  static int const kThreads = Threads;
-
-  /// Hard-coded warp size
-  static int const kWarpSize = 32;
-
-  /// Number of participating warps
-  static int const kWarpCount = kThreads / kWarpSize;
-
-  // Static assertions
-  static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
-                "Tiling undefined if elements not divisible by threads.");
-
-  /// Choose the stride-major contiguous tiling if the contiguous dimension is
-  /// smaller than the warp size. Otherwise, if it is divisible by the warp size,
-  /// choose the warp rake arrangement.
-  typedef typename platform::conditional <
-      Tile::kW<kWarpSize,
-               TileTraitsStrideMajor<Tile, Threads>,
-               typename platform::conditional<!(Tile::kW % kWarpSize),
-                                              TileTraitsWarpRake<Tile, Threads>,
-                                              TileTraitsContiguousMajor<Tile, Threads> >::type>::
-          type Traits;
-
-  /// Delta between accesses
-  typedef typename Traits::Delta Delta;
-
-  /// Delta between each thread's access
-  /// TODO MTA this is wrong for sure, but Delta is used for stride computation at the moment
-  typedef Delta ImmediateOffsetStrides;
-
-  /// Number of accesses
-  typedef typename Traits::Iterations Iterations;
-
-  /// Thread offset functor
-  typedef typename Traits::ThreadOffset ThreadOffset;
-};
-
-///////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
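The index decomposition performed by `TiledThreadOffset` in the removed header can be sketched on the host as follows. This is an illustrative sketch, not CUTLASS code: `ShapeSketch`, `Coord4Sketch`, and `tiled_thread_offset` are hypothetical names, and the linear thread index is passed as a parameter instead of being read from `threadIdx.x` so that the logic can be checked without a GPU.

```cpp
// Hypothetical stand-in for a 4-D shape with D, H, W, C extents.
template <int D, int H, int W, int C>
struct ShapeSketch {
  static int const kD = D;
  static int const kH = H;
  static int const kW = W;
  static int const kC = C;
};

// Hypothetical stand-in for a 4-D coordinate.
struct Coord4Sketch {
  int d, h, w, c;
};

// Splits a linear thread index into (d, h, w, c). C varies fastest, then W,
// then H; whatever quotient remains is the D coordinate.
template <typename ThreadShape>
constexpr Coord4Sketch tiled_thread_offset(int index) {
  return Coord4Sketch{
      index / (ThreadShape::kC * ThreadShape::kW * ThreadShape::kH),
      (index / (ThreadShape::kC * ThreadShape::kW)) % ThreadShape::kH,
      (index / ThreadShape::kC) % ThreadShape::kW,
      index % ThreadShape::kC};
}
```

For a `ShapeSketch<1, 4, 8, 1>` arrangement (the shape `TileTraitsStrideMajor` would derive for 32 threads and a tile with `kW == 8`), index 13 maps to (d, h, w, c) = (0, 1, 5, 0).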
--- a/cutlass/util/cutlass_math.h
+++ b/cutlass/util/cutlass_math.h
@ -1,131 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-
-#pragma once
-
-/**
- * \file
- * \brief Math utilities
- */
-
-#include <cutlass/util/platform.h>
-
-namespace cutlass {
-
-/******************************************************************************
- * Static math utilities
- ******************************************************************************/
-
-/**
- * Statically determine if N is a power-of-two
- */
-template <int N>
-struct is_pow2 : platform::integral_constant<bool, (N & (N - 1)) == 0> {};
-
-/**
- * Statically determine log2(N), rounded down
- */
-template <int N, int CurrentVal = N, int Count = 0>
-struct log2_down {
-  /// Static logarithm value
-  enum { value = log2_down<N, (CurrentVal >> 1), Count + 1>::value };
-};
-
-// Base case
-template <int N, int Count>
-struct log2_down<N, 1, Count> {
-  enum { value = Count };
-};
-
-/**
- * Statically determine log2(N), rounded up
- */
-template <int N, int CurrentVal = N, int Count = 0>
-struct log2_up {
-  /// Static logarithm value
-  enum { value = log2_up<N, (CurrentVal >> 1), Count + 1>::value };
-};
-
-// Base case
-template <int N, int Count>
-struct log2_up<N, 1, Count> {
-  enum { value = ((1 << Count) < N) ? Count + 1 : Count };
-};
-
-/**
- * Statically estimate sqrt(N) to the nearest power-of-two
- */
-template <int N>
-struct sqrt_est {
-  enum { value = 1 << (log2_up<N>::value / 2) };
-};
-
-/**
- * For performing a constant-division with a compile-time assertion that the
- * Divisor evenly-divides the Dividend.
- */
-template <int Dividend, int Divisor>
-struct divide_assert {
-  enum { value = Dividend / Divisor };
-
-  static_assert((Dividend % Divisor == 0), "Not an even multiple");
-};
-
-/******************************************************************************
- * Rounding
- ******************************************************************************/
-
-/**
- * Round dividend up to the nearest multiple of divisor
- */
-template <typename dividend_t, typename divisor_t>
-CUTLASS_HOST_DEVICE dividend_t round_nearest(dividend_t dividend, divisor_t divisor) {
-  return ((dividend + divisor - 1) / divisor) * divisor;
-}
-
-/**
- * Greatest common divisor
- */
-template <typename value_t>
-CUTLASS_HOST_DEVICE value_t gcd(value_t a, value_t b) {
-  for (;;) {
-    if (a == 0) return b;
-    b %= a;
-    if (b == 0) return a;
-    a %= b;
-  }
-}
-
-/**
- * Least common multiple
- */
-template <typename value_t>
-CUTLASS_HOST_DEVICE value_t lcm(value_t a, value_t b) {
-  value_t temp = gcd(a, b);
-
-  return temp ? (a / temp * b) : 0;
-}
-
-}  // namespace cutlass
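The static math helpers in the removed header are written as recursive class templates. The same arithmetic can be sketched with `constexpr` functions, as below. This is an alternative formulation for illustration only, not the header's implementation; every `*_sketch` name is hypothetical. The functions assume positive inputs for the logarithms, and `is_pow2_sketch(0)` evaluates to true, as the original bit trick does.

```cpp
// True when n has at most one bit set (note: also true for n == 0).
constexpr bool is_pow2_sketch(int n) { return (n & (n - 1)) == 0; }

// floor(log2(n)) for n >= 1: count how many right shifts reach 1.
constexpr int log2_down_sketch(int n) { return n <= 1 ? 0 : 1 + log2_down_sketch(n >> 1); }

// ceil(log2(n)) for n >= 1: bump the floor when n is not an exact power of two.
constexpr int log2_up_sketch(int n) {
  return (1 << log2_down_sketch(n)) < n ? log2_down_sketch(n) + 1 : log2_down_sketch(n);
}

// Power-of-two estimate of sqrt(n), mirroring 1 << (log2_up / 2).
constexpr int sqrt_est_sketch(int n) { return 1 << (log2_up_sketch(n) / 2); }

// Rounds dividend up to the next multiple of divisor (non-negative inputs).
constexpr int round_up_sketch(int dividend, int divisor) {
  return ((dividend + divisor - 1) / divisor) * divisor;
}

// Euclid's algorithm, written recursively.
constexpr int gcd_sketch(int a, int b) { return b == 0 ? a : gcd_sketch(b, a % b); }

// Least common multiple; divides before multiplying and returns 0 when both inputs are 0.
constexpr int lcm_sketch(int a, int b) {
  return gcd_sketch(a, b) ? (a / gcd_sketch(a, b) * b) : 0;
}
```

For example, `log2_down_sketch(10)` is 3, `log2_up_sketch(10)` is 4, and `round_up_sketch(10, 4)` is 12.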
--- a/cutlass/util/debug.h
+++ b/cutlass/util/debug.h
@ -1,122 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-
-#pragma once
-
-/**
- * \file
- * \brief Debugging and logging functionality
- */
-
-#include <stdio.h>
-
-namespace cutlass {
-
-/******************************************************************************
- * Debug and logging macros
- ******************************************************************************/
-
-/**
- * Formats and prints the given message to stdout
- */
-#if !defined(CUDA_LOG)
-#if !defined(__CUDA_ARCH__)
-#define CUDA_LOG(format, ...) printf(format, __VA_ARGS__)
-#else
-#define CUDA_LOG(format, ...)                              \
-  printf("[block (%d,%d,%d), thread (%d,%d,%d)]: " format, \
-         blockIdx.x,                                       \
-         blockIdx.y,                                       \
-         blockIdx.z,                                       \
-         threadIdx.x,                                      \
-         threadIdx.y,                                      \
-         threadIdx.z,                                      \
-         __VA_ARGS__);
-#endif
-#endif
-
-/**
- * Formats and prints the given message to stdout only if DEBUG is defined
- */
-#if !defined(CUDA_LOG_DEBUG)
-#ifdef DEBUG
-#define CUDA_LOG_DEBUG(format, ...) CUDA_LOG(format, __VA_ARGS__)
-#else
-#define CUDA_LOG_DEBUG(format, ...)
-#endif
-#endif
-
-/**
- * \brief The corresponding error message is printed to \p stderr (or \p stdout in device code)
- * along with the supplied source context.
- *
- * \return The CUDA error.
- */
-__host__ CUTLASS_DEVICE cudaError_t cuda_perror_impl(cudaError_t error,
-                                                     const char* filename,
-                                                     int line) {
-  (void)filename;
-  (void)line;
-  if (error) {
-#if !defined(__CUDA_ARCH__)
-    fprintf(
-        stderr, "CUDA error %d [%s, %d]: %s\n", error, filename, line, cudaGetErrorString(error));
-    fflush(stderr);
-#else
-    printf("CUDA error %d [%s, %d]\n", error, filename, line);
-#endif
-  }
-  return error;
-}
-
-/**
- * \brief Perror macro
- */
-#ifndef CUDA_PERROR
-#define CUDA_PERROR(e) cuda_perror_impl((cudaError_t)(e), __FILE__, __LINE__)
-#endif
-
-/**
- * \brief Perror macro with exit
- */
-#ifndef CUDA_PERROR_EXIT
-#define CUDA_PERROR_EXIT(e)                                     \
-  if (cuda_perror_impl((cudaError_t)(e), __FILE__, __LINE__)) { \
-    exit(1);                                                    \
-  }
-#endif
-
-/**
- * \brief Perror macro only if DEBUG is defined
- */
-#ifndef CUDA_PERROR_DEBUG
-#ifdef DEBUG
-#define CUDA_PERROR_DEBUG(e) CUDA_PERROR(e)
-#else
-#define CUDA_PERROR_DEBUG(e) (e)
-#endif
-#endif
-
-}  // namespace cutlass
--- a/cutlass/vector.h
+++ b/cutlass/vector.h
@ -1,229 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Defines a 1D vector of elements held in the registers of each thread.
-*/
-#pragma once
-
-#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
-#include <cuda_fp16.h>
-#endif
-
-#include <cutlass/util/platform.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <size_t kAlignment_>
-struct AlignedStruct {};
-
-template <>
-struct __align__(1) AlignedStruct<1>{};
-template <>
-struct __align__(2) AlignedStruct<2>{};
-template <>
-struct __align__(4) AlignedStruct<4>{};
-template <>
-struct __align__(8) AlignedStruct<8>{};
-template <>
-struct __align__(16) AlignedStruct<16>{};
-template <>
-struct __align__(32) AlignedStruct<32>{};
-template <>
-struct __align__(64) AlignedStruct<64>{};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int kLanes_>
-union Vector {
-  /// The scalar type.
-  typedef Scalar_ Scalar;
-
-  /// The number of elements in the vector.
-  enum { kLanes = kLanes_ };
-  /// The size of the vector.
-  enum { kVectorSize = kLanes * (int)sizeof(Scalar) };
-  /// The number of registers needed to store the vector.
-  enum { kRegisters = kVectorSize < 4 ? 1 : kVectorSize / 4 };
-
-  // Make sure that the vector type makes sense.
-  static_assert(kVectorSize <= 16, "Vector type is too large");
-
-  /// The aligned storage to make sure we have good alignment.
-  AlignedStruct<kVectorSize> aligned_;
-  /// The associated array of scalars.
-  Scalar scalars[kLanes];
-  /// The data in registers.
-  uint32_t registers[kRegisters];
-
-  /// Accessor to the ith lane.
-  CUTLASS_DEVICE Scalar const& operator[](uint32_t i) const { return scalars[i]; }
-  /// Accessor to the ith lane.
-  CUTLASS_DEVICE Scalar& operator[](uint32_t i) { return scalars[i]; }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
-
-template <int kLanes_>
-union Vector<half, kLanes_> {
-  /// The scalar type.
-  typedef half Scalar;
-
-  /// The number of elements in the vector.
-  enum { kLanes = kLanes_ };
-  /// The size of the vector.
-  enum { kVectorSize = kLanes * (int)sizeof(Scalar) };
-  /// The number of registers needed to store the vector.
-  enum { kRegisters = kVectorSize < 4 ? 1 : kVectorSize / 4 };
-
-  // Make sure that the vector type makes sense.
-  static_assert(kVectorSize <= size_t(16), "Vector type is too large");
-
-  /// The aligned storage to make sure we have good alignment.
-  AlignedStruct<kVectorSize> aligned_;
-  /// The associated array of scalars.
-  uint16_t scalars[kLanes];
-  /// The data in registers.
-  uint32_t registers[kRegisters];
-
-  /// Accessor to the ith lane.
-  CUTLASS_DEVICE Scalar const& operator[](uint32_t i) const {
-    return reinterpret_cast<Scalar const&>(scalars[i]);
-  }
-  /// Accessor to the ith lane.
-  CUTLASS_DEVICE Scalar& operator[](uint32_t i) { return reinterpret_cast<Scalar&>(scalars[i]); }
-};
-
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_>
-CUTLASS_DEVICE void make_zero(Scalar_& x) {
-  x = Scalar_(0);
-}
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Element_, int kLanes_ = 1>
-struct Vectorize {
-  typedef Vector<Element_, kLanes_> Type;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Element_>
-struct Vectorize<Element_, 1> {
-  typedef Element_ Type;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <typename Scalar_, int kLanes_>
-CUTLASS_DEVICE void make_zero(Vector<Scalar_, kLanes_>& vec) {
-  for (int i = 0; i < Vector<Scalar_, kLanes_>::kRegisters; ++i) {
-    vec.registers[i] = 0;
-  }
-}
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-//
-// cutlass::Extent similar to std::extent but applicable to CUTLASS types
-//
-
-/// Returns the extent of a scalar or vector
-template <typename T>
-struct Extent {
-  static size_t const kValue = 1;
-};
-
-/// Returns the number of lanes of a vector if need be
-template <typename T, int Lanes>
-struct Extent<Vector<T, Lanes> > {
-  static size_t const kValue = Lanes;
-};
-
-/// Returns the number of lanes of a vector if need be
-template <typename T, int Lanes>
-struct Extent<Vector<T, Lanes> const> {
-  static size_t const kValue = Lanes;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Traits describing properties of vectors and scalar-as-vectors
-template <typename T>
-struct VectorTraits {
-  /// Scalar type
-  typedef T Scalar;
-
-  /// Number of lanes of vector
-  static int const kLanes = 1;
-
-  /// True if the type is actually a cutlass::Vector, otherwise false
-  static bool const IsVector = false;
-
-  /// Type that is always a vector
-  typedef Vector<T, 1> Vector;
-};
-
-/// Partial specialization for actual cutlass::Vector
-template <typename T, int Lanes>
-struct VectorTraits<Vector<T, Lanes> > {
-  /// Scalar type
-  typedef T Scalar;
-
-  /// Number of lanes of vector
-  static int const kLanes = Lanes;
-
-  /// Type is actually a cutlass::Vector
-  static bool const IsVector = true;
-
-  /// Type that is always a Vector
-  typedef Vector<T, Lanes> Vector;
-};
-
-/// Partial specialization for actual cutlass::Vector
-template <typename T, int Lanes>
-struct VectorTraits<Vector<T, Lanes> const> {
-  /// Scalar type
-  typedef T Scalar;
-
-  /// Number of lanes of vector
-  static int const kLanes = Lanes;
-
-  /// Type is actually a cutlass::Vector
-  static bool const IsVector = true;
-
-  /// Type that is always a Vector
-  typedef Vector<T, Lanes> Vector;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
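The size bookkeeping in the removed `Vector` union (bytes per vector, 32-bit registers per vector, and the 16-byte cap) can be sketched in isolation as follows. `VectorSizeSketch` is a hypothetical name used only for illustration; it reproduces the two enum formulas and the size assertion, and omits the union storage, alignment, and lane accessors. `std::uint16_t` stands in for `half` as a 2-byte scalar.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the size arithmetic of the removed Vector union.
template <typename Scalar, int kLanes>
struct VectorSizeSketch {
  // Total size of the vector in bytes.
  static int const kVectorSize = kLanes * static_cast<int>(sizeof(Scalar));
  // Number of 32-bit registers: sub-word vectors still occupy one register.
  static int const kRegisters = kVectorSize < 4 ? 1 : kVectorSize / 4;

  // Same cap as the original: at most 16 bytes per vector.
  static_assert(kVectorSize <= 16, "Vector type is too large");
};
```

Under this sketch, eight 2-byte lanes occupy 16 bytes and four registers, while two 1-byte lanes still count as one register.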
--- a/cutlass/wmma_matrix.h
+++ b/cutlass/wmma_matrix.h
@ -1,193 +0,0 @@
-/***************************************************************************************************
- * Copyright (c) 2017-2018, NVIDIA CORPORATION.  All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without modification, are permitted
- * provided that the following conditions are met:
- *     * Redistributions of source code must retain the above copyright notice, this list of
- *       conditions and the following disclaimer.
- *     * Redistributions in binary form must reproduce the above copyright notice, this list of
- *       conditions and the following disclaimer in the documentation and/or other materials
- *       provided with the distribution.
- *     * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
- *       to endorse or promote products derived from this software without specific prior written
- *       permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- **************************************************************************************************/
-/*! \file
-    \brief Abstractions for loading and storing matrices using the CUDA WMMA API.
-*/
-#pragma once
-
-#if defined(__CUDACC__) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 700)
-
-// Dependent header files should use the following macro to guard all code using
-// nvcuda::wmma:: to enable compilation for CUDA Compute Capabilities < sm_70.
-// Earlier shader models do not support Tensor Cores.
-#define CUTLASS_USE_WMMA_API
-
-#include "stdio.h"
-
-#include <crt/mma.h>
-#include <cutlass/fragment.h>
-#include <cutlass/load_store.h>
-#include <cutlass/matrix_traits.h>
-#include <cutlass/shape.h>
-#include <cutlass/vector.h>
-
-namespace cutlass {
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Statically maps cutlass::MatrixLayout => nvcuda::wmma layout tags
-template <MatrixLayout::Kind kLayout_>
-struct WmmaLayout {
-  typedef nvcuda::wmma::col_major Layout;
-};
-
-/// Statically maps cutlass::MatrixLayout => nvcuda::wmma layout tags
-template <>
-struct WmmaLayout<MatrixLayout::kRowMajor> {
-  typedef nvcuda::wmma::row_major Layout;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to nvcuda::wmma fragment load and store operations
-template <GemmOperand::Kind kOperand_,
-          MatrixLayout::Kind kLayout_,
-          typename Scalar_,
-          typename WmmaShape_>
-struct WmmaMatrix {};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to nvcuda::wmma fragment accessors for A operand
-template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
-struct WmmaMatrix<GemmOperand::kA, kLayout_, Scalar_, WmmaShape_>
-    : public nvcuda::wmma::fragment<
-          /// The nvcuda::wmma operand name.
-          nvcuda::wmma::matrix_a,
-          /// The dimensions.
-          WmmaShape_::kW,
-          WmmaShape_::kH,
-          WmmaShape_::kD,
-          /// The scalar.
-          Scalar_,
-          /// The layout.
-          typename WmmaLayout<kLayout_>::Layout> {
-  /// This type.
-  typedef WmmaMatrix<GemmOperand::kA, kLayout_, Scalar_, WmmaShape_> This_;
-
-  /// Fill-in the element.
-  CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
-    nvcuda::wmma::fill_fragment(*this, x);
-    return *this;
-  }
-
-  /// Load from memory.
-  CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
-    nvcuda::wmma::load_matrix_sync(*this, pointer, stride);
-  }
-
-  /// Store to memory.
-  CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
-    nvcuda::wmma::store_matrix_sync(pointer, *this, stride);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to nvcuda::wmma fragment accessors for B operand
-template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
-struct WmmaMatrix<GemmOperand::kB, kLayout_, Scalar_, WmmaShape_>
-    : public nvcuda::wmma::fragment<
-          /// The nvcuda::wmma operand name.
-          nvcuda::wmma::matrix_b,
-          /// The dimensions.
-          WmmaShape_::kW,
-          WmmaShape_::kH,
-          WmmaShape_::kD,
-          /// The scalar.
-          Scalar_,
-          /// The layout.
-          typename WmmaLayout<kLayout_>::Layout> {
-  /// This type.
-  typedef WmmaMatrix<GemmOperand::kB, kLayout_, Scalar_, WmmaShape_> This_;
-
-  /// Fill-in the element.
-  CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
-    nvcuda::wmma::fill_fragment(*this, x);
-    return *this;
-  }
-
-  /// Load from memory.
-  CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
-    nvcuda::wmma::load_matrix_sync(*this, pointer, stride);
-  }
-
-  /// Store to memory.
-  CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
-    nvcuda::wmma::store_matrix_sync(pointer, *this, stride);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-/// Adapter to nvcuda::wmma fragment accessors for C operand
-template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
-struct WmmaMatrix<GemmOperand::kC, kLayout_, Scalar_, WmmaShape_>
-    : public nvcuda::wmma::fragment<
-          /// The nvcuda::wmma operand name.
-          nvcuda::wmma::accumulator,
-          /// The dimensions.
-          WmmaShape_::kW,
-          WmmaShape_::kH,
-          WmmaShape_::kD,
-          /// The scalar.
-          Scalar_> {
-  /// This type.
-  typedef WmmaMatrix<GemmOperand::kC, kLayout_, Scalar_, WmmaShape_> This_;
-  /// The layout.
-  static MatrixLayout::Kind const kLayout = kLayout_;
-
-  /// Fill-in the element.
-  CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
-    nvcuda::wmma::fill_fragment(*this, x);
-    return *this;
-  }
-
-  /// Load from memory.
-  CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
-    bool const kIsRowMajor = kLayout == MatrixLayout::kRowMajor;
-    nvcuda::wmma::load_matrix_sync(
-        *this,
-        pointer,
-        stride,
-        kIsRowMajor ? nvcuda::wmma::mem_row_major : nvcuda::wmma::mem_col_major);
-  }
-
-  /// Store to memory.
-  CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
-    bool const kIsRowMajor = kLayout == MatrixLayout::kRowMajor;
-    nvcuda::wmma::store_matrix_sync(
-        pointer,
-        *this,
-        stride,
-        kIsRowMajor ? nvcuda::wmma::mem_row_major : nvcuda::wmma::mem_col_major);
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-}  // namespace cutlass
-
-#endif  // defined CUTLASS_USE_WMMA_API
--- a/docs/_config.yml
+++ b/docs/_config.yml
@ -0,0 +1 @@
+theme: jekyll-theme-minimal
--- a/docs/aligned__buffer_8h.html
+++ b/docs/aligned__buffer_8h.html
@ -0,0 +1,145 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: aligned_buffer.h File Reference</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li><a href="annotated.html"><span>Classes</span></a></li>
+      <li class="current"><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li><a href="files.html"><span>File&#160;List</span></a></li>
+      <li><a href="globals.html"><span>File&#160;Members</span></a></li>
+    </ul>
+  </div>
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div id="nav-path" class="navpath">
+  <ul>
+<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li>  </ul>
+</div>
+</div><!-- top -->
+<div class="header">
+  <div class="summary">
+<a href="#nested-classes">Classes</a> &#124;
+<a href="#namespaces">Namespaces</a>  </div>
+  <div class="headertitle">
+<div class="title">aligned_buffer.h File Reference</div>  </div>
+</div><!--header-->
+<div class="contents">
+
+<p>AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory.  
+<a href="#details">More...</a></p>
+<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
+</div><div class="textblock"><div class="dynheader">
+Include dependency graph for aligned_buffer.h:</div>
+<div class="dyncontent">
+<div class="center"><img src="aligned__buffer_8h__incl.png" border="0" usemap="#aligned__buffer_8h" alt=""/></div>
+<map name="aligned__buffer_8h" id="aligned__buffer_8h">
+</map>
+</div>
+</div><div class="textblock"><div class="dynheader">
+This graph shows which files directly or indirectly include this file:</div>
+<div class="dyncontent">
+<div class="center"><img src="aligned__buffer_8h__dep__incl.png" border="0" usemap="#aligned__buffer_8hdep" alt=""/></div>
+<map name="aligned__buffer_8hdep" id="aligned__buffer_8hdep">
+</map>
+</div>
+</div>
+<p><a href="aligned__buffer_8h_source.html">Go to the source code of this file.</a></p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
+Classes</h2></td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1AlignedBuffer.html">cutlass::AlignedBuffer&lt; T, N, Align &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment.  <a href="structcutlass_1_1AlignedBuffer.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table><table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
+Namespaces</h2></td></tr>
+<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/aligned__buffer_8hdepincl.md5
+++ b/docs/aligned__buffer_8hdepincl.md5
@ -0,0 +1 @@
+6cbc6b81ede44b5f08afd4f4519d56d1
--- a/docs/aligned__buffer_8h__incl.md5
+++ b/docs/aligned__buffer_8h__incl.md5
@ -0,0 +1 @@
+b26c62930ff7668b89f2ee6624e0be3a
--- a/docs/aligned__buffer_8h_source.html
+++ b/docs/aligned__buffer_8h_source.html
--- a/docs/annotated.html
+++ b/docs/annotated.html
@ -0,0 +1,867 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: Class List</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li class="current"><a href="annotated.html"><span>Classes</span></a></li>
+      <li><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li class="current"><a href="annotated.html"><span>Class&#160;List</span></a></li>
+      <li><a href="classes.html"><span>Class&#160;Index</span></a></li>
+      <li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
+      <li><a href="functions.html"><span>Class&#160;Members</span></a></li>
+    </ul>
+  </div>
+</div><!-- top -->
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Class List</div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="textblock">Here are the classes, structs, unions and interfaces with brief descriptions:</div><div class="directory">
+<div class="levels">[detail level <span onclick="javascript:toggleLevel(1);">1</span><span onclick="javascript:toggleLevel(2);">2</span><span onclick="javascript:toggleLevel(3);">3</span><span onclick="javascript:toggleLevel(4);">4</span><span onclick="javascript:toggleLevel(5);">5</span><span onclick="javascript:toggleLevel(6);">6</span>]</div><table class="directory">
+<tr id="row_0_" class="even"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_0_" class="arrow" onclick="toggleFolder('0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass.html" target="_self">cutlass</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_0_" class="arrow" onclick="toggleFolder('0_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1arch.html" target="_self">arch</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation </td></tr>
+<tr id="row_0_0_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 2 &gt;, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 4 &gt;, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 2, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_0116_00_014_01_4_00_0132_00_01half_0bcc4d05f9811035f08cc1b7f0154a4d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 16, 4 &gt;, 32, half_t, LayoutA, half_t, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation specialized for the entire warp </td></tr>
+<tr id="row_0_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__02a3f19a78995f97d793a668e0e4d4f0.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
+<tr id="row_0_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__96363097c47b056f0ca1911afd7f8b7a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation - F16 = F16 * F16 + F16 </td></tr>
+<tr id="row_0_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_01128_01_4_00_0132_00_01uint15918972b95027764b3a849b03075ed2b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 128 &gt;, 32, uint1b_t, layout::RowMajor, uint1b_t, layout::ColumnMajor, int, layout::RowMajor, OpXorPopc &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__927179f46017ea5f58f859f1196c4829.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
+<tr id="row_0_0_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__8ebae0cbdf333fddfe5c24d35ebe8e02.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
+<tr id="row_0_0_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__5299c9c90c8f2f521be0c8cec1c3eb08.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
+<tr id="row_0_0_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__f083347e265b1e9eea5572d86ddb6bf9.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
+<tr id="row_0_0_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_a62aa63a212985df306fb27e8a50aeae.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
+<tr id="row_0_0_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_ab741d81fdc991345cb9e43c29fca573.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
+<tr id="row_0_0_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_5221708cec5828d35db1d1c47cb4964e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * U8 + S32 </td></tr>
+<tr id="row_0_0_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_bef0c048bc0f8ba2d875cb7ab26d363b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * U8 + S32 </td></tr>
+<tr id="row_0_0_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_6e513ccbc44ae7909a60d93b9b5435b3.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
+<tr id="row_0_0_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_0ee08a4520882d24ba9026879265e892.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
+<tr id="row_0_0_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_4746fc55e614df0016c518d3fda2677e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
+<tr id="row_0_0_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_546e9ec6de6a5970b326da6f6280f1d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
+<tr id="row_0_0_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b03e3b50dbcb30d0d1ac062f3a9d5abef.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
+<tr id="row_0_0_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b6d968039dde5c9f062ab15f90a8049fe.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
+<tr id="row_0_0_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4bc4b6ba004e25c44bfd9266c61f937dfb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
+<tr id="row_0_0_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b451d5cf5d7e8cbbe476afe3dab5c09b2.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
+<tr id="row_0_0_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_b0242d7a01097510effbc4718040d3e5.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
+<tr id="row_0_0_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_c7f88bfd32a544fba8111d2dcadeab11.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
+<tr id="row_0_0_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_44a3b2a8df88a2b067f1284515cb5371.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
+<tr id="row_0_0_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_4b7308177b308a272c1889fbe9670275.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
+<tr id="row_0_0_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_5a9888862cebd333ecaf11f7262f77d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
+<tr id="row_0_0_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_31defda8ea2b7d855642ffd77da1a411.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
+<tr id="row_0_0_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_839a7c8bb938d1661f4611e68f85d8cb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
+<tr id="row_0_0_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_73d9802d6b944a5299bc255887db6bbc.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
+<tr id="row_0_0_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmma.html" target="_self">PtxWmma</a></td><td class="desc">WMMA Matrix multiply-add operation </td></tr>
+<tr id="row_0_0_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadA.html" target="_self">PtxWmmaLoadA</a></td><td class="desc">WMMA PTX string load for A, B, and C matrices </td></tr>
+<tr id="row_0_0_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadB.html" target="_self">PtxWmmaLoadB</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadC.html" target="_self">PtxWmmaLoadC</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaStoreD.html" target="_self">PtxWmmaStoreD</a></td><td class="desc">WMMA store for matrix D </td></tr>
+<tr id="row_0_0_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm50.html" target="_self">Sm50</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm60.html" target="_self">Sm60</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm61.html" target="_self">Sm61</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm70.html" target="_self">Sm70</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm72.html" target="_self">Sm72</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm75.html" target="_self">Sm75</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1half__t_00_01LayoutA___00_01cutlass_1_84e30c8cc93eeb7ca02f651bd16d4c38.html" target="_self">Wmma&lt; Shape_, cutlass::half_t, LayoutA_, cutlass::half_t, LayoutB_, ElementC_, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1int4b__t_00_01LayoutA___00_01cutlass_16fd808a90b3cf9d7cfc99f30888ca3fe.html" target="_self">Wmma&lt; Shape_, cutlass::int4b_t, LayoutA_, cutlass::int4b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1uint1b__t_00_01LayoutA___00_01cutlass_c80a7ea4d219cd9b13b560b493338028.html" target="_self">Wmma&lt; Shape_, cutlass::uint1b_t, LayoutA_, cutlass::uint1b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpXorPopc &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01int8__t_00_01LayoutA___00_01int8__t_00_01LayoutB_505c57bb6818a941dc16f00cf35a9ec0.html" target="_self">Wmma&lt; Shape_, int8_t, LayoutA_, int8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_0_61_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01uint8__t_00_01LayoutA___00_01uint8__t_00_01Layout219a464a1248ebfc37aa29bcb10cb1b0.html" target="_self">Wmma&lt; Shape_, uint8_t, LayoutA_, uint8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_1_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_1_" class="arrow" onclick="toggleFolder('0_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1device__memory.html" target="_self">device_memory</a></td><td class="desc"></td></tr>
+<tr id="row_0_1_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_1_0_" class="arrow" onclick="toggleFolder('0_1_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation.html" target="_self">allocation</a></td><td class="desc">Device allocation abstraction that tracks size and capacity </td></tr>
+<tr id="row_0_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation_1_1deleter.html" target="_self">deleter</a></td><td class="desc">Delete functor for CUDA device memory </td></tr>
+<tr id="row_0_2_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_2_" class="arrow" onclick="toggleFolder('0_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue.html" target="_self">epilogue</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_0_" class="arrow" onclick="toggleFolder('0_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_0_" class="arrow" onclick="toggleFolder('0_2_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1Convert.html" target="_self">Convert</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1Convert_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_1_" class="arrow" onclick="toggleFolder('0_2_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombination.html" target="_self">LinearCombination</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_1_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombination_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_2_" class="arrow" onclick="toggleFolder('0_2_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp.html" target="_self">LinearCombinationClamp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_3_" class="arrow" onclick="toggleFolder('0_2_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu.html" target="_self">LinearCombinationRelu</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_4_" class="arrow" onclick="toggleFolder('0_2_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_01int_00_01float_00_01Round_01_4.html" target="_self">LinearCombinationRelu&lt; ElementOutput_, Count, int, float, Round &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_00274a94522c46cd041d0b10d484e2ef3.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_5_" class="arrow" onclick="toggleFolder('0_2_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus.html" target="_self">ReductionOpPlus</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_0_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
+<tr id="row_0_2_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_1_" class="arrow" onclick="toggleFolder('0_2_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_" class="arrow" onclick="toggleFolder('0_2_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" target="_self">RowArrangement</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> determines how one or more warps cover a region of consecutive rows </td></tr>
+<tr id="row_0_2_1_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini91159e6f7e123d881e3ec45101fa4f81.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, false &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 1D tiled arrangement </td></tr>
+<tr id="row_0_2_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_2_" class="arrow" onclick="toggleFolder('0_2_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini6d8790249bf12cac580da73bb37eb791.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, true &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 2D tiled arrangement </td></tr>
+<tr id="row_0_2_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemainief28e98b3f284469f271d28aba73de2e.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueComplexTensorOp.html" target="_self">DefaultEpilogueComplexTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
+<tr id="row_0_2_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueSimt.html" target="_self">DefaultEpilogueSimt</a></td><td class="desc">Defines sensible defaults for epilogues for SimtOps </td></tr>
+<tr id="row_0_2_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueTensorOp.html" target="_self">DefaultEpilogueTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
+<tr id="row_0_2_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueVoltaTensorOp.html" target="_self">DefaultEpilogueVoltaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
+<tr id="row_0_2_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueWmmaTensorOp.html" target="_self">DefaultEpilogueWmmaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for WMMA TensorOps </td></tr>
+<tr id="row_0_2_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedEpilogueTensorOp.html" target="_self">DefaultInterleavedEpilogueTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_7_" class="arrow" onclick="toggleFolder('0_2_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp.html" target="_self">DefaultInterleavedThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_8_" class="arrow" onclick="toggleFolder('0_2_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt.html" target="_self">DefaultThreadMapSimt</a></td><td class="desc">Defines the optimal thread map for SIMT accumulator layouts </td></tr>
+<tr id="row_0_2_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_9_" class="arrow" onclick="toggleFolder('0_2_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp.html" target="_self">DefaultThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp.html" target="_self">DefaultThreadMapVoltaTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_11_" class="arrow" onclick="toggleFolder('0_2_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__95db04b7b72e34283958bd7fbf851d16.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, float &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__52116c60c62f0fd520071558e42b814f.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_12_" class="arrow" onclick="toggleFolder('0_2_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__d58c94abc36b7c5c109b55202c6992e7.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, half_t &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__4433cc988100e98097a748d2670fb0fc.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_13_" class="arrow" onclick="toggleFolder('0_2_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp.html" target="_self">DefaultThreadMapWmmaTensorOp</a></td><td class="desc">Defines the optimal thread map for Wmma TensorOp accumulator layouts </td></tr>
+<tr id="row_0_2_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_14_" class="arrow" onclick="toggleFolder('0_2_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp.html" target="_self">DirectEpilogueTensorOp</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator </td></tr>
+<tr id="row_0_2_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure for host-constructible state </td></tr>
+<tr id="row_0_2_1_14_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
+<tr id="row_0_2_1_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" target="_self">Epilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
+<tr id="row_0_2_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_16_" class="arrow" onclick="toggleFolder('0_2_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase.html" target="_self">EpilogueBase</a></td><td class="desc">Base class for epilogues defining warp-level </td></tr>
+<tr id="row_0_2_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
+<tr id="row_0_2_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_17_" class="arrow" onclick="toggleFolder('0_2_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue.html" target="_self">InterleavedEpilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
+<tr id="row_0_2_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
+<tr id="row_0_2_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_18_" class="arrow" onclick="toggleFolder('0_2_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap.html" target="_self">InterleavedOutputTileThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_19_" class="arrow" onclick="toggleFolder('0_2_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator.html" target="_self">InterleavedPredicatedTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
+<tr id="row_0_2_1_19_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_20_" class="arrow" onclick="toggleFolder('0_2_1_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap.html" target="_self">OutputTileOptimalThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1CompactedThreadMap.html" target="_self">CompactedThreadMap</a></td><td class="desc">Compacted thread map in which the 4D region is contiguous </td></tr>
+<tr id="row_0_2_1_20_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileShape.html" target="_self">OutputTileShape</a></td><td class="desc">Tuple defining point in output tile </td></tr>
+<tr id="row_0_2_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileThreadMap.html" target="_self">OutputTileThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_23_" class="arrow" onclick="toggleFolder('0_2_1_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
+<tr id="row_0_2_1_23_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1SharedLoadIterator.html" target="_self">SharedLoadIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_2_" class="arrow" onclick="toggleFolder('0_2_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp.html" target="_self">FragmentIteratorComplexTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp_3_01WarpShape___00_01Operato8cf03c624cf3210c71b7cbd580b080f8.html" target="_self">FragmentIteratorComplexTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt.html" target="_self">FragmentIteratorSimt</a></td><td class="desc">Fragment iterator for SIMT accumulator arrangements </td></tr>
+<tr id="row_0_2_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt_3_01WarpShape___00_01Operator___00_01la3f2abc523201c1b0228df99119ab88e1.html" target="_self">FragmentIteratorSimt&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp.html" target="_self">FragmentIteratorTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_e459aab140a2ce78336e584f95886726.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Dedicated to interleaved layout </td></tr>
+<tr id="row_0_2_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_5e78dabe303f20d76b00c600aab61eda.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp.html" target="_self">FragmentIteratorVoltaTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gdb805a2dc5571ac3b66e0fe6ffdcede2.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1G16e08718cffa0989cce3fe8dbc4b075b.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp.html" target="_self">FragmentIteratorWmmaTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShfdb1f120c6797383663f9fd11d0fc599.html" target="_self">FragmentIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
+<tr id="row_0_2_2_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy.html" target="_self">SimtPolicy</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy_3_01WarpShape___00_01Operator___00_01layout_1_1Rcef1c60e23e997017ae176c92931151d.html" target="_self">SimtPolicy&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
+<tr id="row_0_2_2_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy.html" target="_self">TensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
+<tr id="row_0_2_2_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout69549d10c3610d943987eb90e827bc05.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Partial specialization for column-major-interleaved </td></tr>
+<tr id="row_0_2_2_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout_1_1RowMajor_01_4.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
+<tr id="row_0_2_2_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt.html" target="_self">TileIteratorSimt</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt_3_01WarpShape___00_01Operator___00_01Elemenf2bd262ed3e202b25d5802d83965bf3b.html" target="_self">TileIteratorSimt&lt; WarpShape_, Operator_, Element_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp.html" target="_self">TileIteratorTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_20_" class="arrow" onclick="toggleFolder('0_2_2_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___003cbb32beb84b4984cb7853662096d289.html" target="_self">TileIteratorTensorOp&lt; WarpShape_, OperatorShape_, Element_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___05f11e023c9e6ee5f7a888fa4c5bbf6d1.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp.html" target="_self">TileIteratorVoltaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_22_" class="arrow" onclick="toggleFolder('0_2_2_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmS2fe0c60b727c738c622c18fc3dd76644.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemm770cbca45441d295d5d7433e8222a700.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_23_" class="arrow" onclick="toggleFolder('0_2_2_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmSa0ceeeddc22575876eb977da7f5416a8.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemmffcab2297c8de8d0013602a39c525b78.html" target="_self">Detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_2_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp.html" target="_self">TileIteratorWmmaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShape_fd6a91cd8bbd07ecd1344326b830e3a4.html" target="_self">TileIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorFragment_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
+<tr id="row_0_2_2_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy.html" target="_self">VoltaTensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
+<tr id="row_0_2_2_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_136ce744d4c1c6e8707f5a9785196194.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
+<tr id="row_0_2_2_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_1d48185f49e4d066f8e9327bf0856b7f.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
+<tr id="row_0_2_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_3_" class="arrow" onclick="toggleFolder('0_2_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1EpilogueWorkspace.html" target="_self">EpilogueWorkspace</a></td><td class="desc"></td></tr>
+<tr id="row_0_2_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_2_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
+<tr id="row_0_3_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_3_" class="arrow" onclick="toggleFolder('0_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm.html" target="_self">gemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_0_" class="arrow" onclick="toggleFolder('0_3_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration.html" target="_self">DefaultGemmConfiguration</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag286687c5e6abe22d241f789fe344a465.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag3026e48abb8c905d1cc6d13d669700e4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc567cad318a31d04b70ea615d6321decd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm70, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcde61af9be1337dac1fdb210e7e7a6e01.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc485a4f0b5a7d2d4ab2c1a24da6328048.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8e2604a56dff3a7595da9ee0604ae55e.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc4fada4957d463c80a2831e47f28157c4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8ab5fd2693c6a6ec43e447acb07f784c.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcffcf31256aed23d4d8d0eab627bc0cad.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb2e258b7bd321c633dd65d3ebcf6414a.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb27bf218007928652d5b803193eab473.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcfea0f3503156e8e3fba6456f0cedafdd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassWmmaTensorOp_00_0884059ecad03bea3e86c4cf722226097.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassWmmaTensorOp, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_14_" class="arrow" onclick="toggleFolder('0_3_0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_15_" class="arrow" onclick="toggleFolder('0_3_0_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html" target="_self">Gemm&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
+<tr id="row_0_3_0_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layou1b211cc9c97c022d8fe10f2dd32c8709.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_16_" class="arrow" onclick="toggleFolder('0_3_0_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_17_" class="arrow" onclick="toggleFolder('0_3_0_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html" target="_self">GemmBatched&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
+<tr id="row_0_3_0_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_213d78696663f4231cd52c6a277c60e5.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_18_" class="arrow" onclick="toggleFolder('0_3_0_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex.html" target="_self">GemmComplex</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_19_" class="arrow" onclick="toggleFolder('0_3_0_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html" target="_self">GemmComplex&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
+<tr id="row_0_3_0_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_a3923967cafb5cb9774c320dc24baa77.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_20_" class="arrow" onclick="toggleFolder('0_3_0_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_0_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_21_" class="arrow" onclick="toggleFolder('0_3_0_21_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01ElementBbe7c1f7154ad5b5bf9d4d28301e2b457.html" target="_self">GemmSplitKParallel&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ConvertScaledOp_, ReductionOp_, ThreadblockSwizzle_, Stages, kAlignmentA, kAlignmentB, Operator_ &gt;</a></td><td class="desc">Partial specialization for column-major output </td></tr>
+<tr id="row_0_3_0_21_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01Elementafcb1aeaf2035a7ac769d7acc233423b.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
+<tr id="row_0_3_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_1_" class="arrow" onclick="toggleFolder('0_3_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_0_" class="arrow" onclick="toggleFolder('0_3_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1detail_1_1GemvBatchedStridedEpilogueScaling.html" target="_self">GemvBatchedStridedEpilogueScaling</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm.html" target="_self">DefaultGemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01layout_1_1ColumnMajorInterleave661fe54d13cc2c9153dcdf31e4beaa30.html" target="_self">DefaultGemm&lt; ElementA, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, kAlignmentA, ElementB, layout::RowMajorInterleaved&lt; InterleavedK &gt;, kAlignmentB, ElementC, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, int32_t, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, IsBetaZero &gt;</a></td><td class="desc">Partial specialization for Turing Integer Matrix Multiply Interleaved layout </td></tr>
+<tr id="row_0_3_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01Edd80343e6570718ed237122e4ebf7fb5.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 1 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for SIMT </td></tr>
+<tr id="row_0_3_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E044b039b2fe402f29b04a9f5feee5342.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm70, ThreadblockShape, WarpShape, GemmShape&lt; 8, 8, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Volta architecture </td></tr>
+<tr id="row_0_3_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E5d78d37a9ae2ec08d7d477d571df036e.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Turing Architecture </td></tr>
+<tr id="row_0_3_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_01inf48440732c1c5f42ddbfaba179861815.html" target="_self">DefaultGemm&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementC, LayoutC, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, false &gt;</a></td><td class="desc">Partial specialization for SIMT DP4A </td></tr>
+<tr id="row_0_3_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemmSplitKParallel.html" target="_self">DefaultGemmSplitKParallel</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemv.html" target="_self">DefaultGemv</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_9_" class="arrow" onclick="toggleFolder('0_3_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_3_1_9_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1Gemm_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
+<tr id="row_0_3_1_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_10_" class="arrow" onclick="toggleFolder('0_3_1_10_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_10_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_3_1_10_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
+<tr id="row_0_3_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_11_" class="arrow" onclick="toggleFolder('0_3_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_3_1_11_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
+<tr id="row_0_3_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_2_" class="arrow" onclick="toggleFolder('0_3_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_2_0_" class="arrow" onclick="toggleFolder('0_3_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1EnableMma__Crow__SM60.html" target="_self">EnableMma_Crow_SM60</a></td><td class="desc">Determines whether to enable thread::Gemm&lt;&gt; specializations compatible with SM50 </td></tr>
+<tr id="row_0_3_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2.html" target="_self">Mma_HFMA2</a></td><td class="desc">Structure to compute the matrix product for HFMA </td></tr>
+<tr id="row_0_3_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_72621f7ab9ae4a4ba4fe9725cf8e89c1.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_94c813e3bbfb6f9857c155166f772687.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_17070298bc4cced0a1b98aee2bb6b455.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_bf6d29bb09a025e7b96942809743e28a.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l26a133b13650c1d058273e3649f60f04.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01lbba3a796be96a0276693ef6b259ecc4a.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l2aa4d2fd2e940e0d0cf7c47bc8f6017c.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l086c058a15d6c79558e4f3d9ff1dc148.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_0e1104c65871c539155bd3a0c7631928b.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::ColumnMajor, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_07ac147cb320ee0d28ff8e78eb4cd330e.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::RowMajor, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Structure to compute the matrix product </td></tr>
+<tr id="row_0_3_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01ElementA___00_01LayoutA___00_01ElementB_e41c1cd6078b6d1347fac239b0639d56.html" target="_self">Mma&lt; Shape_, ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Template that handles conventional layouts for FFMA and DFMA GEMM </td></tr>
+<tr id="row_0_3_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA_00_01half__t_00_01L066c9d2371712cdf0cac099ca9bcc578.html" target="_self">Mma&lt; Shape_, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Structure to compute the matrix product </td></tr>
+<tr id="row_0_3_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA___00_01half__t_00_088f0e99e501b6012297eb30b4e89bcea.html" target="_self">Mma&lt; Shape_, half_t, LayoutA_, half_t, LayoutB_, half_t, layout::RowMajor, arch::OpMultiplyAdd, typename platform::enable_if&lt; detail::EnableMma_Crow_SM60&lt; LayoutA_, LayoutB_ &gt;::value &gt;::type &gt;</a></td><td class="desc">Computes matrix product when C is row-major </td></tr>
+<tr id="row_0_3_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1ColumnMajor_00_013f3785e722edc6e9aab6f866309b8623.html" target="_self">Mma&lt; Shape_, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, int8_t &gt;</a></td><td class="desc">Template that handles conventional layouts for IDP4A </td></tr>
+<tr id="row_0_3_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1RowMajor_00_01int89c659e7faf47264972bdba6cd80f42b.html" target="_self">Mma&lt; Shape_, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Template that handles conventional layouts for IDP4A </td></tr>
+<tr id="row_0_3_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1MmaGeneric.html" target="_self">MmaGeneric</a></td><td class="desc">Template that handles all packed matrix layouts </td></tr>
+<tr id="row_0_3_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_3_" class="arrow" onclick="toggleFolder('0_3_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultGemvCore.html" target="_self">DefaultGemvCore</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma.html" target="_self">DefaultMma</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_0010764e1fd5a3251a57eddafbd83eab8e.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, true &gt;</a></td><td class="desc">Specialization for column-major-interleaved output </td></tr>
+<tr id="row_0_3_3_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00c67c16f9881e4f2fda76d8ed83ebabd6.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass Simt) </td></tr>
+<tr id="row_0_3_3_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00ce36642cae579bce6605ff8edde3c6ab.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass TensorOp) </td></tr>
+<tr id="row_0_3_3_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_07e7230d4011ada5e22cfcb29103b696.html" target="_self">DefaultMma&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, 2, Operator, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore.html" target="_self">DefaultMmaCore</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab94a11a77dd0565102710907089acee0.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShafafd5c61db86cbfe90863578ddd11092.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha46446d1e3871e31d2e728f710d78c8c1.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha8da7a0cfbbe859b701fdd9f2b8566aa7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha84e9f8afb6a4ca9f5dcd219b182d16e7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha2c0d0b7cdb5c4bcb11e83c058eb65345.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: A column-major, B column-major </td></tr>
+<tr id="row_0_3_3_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha34a52cc7b2942e8c290f0032b6779b52.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: A column-major, B row-major </td></tr>
+<tr id="row_0_3_3_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaaf312aafe9da92ea9d417bcc12a8e7dc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: A row-major, B column-major </td></tr>
+<tr id="row_0_3_3_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha863d4139ccaa713bc4bde32c425f4067.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: A row-major, B row-major </td></tr>
+<tr id="row_0_3_3_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaf03a122202ad10acdc96f280106d678b.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha69bef08ea63dd930f99d9788105873dd.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha3adf608332a8c9ee7014fced0da8a9ca.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab7edfba3cdf43a07e3c4d719d87565a4.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc803d38bc1e4618c07c47f54c87ae2678.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instrucf60fe02fcdd80d28b7fd419133465dcc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc2bf00737f4ad0a9da9a8be6d3e66c152.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, ElementB_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_, AccumulatorsInRowMajor &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc24092ddc01fc83dabb7db4c14880fe60.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc4fee9f2965b8468bfb42b94a74527d22.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmBatchedIdentityThreadblockSwizzle.html" target="_self">GemmBatchedIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMMs </td></tr>
+<tr id="row_0_3_3_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmHorizontalThreadblockSwizzle.html" target="_self">GemmHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
+<tr id="row_0_3_3_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmIdentityThreadblockSwizzle.html" target="_self">GemmIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
+<tr id="row_0_3_3_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKHorizontalThreadblockSwizzle.html" target="_self">GemmSplitKHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
+<tr id="row_0_3_3_29_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKIdentityThreadblockSwizzle.html" target="_self">GemmSplitKIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
+<tr id="row_0_3_3_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1Gemv.html" target="_self">Gemv</a></td><td class="desc">Structure to compute the matrix-vector product using SIMT math instructions </td></tr>
+<tr id="row_0_3_3_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemvBatchedStridedThreadblockDefaultSwizzle.html" target="_self">GemvBatchedStridedThreadblockDefaultSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMVs </td></tr>
+<tr id="row_0_3_3_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_3_32_" class="arrow" onclick="toggleFolder('0_3_3_32_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase.html" target="_self">MmaBase</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_3_32_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage object needed by threadblock-scoped GEMM </td></tr>
+<tr id="row_0_3_3_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaPipelined.html" target="_self">MmaPipelined</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
+<tr id="row_0_3_3_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1MmaPolicy.html" target="_self">MmaPolicy</a></td><td class="desc">Policy object describing MmaTensorOp </td></tr>
+<tr id="row_0_3_3_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaSingleStage.html" target="_self">MmaSingleStage</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
+<tr id="row_0_3_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_4_" class="arrow" onclick="toggleFolder('0_3_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1DefaultMmaTensorOp.html" target="_self">DefaultMmaTensorOp</a></td><td class="desc">Partial specialization for m-by-n-by-kgroup </td></tr>
+<tr id="row_0_3_4_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp.html" target="_self">MmaComplexTensorOp</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp_3_01Shape___00_01complex_3_01RealElementA_01_146441010dad1f40eb51b6dae3ded216.html" target="_self">MmaComplexTensorOp&lt; Shape_, complex&lt; RealElementA &gt;, LayoutA_, complex&lt; RealElementB &gt;, LayoutB_, complex&lt; RealElementC &gt;, LayoutC_, Policy_, TransformA, TransformB, Enable &gt;</a></td><td class="desc">Partial specialization for complex*complex+complex =&gt; complex using real-valued TensorOps </td></tr>
+<tr id="row_0_3_4_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimt.html" target="_self">MmaSimt</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
+<tr id="row_0_3_4_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaSimtPolicy.html" target="_self">MmaSimtPolicy</a></td><td class="desc">Describes the arrangement and configuration of per-lane operations in warp-level matrix multiply </td></tr>
+<tr id="row_0_3_4_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator.html" target="_self">MmaSimtTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_67ca7e11a38e38f2c51b84767654a90f.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_f0ce904a9294556f15e1cc9cf7c99a93.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ea0a4e7ce3cd5d25cabf79383efdf4d9.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ada156b62fcbdce47009c5bf1321c92c.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_4ccafbc821b3a55cd532602442a74031.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::ColumnMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_8f92ea79e85febb67169c4b2d94b1b20.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::RowMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOp.html" target="_self">MmaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting Tensor Cores </td></tr>
+<tr id="row_0_3_4_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator.html" target="_self">MmaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_14_" class="arrow" onclick="toggleFolder('0_3_4_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___008f607b871a2b3d854eb4def64712c042.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___0d35fa5dc4e4b4f72784c943fd857fc1d.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_15_" class="arrow" onclick="toggleFolder('0_3_4_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___00027dabdc144edd6276f664ca74088510.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajorInterleaved&lt; InterleavedN &gt;, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___03822d9be37f3725022005a5434441f22.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_16_" class="arrow" onclick="toggleFolder('0_3_4_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___006c39f57875e0aa9d0ad82c8043ed8b98.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::RowMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___093b5d2838ac5a742704ef62b5c8688f0.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator.html" target="_self">MmaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0b84f53cd44b339eccc12067c9f86e11c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0e52ad425e1ee3e68544873f66733237b.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___039819fb3ccd43786d556c2c9669508ef.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0352e0dcab42bc8360606874e00173556.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_22_" class="arrow" onclick="toggleFolder('0_3_4_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0ed7daaeba1c095e77f68533d4d2c475c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, 64 &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___07638f8b7761f6e2e2e6918e2c05e739.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_23_" class="arrow" onclick="toggleFolder('0_3_4_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0c7d419c589d601ce4eb603be566fea21.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0784c74bd670999ec23ad8ef9dc55777.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpPolicy.html" target="_self">MmaTensorOpPolicy</a></td><td class="desc">Policy </td></tr>
+<tr id="row_0_3_4_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOp.html" target="_self">MmaVoltaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting Volta Tensor Cores </td></tr>
+<tr id="row_0_3_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_26_" class="arrow" onclick="toggleFolder('0_3_4_26_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator.html" target="_self">MmaVoltaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_26_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator_1_1Policy.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan0d3248553e52cd61ed8a2b3b12a20343.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_29_" class="arrow" onclick="toggleFolder('0_3_4_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan34be8e21a40af3ebd2dc3dff460dca72.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera33cdf53848564e894d4407637dc86caf.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand734577b7e54a074d143aba59828c2f2.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_31_" class="arrow" onclick="toggleFolder('0_3_4_31_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan16c56cdc2dda5eeb996af8ec0242d501.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_31_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera6fa6d2d3725bb3ec613d5c527ea3ffe7.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan5a221944f4a0e16ccab77ba684856942.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operandcc9821c435540895138bc9af495f321.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_34_" class="arrow" onclick="toggleFolder('0_3_4_34_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operana2f40b28f0d2286b84d86f7238d67b52.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_4_34_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operafa294175b280756dd8388f9ffe7b72c4.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
+<tr id="row_0_3_4_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1WarpSize.html" target="_self">WarpSize</a></td><td class="desc">Query the number of threads per warp </td></tr>
+<tr id="row_0_3_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1BatchedGemmCoord.html" target="_self">BatchedGemmCoord</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmCoord.html" target="_self">GemmCoord</a></td><td class="desc"></td></tr>
+<tr id="row_0_3_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmShape.html" target="_self">GemmShape</a></td><td class="desc">Shape of a matrix multiply-add operation </td></tr>
+<tr id="row_0_4_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_4_" class="arrow" onclick="toggleFolder('0_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1layout.html" target="_self">layout</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1ColumnMajor.html" target="_self">ColumnMajor</a></td><td class="desc">Mapping function for column-major matrices </td></tr>
+<tr id="row_0_4_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorBlockLinear.html" target="_self">ColumnMajorBlockLinear</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorInterleaved.html" target="_self">ColumnMajorInterleaved</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandBCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandBCongruous</a> </td></tr>
+<tr id="row_0_4_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
+<tr id="row_0_4_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ContiguousMatrix.html" target="_self">ContiguousMatrix</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1GeneralMatrix.html" target="_self">GeneralMatrix</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose.html" target="_self">LayoutTranspose</a></td><td class="desc">Defines transposes of matrix layouts </td></tr>
+<tr id="row_0_4_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1ColumnMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::ColumnMajor &gt;</a></td><td class="desc">Transpose of column-major is row-major </td></tr>
+<tr id="row_0_4_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1RowMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::RowMajor &gt;</a></td><td class="desc">Transpose of row-major is column-major </td></tr>
+<tr id="row_0_4_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PackedVectorLayout.html" target="_self">PackedVectorLayout</a></td><td class="desc">Tensor layout for densely packed vectors </td></tr>
+<tr id="row_0_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PitchLinear.html" target="_self">PitchLinear</a></td><td class="desc">Mapping function for pitch-linear memory </td></tr>
+<tr id="row_0_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearCoord.html" target="_self">PitchLinearCoord</a></td><td class="desc">Coordinate in pitch-linear space </td></tr>
+<tr id="row_0_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearShape.html" target="_self">PitchLinearShape</a></td><td class="desc">Template defining a shape used by pitch-linear operators </td></tr>
+<tr id="row_0_4_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1RowMajor.html" target="_self">RowMajor</a></td><td class="desc">Mapping function for row-major matrices </td></tr>
+<tr id="row_0_4_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorBlockLinear.html" target="_self">RowMajorBlockLinear</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorInterleaved.html" target="_self">RowMajorInterleaved</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCongruous.html" target="_self">RowMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCrosswise.html" target="_self">RowMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandBCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandBCongruous</a> </td></tr>
+<tr id="row_0_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
+<tr id="row_0_4_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">RowMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorCxRSKx.html" target="_self">TensorCxRSKx</a></td><td class="desc">Mapping function for 4-D CxRSKx tensors </td></tr>
+<tr id="row_0_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCHW.html" target="_self">TensorNCHW</a></td><td class="desc">Mapping function for 4-D NCHW tensors </td></tr>
+<tr id="row_0_4_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCxHWx.html" target="_self">TensorNCxHWx</a></td><td class="desc">Mapping function for 4-D NC/xHWx tensors </td></tr>
+<tr id="row_0_4_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNHWC.html" target="_self">TensorNHWC</a></td><td class="desc">Mapping function for 4-D NHWC tensors </td></tr>
+<tr id="row_0_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicand.html" target="_self">TensorOpMultiplicand</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandColumnMajorInterleaved.html" target="_self">TensorOpMultiplicandColumnMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
+<tr id="row_0_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous.html" target="_self">TensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous_3_0132_00_01Crosswise_01_4.html" target="_self">TensorOpMultiplicandCongruous&lt; 32, Crosswise &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCrosswise.html" target="_self">TensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandRowMajorInterleaved.html" target="_self">TensorOpMultiplicandRowMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
+<tr id="row_0_4_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandBCongruous.html" target="_self">VoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
+<tr id="row_0_4_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" target="_self">VoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
+<tr id="row_0_4_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCrosswise.html" target="_self">VoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
+<tr id="row_0_5_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_5_" class="arrow" onclick="toggleFolder('0_5_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1library.html" target="_self">library</a></td><td class="desc"></td></tr>
+<tr id="row_0_5_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArguments.html" target="_self">GemmArguments</a></td><td class="desc">Arguments for GEMM </td></tr>
+<tr id="row_0_5_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayArguments.html" target="_self">GemmArrayArguments</a></td><td class="desc">Arguments for GEMM - used by all the GEMM operations </td></tr>
+<tr id="row_0_5_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayConfiguration.html" target="_self">GemmArrayConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
+<tr id="row_0_5_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmBatchedConfiguration.html" target="_self">GemmBatchedConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
+<tr id="row_0_5_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmConfiguration.html" target="_self">GemmConfiguration</a></td><td class="desc">Configuration for basic GEMM operations </td></tr>
+<tr id="row_0_5_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmDescription.html" target="_self">GemmDescription</a></td><td class="desc">Description of all GEMM computations </td></tr>
+<tr id="row_0_5_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexBatchedConfiguration.html" target="_self">GemmPlanarComplexBatchedConfiguration</a></td><td class="desc">Batched complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
+<tr id="row_0_5_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexConfiguration.html" target="_self">GemmPlanarComplexConfiguration</a></td><td class="desc">Complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
+<tr id="row_0_5_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Manifest.html" target="_self">Manifest</a></td><td class="desc"><a class="el" href="classcutlass_1_1library_1_1Manifest.html" title="Manifest of CUTLASS Library. ">Manifest</a> of CUTLASS Library </td></tr>
+<tr id="row_0_5_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1MathInstructionDescription.html" target="_self">MathInstructionDescription</a></td><td class="desc"></td></tr>
+<tr id="row_0_5_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Operation.html" target="_self">Operation</a></td><td class="desc">Base class for all device-wide operations </td></tr>
+<tr id="row_0_5_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1OperationDescription.html" target="_self">OperationDescription</a></td><td class="desc">High-level description of an operation </td></tr>
+<tr id="row_0_5_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TensorDescription.html" target="_self">TensorDescription</a></td><td class="desc">Structure describing the properties of a tensor </td></tr>
+<tr id="row_0_5_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TileDescription.html" target="_self">TileDescription</a></td><td class="desc">Structure describing the tiled structure of a GEMM-like computation </td></tr>
+<tr id="row_0_6_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_6_" class="arrow" onclick="toggleFolder('0_6_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1platform.html" target="_self">platform</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__chunk.html" target="_self">aligned_chunk</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__storage.html" target="_self">aligned_storage</a></td><td class="desc">std::aligned_storage </td></tr>
+<tr id="row_0_6_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_2_" class="arrow" onclick="toggleFolder('0_6_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of.html" target="_self">alignment_of</a></td><td class="desc">std::alignment_of </td></tr>
+<tr id="row_0_6_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_1_1pad.html" target="_self">pad</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01value__t_01_4.html" target="_self">alignment_of&lt; const value_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; const volatile value_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double2_01_4.html" target="_self">alignment_of&lt; double2 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double4_01_4.html" target="_self">alignment_of&lt; double4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01float4_01_4.html" target="_self">alignment_of&lt; float4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01int4_01_4.html" target="_self">alignment_of&lt; int4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01long4_01_4.html" target="_self">alignment_of&lt; long4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong2_01_4.html" target="_self">alignment_of&lt; longlong2 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong4_01_4.html" target="_self">alignment_of&lt; longlong4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01uint4_01_4.html" target="_self">alignment_of&lt; uint4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulong4_01_4.html" target="_self">alignment_of&lt; ulong4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong2_01_4.html" target="_self">alignment_of&lt; ulonglong2 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong4_01_4.html" target="_self">alignment_of&lt; ulonglong4 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; volatile value_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1bool__constant.html" target="_self">bool_constant</a></td><td class="desc">std::bool_constant </td></tr>
+<tr id="row_0_6_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional.html" target="_self">conditional</a></td><td class="desc">std::conditional (true specialization) </td></tr>
+<tr id="row_0_6_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional_3_01false_00_01T_00_01F_01_4.html" target="_self">conditional&lt; false, T, F &gt;</a></td><td class="desc">std::conditional (false specialization) </td></tr>
+<tr id="row_0_6_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete.html" target="_self">default_delete</a></td><td class="desc">Default deleter </td></tr>
+<tr id="row_0_6_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete_3_01T[]_4.html" target="_self">default_delete&lt; T[]&gt;</a></td><td class="desc">Partial specialization for deleting array types </td></tr>
+<tr id="row_0_6_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if.html" target="_self">enable_if</a></td><td class="desc">std::enable_if (true specialization) </td></tr>
+<tr id="row_0_6_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if_3_01false_00_01T_01_4.html" target="_self">enable_if&lt; false, T &gt;</a></td><td class="desc">std::enable_if (false specialization) </td></tr>
+<tr id="row_0_6_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1integral__constant.html" target="_self">integral_constant</a></td><td class="desc">std::integral_constant </td></tr>
+<tr id="row_0_6_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__arithmetic.html" target="_self">is_arithmetic</a></td><td class="desc">std::is_arithmetic </td></tr>
+<tr id="row_0_6_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of.html" target="_self">is_base_of</a></td><td class="desc">std::is_base_of </td></tr>
+<tr id="row_0_6_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_27_" class="arrow" onclick="toggleFolder('0_6_27_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper.html" target="_self">is_base_of_helper</a></td><td class="desc">Helper for std::is_base_of </td></tr>
+<tr id="row_0_6_27_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper_1_1dummy.html" target="_self">dummy</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__floating__point.html" target="_self">is_floating_point</a></td><td class="desc">std::is_floating_point </td></tr>
+<tr id="row_0_6_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__fundamental.html" target="_self">is_fundamental</a></td><td class="desc">std::is_fundamental </td></tr>
+<tr id="row_0_6_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral.html" target="_self">is_integral</a></td><td class="desc">std::is_integral </td></tr>
+<tr id="row_0_6_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01char_01_4.html" target="_self">is_integral&lt; char &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01T_01_4.html" target="_self">is_integral&lt; const T &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01volatile_01T_01_4.html" target="_self">is_integral&lt; const volatile T &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01int_01_4.html" target="_self">is_integral&lt; int &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01_4.html" target="_self">is_integral&lt; long &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01long_01_4.html" target="_self">is_integral&lt; long long &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01short_01_4.html" target="_self">is_integral&lt; short &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01signed_01char_01_4.html" target="_self">is_integral&lt; signed char &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01char_01_4.html" target="_self">is_integral&lt; unsigned char &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01int_01_4.html" target="_self">is_integral&lt; unsigned int &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01_4.html" target="_self">is_integral&lt; unsigned long &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01long_01_4.html" target="_self">is_integral&lt; unsigned long long &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01short_01_4.html" target="_self">is_integral&lt; unsigned short &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01volatile_01T_01_4.html" target="_self">is_integral&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer.html" target="_self">is_pointer</a></td><td class="desc">std::is_pointer </td></tr>
+<tr id="row_0_6_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper.html" target="_self">is_pointer_helper</a></td><td class="desc">Helper for std::is_pointer (false specialization) </td></tr>
+<tr id="row_0_6_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper_3_01T_01_5_01_4.html" target="_self">is_pointer_helper&lt; T * &gt;</a></td><td class="desc">Helper for std::is_pointer (true specialization) </td></tr>
+<tr id="row_0_6_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same.html" target="_self">is_same</a></td><td class="desc">std::is_same (false specialization) </td></tr>
+<tr id="row_0_6_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same_3_01A_00_01A_01_4.html" target="_self">is_same&lt; A, A &gt;</a></td><td class="desc">std::is_same (true specialization) </td></tr>
+<tr id="row_0_6_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__trivially__copyable.html" target="_self">is_trivially_copyable</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__void.html" target="_self">is_void</a></td><td class="desc">std::is_void </td></tr>
+<tr id="row_0_6_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile.html" target="_self">is_volatile</a></td><td class="desc">std::is_volatile </td></tr>
+<tr id="row_0_6_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile_3_01volatile_01T_01_4.html" target="_self">is_volatile&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_6_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1nullptr__t.html" target="_self">nullptr_t</a></td><td class="desc">std::nullptr_t </td></tr>
+<tr id="row_0_6_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const.html" target="_self">remove_const</a></td><td class="desc">std::remove_const (non-const specialization) </td></tr>
+<tr id="row_0_6_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const_3_01const_01T_01_4.html" target="_self">remove_const&lt; const T &gt;</a></td><td class="desc">std::remove_const (const specialization) </td></tr>
+<tr id="row_0_6_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__cv.html" target="_self">remove_cv</a></td><td class="desc">std::remove_cv </td></tr>
+<tr id="row_0_6_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile.html" target="_self">remove_volatile</a></td><td class="desc">std::remove_volatile (non-volatile specialization) </td></tr>
+<tr id="row_0_6_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile_3_01volatile_01T_01_4.html" target="_self">remove_volatile&lt; volatile T &gt;</a></td><td class="desc">std::remove_volatile (volatile specialization) </td></tr>
+<tr id="row_0_6_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1platform_1_1unique__ptr.html" target="_self">unique_ptr</a></td><td class="desc">std::unique_ptr </td></tr>
+<tr id="row_0_7_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_7_" class="arrow" onclick="toggleFolder('0_7_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction.html" target="_self">reduction</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_0_" class="arrow" onclick="toggleFolder('0_7_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_0_0_" class="arrow" onclick="toggleFolder('0_7_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK.html" target="_self">ReduceSplitK</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" target="_self">Params</a></td><td class="desc"><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" title="Params structure. ">Params</a> structure </td></tr>
+<tr id="row_0_7_0_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_1_" class="arrow" onclick="toggleFolder('0_7_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" target="_self">Reduce</a></td><td class="desc">Structure to compute the thread level reduction </td></tr>
+<tr id="row_0_7_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01AlignedArray_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, AlignedArray&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for AlignedArray&lt;half_t, N&gt; </td></tr>
+<tr id="row_0_7_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;half_t, N&gt; </td></tr>
+<tr id="row_0_7_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;T, N&gt; </td></tr>
+<tr id="row_0_7_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01T_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, T &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for "plus" (a functional operator) </td></tr>
+<tr id="row_0_7_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_1_5_" class="arrow" onclick="toggleFolder('0_7_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd.html" target="_self">ReduceAdd</a></td><td class="desc">Mixed-precision reduction </td></tr>
+<tr id="row_0_7_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReduction.html" target="_self">BatchedReduction</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_3_" class="arrow" onclick="toggleFolder('0_7_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits.html" target="_self">BatchedReductionTraits</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
+<tr id="row_0_7_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1DefaultBlockSwizzle.html" target="_self">DefaultBlockSwizzle</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_8_" class="arrow" onclick="toggleFolder('0_8_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference.html" target="_self">reference</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_0_" class="arrow" onclick="toggleFolder('0_8_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast.html" target="_self">Cast</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01int8__t_01_4.html" target="_self">Cast&lt; float, int8_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01uint8__t_01_4.html" target="_self">Cast&lt; float, uint8_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_1_" class="arrow" onclick="toggleFolder('0_8_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_" class="arrow" onclick="toggleFolder('0_8_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_0_" class="arrow" onclick="toggleFolder('0_8_1_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_0_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_1_" class="arrow" onclick="toggleFolder('0_8_1_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc">Computes a random uniform distribution </td></tr>
+<tr id="row_0_8_1_0_1_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_2_" class="arrow" onclick="toggleFolder('0_8_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc.html" target="_self">TensorCopyDiagonalInFunc</a></td><td class="desc">Copies a diagonal into a tensor </td></tr>
+<tr id="row_0_8_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_3_" class="arrow" onclick="toggleFolder('0_8_1_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc.html" target="_self">TensorCopyDiagonalOutFunc</a></td><td class="desc">Copies the diagonal out of a tensor </td></tr>
+<tr id="row_0_8_1_0_3_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_4_" class="arrow" onclick="toggleFolder('0_8_1_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc">Fills the diagonal of a tensor </td></tr>
+<tr id="row_0_8_1_0_4_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_5_" class="arrow" onclick="toggleFolder('0_8_1_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc">Fills a tensor with a linear combination of its coordinates </td></tr>
+<tr id="row_0_8_1_0_5_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_6_" class="arrow" onclick="toggleFolder('0_8_1_0_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc.html" target="_self">TensorFillRandomGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
+<tr id="row_0_8_1_0_6_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_7_" class="arrow" onclick="toggleFolder('0_8_1_0_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random uniform distribution </td></tr>
+<tr id="row_0_8_1_0_7_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_8_" class="arrow" onclick="toggleFolder('0_8_1_0_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc.html" target="_self">TensorUpdateDiagonalFunc</a></td><td class="desc">Writes a value to a tensor's diagonal elements </td></tr>
+<tr id="row_0_8_1_0_8_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_9_" class="arrow" onclick="toggleFolder('0_8_1_0_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc">Writes a value to a tensor's off-diagonal elements </td></tr>
+<tr id="row_0_8_1_0_9_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
+<tr id="row_0_8_1_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_" class="arrow" onclick="toggleFolder('0_8_1_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_0_" class="arrow" onclick="toggleFolder('0_8_1_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
+<tr id="row_0_8_1_1_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
+<tr id="row_0_8_1_1_0_1_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
+<tr id="row_0_8_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_2_" class="arrow" onclick="toggleFolder('0_8_1_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1thread_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc">Thread-level blocked general matrix product </td></tr>
+<tr id="row_0_8_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout4e016ab7cfc644acd7cb4ae770339773.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
+<tr id="row_0_8_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout30b72addd464a2ca4a26785cbfd77a8e.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
+<tr id="row_0_8_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout660562b232f408218828ca5915b7e73a.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc &gt;</a></td><td class="desc">Partial specialization for XOR-popc </td></tr>
+<tr id="row_0_8_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorDiagonalForEach.html" target="_self">TensorDiagonalForEach</a></td><td class="desc">Launches a kernel calling a functor for each element along a tensor's diagonal </td></tr>
+<tr id="row_0_8_1_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorForEach.html" target="_self">TensorForEach</a></td><td class="desc">Launches a kernel calling a functor for each element in a tensor's index space </td></tr>
+<tr id="row_0_8_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_2_" class="arrow" onclick="toggleFolder('0_8_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host.html" target="_self">host</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_2_0_" class="arrow" onclick="toggleFolder('0_8_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
+<tr id="row_0_8_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomGaussianFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
+<tr id="row_0_8_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomUniformFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
+<tr id="row_0_8_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorContainsFunc.html" target="_self">TensorContainsFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorCopyIf.html" target="_self">TensorCopyIf</a></td><td class="desc">Helper to conditionally copy between tensor views </td></tr>
+<tr id="row_0_8_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorEqualsFunc.html" target="_self">TensorEqualsFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillFunc.html" target="_self">TensorFillFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillGaussianFunc.html" target="_self">TensorFillGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
+<tr id="row_0_8_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random uniform distribution </td></tr>
+<tr id="row_0_8_2_0_12_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
+<tr id="row_0_8_2_0_13_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
+<tr id="row_0_8_2_0_14_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFuncBinaryOp.html" target="_self">TensorFuncBinaryOp</a></td><td class="desc">Helper to apply a binary operator in place </td></tr>
+<tr id="row_0_8_2_0_15_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_0_16_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TrivialConvert.html" target="_self">TrivialConvert</a></td><td class="desc">Helper to convert between types </td></tr>
+<tr id="row_0_8_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
+<tr id="row_0_8_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_193dd3a37f00deff1e5dcd7c310afb1f.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
+<tr id="row_0_8_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_55729eac7dbd6bf311ea36f680e83e93.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
+<tr id="row_0_8_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_4f3f32c4b336238abfd741e87bfced46.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc &gt;</a></td><td class="desc">Partial specialization for XOR-popc </td></tr>
+<tr id="row_0_9_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_9_" class="arrow" onclick="toggleFolder('0_9_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_9_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1thread_1_1Matrix.html" target="_self">Matrix</a></td><td class="desc">Per-thread matrix object storing a packed matrix </td></tr>
+<tr id="row_0_10_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_10_" class="arrow" onclick="toggleFolder('0_10_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform.html" target="_self">transform</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_0_" class="arrow" onclick="toggleFolder('0_10_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1thread_1_1Transpose.html" target="_self">Transpose</a></td><td class="desc">Transforms a fragment by doing a transpose </td></tr>
+<tr id="row_0_10_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1thread_1_1Transpose_3_01ElementCount___00_01layout_1_1PitchLinearS99f8e05faf0bb5ed48a0154afe740d81.html" target="_self">Transpose&lt; ElementCount_, layout::PitchLinearShape&lt; 4, 4 &gt;, int8_t &gt;</a></td><td class="desc">Specialization for int8_t 4x4 transpose </td></tr>
+<tr id="row_0_10_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_1_" class="arrow" onclick="toggleFolder('0_10_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator.html" target="_self">PredicatedTileAccessIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile.html" target="_self">PredicatedTileAccessIterator2dThreadTile</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_2_" class="arrow" onclick="toggleFolder('0_10_1_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__da632779aba661c0f4cfaaa78126b771.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__18e9cf25bb3b8edfaad595241a6dc2d7.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_3_" class="arrow" onclick="toggleFolder('0_10_1_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__1790abaa54a01f277d75766d5882fec8.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__8ccc62d47a092afc8bee32ffe9d1e4ba.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_4_" class="arrow" onclick="toggleFolder('0_10_1_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__7327fa15996bcb8502cdfcc192350fe1.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__a56cbccec33ee916292ad9d068474609.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_5_" class="arrow" onclick="toggleFolder('0_10_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen89c687c583745a73cb485041911a4c4e.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenc07b5ec72f83e782121ac629288d61fe.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_6_" class="arrow" onclick="toggleFolder('0_10_1_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenab63a1e105bf37f6371516cb9e2c5a7a.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_6_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemena9b06926a275b569ee9f7f142604b997.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_7_" class="arrow" onclick="toggleFolder('0_10_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen784a0e9da3f55064c47e5613791f51f7.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen41e459f664d17473570cf22fb616845f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_8_" class="arrow" onclick="toggleFolder('0_10_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen9838736ad62fae54213fbaf722a989ab.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen44ce348364e78f5a56fa0c2cef6af930.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_9_" class="arrow" onclick="toggleFolder('0_10_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen809793e785fb4211888c6b4e5dcfcb39.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen058417e2cdd86f3cd6ad5458581571c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile.html" target="_self">PredicatedTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_12_" class="arrow" onclick="toggleFolder('0_10_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0165b39a630d10785a3558406f9adb99b9.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_01e11ed7192af5d7ad1bce5641fa13112e.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_13_" class="arrow" onclick="toggleFolder('0_10_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_017a517f3c73efd795ab05059cc9b111e1.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0b878062cc0cd214bf7e17d74ff17e246.html" target="_self">AccessType</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_13_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0145ef045e8f7d57dc718098adcb00cf3d.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_14_" class="arrow" onclick="toggleFolder('0_10_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_013671177d6219bfeb0e1b4dc4c1b5bf11.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0102e766863c6ac9ec2063a02c4803eecb.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_15_" class="arrow" onclick="toggleFolder('0_10_1_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0068b3e874b5d93d11f0fa902c7f1d11d9.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00a6b756b1bcfbb35fe4a3e68ff074e380.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_16_" class="arrow" onclick="toggleFolder('0_10_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00f6b3a9dfab5e7c72d5233f7e5e6e3b9b.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00ebd1a63351e1085d0b718582ec7b06c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_17_" class="arrow" onclick="toggleFolder('0_10_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00e7c2c404e7aedfe60ad56bb5571306a1.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___006a5f2f7a8271031e6cdc5daa5441f2af.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_18_" class="arrow" onclick="toggleFolder('0_10_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0041ea81994f8af0d4d071fdb9e66b5ff0.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___004d0f9b5e19c29acc17bcdc360dafebbd.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_19_" class="arrow" onclick="toggleFolder('0_10_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00d670f969180a8d182dffb356ebcc957e.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___009fd89f6dad84238fd7d63df0a0c0364f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
+<tr id="row_0_10_1_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator.html" target="_self">RegularTileAccessIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__eb7d20f8b9d69e0ae5e7ef51dc480867.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__2c1476eaf582bfe972793e17babfe985.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__a3c11cf1f00ef7a1efb8389ac6e4c6e0.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0855e9d9ab619202d2397180c1e4c4a5.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__f04332958a49a47d6fb2b25201764630.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__6baada077236f1a368c61c5e11b45b72.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0184b7188941788a96624510a4b2f876.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_28_" class="arrow" onclick="toggleFolder('0_10_1_28_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__ebf4714349612673e8b6609b763eeb6f.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_28_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_0a9491607d11be8e1780e79ad711aa42.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_29_" class="arrow" onclick="toggleFolder('0_10_1_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__e9a9e0f4286f652f55eb9b863b21effe.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_3be8b96d170d886f39b6b30acab65e7a.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator.html" target="_self">RegularTileIterator</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile.html" target="_self">RegularTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Eleb60d066756d1c18f05fceee6a27bdb8a.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
+<tr id="row_0_10_1_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele76ed82829532ae1c17f4c78158f036c7.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear + 2d thread-tiled threadmapping </td></tr>
+<tr id="row_0_10_1_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele654c8f6161ae5340f040397a4e2e045c.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
+<tr id="row_0_10_1_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_011d3637dbd8bc58bcb020b51bf57fbfc0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for column-major </td></tr>
+<tr id="row_0_10_1_36_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_017982f81d4ef592e19c8427de2ea933a3.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_37_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_010889a732373c350de9b9a9f6c13cd761.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_38_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01187f8574e1fe9d7d5e8fbf09bd834bf0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_39_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01793f74bfd8f116a827948ab01a37349a.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_40_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01bd31b3810c1fedf2e7e5959ff92b5d3d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kRow &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_41_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0184a89653916f5d51ab59d1b386989a17.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
+<tr id="row_0_10_1_42_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0149454d361ea5885cf5166a920b5145df.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for row-major </td></tr>
+<tr id="row_0_10_1_43_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01c20d35180520077a5a09b1e33543c1a5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_44_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a31b454d9c930525c1e9ca406a514f40.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_45_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0104ad31bd559a88cc418ae1cab7492ed5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_46_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f6f6511b5033cad31083644ac69c54d8.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_47_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01b3fa5720e807697de61b9f937b269cd0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kColumn &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_48_" class="arrow" onclick="toggleFolder('0_10_1_48_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01efd5013a2503d6567e2bf6b40c97360c.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_48_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_052caec9d5bceeb59b9a13cb3338ce64d.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_49_" class="arrow" onclick="toggleFolder('0_10_1_49_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0197fef2242a3454a7d1cebe61aee28b43.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_49_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_039093927f4b1ee61538c569bf1ae4efd.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_50_" class="arrow" onclick="toggleFolder('0_10_1_50_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a75d2cd74e722d6ad6a3b41aabfd432d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_50_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_02d305cfb0b55c6fb236a52cf2240651e.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_51_" class="arrow" onclick="toggleFolder('0_10_1_51_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f96bbeb63e6d4ce4a2551279de3a9f0e.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_51_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_032f88d1be8b209e44a4815c707ba35bb.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_1_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_52_" class="arrow" onclick="toggleFolder('0_10_1_52_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01dbd6b8468d5bd787308d2f615a24d123.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kContiguous &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_1_52_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0390833403016f5d817416e20828845df.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
+<tr id="row_0_10_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_3_" class="arrow" onclick="toggleFolder('0_10_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread0082c3467229b12cc9dd996283ee7160.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap&lt; Shape_, Threads, cutlass::layout::PitchLinearShape&lt; 4, 4 &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread896c01a3c466da1bf392e0cdfced4d53.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
+<tr id="row_0_10_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_4_" class="arrow" onclick="toggleFolder('0_10_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap.html" target="_self">PitchLinearStripminedThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
+<tr id="row_0_10_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadContiguous.html" target="_self">PitchLinearTilePolicyStripminedThreadContiguous</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadStrided.html" target="_self">PitchLinearTilePolicyStripminedThreadStrided</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_7_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_7_" class="arrow" onclick="toggleFolder('0_10_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap.html" target="_self">PitchLinearWarpRakedThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_7_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
+<tr id="row_0_10_8_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_8_" class="arrow" onclick="toggleFolder('0_10_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap.html" target="_self">PitchLinearWarpStripedThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_8_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
+<tr id="row_0_10_9_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_9_" class="arrow" onclick="toggleFolder('0_10_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap.html" target="_self">TransposePitchLinearThreadMap</a></td><td class="desc"></td></tr>
+<tr id="row_0_10_9_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
+<tr id="row_0_10_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap2DThreadTile.html" target="_self">TransposePitchLinearThreadMap2DThreadTile</a></td><td class="desc">Thread map that presents a 2D thread-tiled mapping as a transposed PitchLinear2DThreadTile mapping </td></tr>
+<tr id="row_0_10_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMapSimt.html" target="_self">TransposePitchLinearThreadMapSimt</a></td><td class="desc"></td></tr>
+<tr id="row_0_11_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1AlignedArray.html" target="_self">AlignedArray</a></td><td class="desc">Aligned array type </td></tr>
+<tr id="row_0_12_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1AlignedBuffer.html" target="_self">AlignedBuffer</a></td><td class="desc">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment </td></tr>
+<tr id="row_0_13_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_13_" class="arrow" onclick="toggleFolder('0_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html" target="_self">Array&lt; T, N, false &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
+<tr id="row_0_13_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
+<tr id="row_0_13_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html" target="_self">const_reference</a></td><td class="desc">Reference object extracts sub-byte items </td></tr>
+<tr id="row_0_13_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
+<tr id="row_0_13_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
+<tr id="row_0_13_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reference.html" target="_self">reference</a></td><td class="desc">Reference object inserts or extracts sub-byte items </td></tr>
+<tr id="row_0_13_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
+<tr id="row_0_14_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_14_" class="arrow" onclick="toggleFolder('0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4.html" target="_self">Array&lt; T, N, true &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
+<tr id="row_0_14_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
+<tr id="row_0_14_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
+<tr id="row_0_14_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
+<tr id="row_0_14_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
+<tr id="row_0_15_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1CommandLine.html" target="_self">CommandLine</a></td><td class="desc"></td></tr>
+<tr id="row_0_16_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1complex.html" target="_self">complex</a></td><td class="desc"></td></tr>
+<tr id="row_0_17_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1ConstSubbyteReference.html" target="_self">ConstSubbyteReference</a></td><td class="desc"></td></tr>
+<tr id="row_0_18_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Coord.html" target="_self">Coord</a></td><td class="desc">Statically-sized array specifying Coords within a tensor </td></tr>
+<tr id="row_0_19_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1cuda__exception.html" target="_self">cuda_exception</a></td><td class="desc">C++ exception wrapper for CUDA <code>cudaError_t</code> </td></tr>
+<tr id="row_0_20_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Distribution.html" target="_self">Distribution</a></td><td class="desc"><a class="el" href="structcutlass_1_1Distribution.html" title="Distribution type. ">Distribution</a> type </td></tr>
+<tr id="row_0_21_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divide__assert.html" target="_self">divide_assert</a></td><td class="desc"></td></tr>
+<tr id="row_0_22_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides.html" target="_self">divides</a></td><td class="desc"></td></tr>
+<tr id="row_0_23_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_24_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_25_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType.html" target="_self">FloatType</a></td><td class="desc">Defines a floating-point type based on the number of exponent and mantissa bits </td></tr>
+<tr id="row_0_26_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_0111_00_0152_01_4.html" target="_self">FloatType&lt; 11, 52 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_015_00_0110_01_4.html" target="_self">FloatType&lt; 5, 10 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_28_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_018_00_0123_01_4.html" target="_self">FloatType&lt; 8, 23 &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_29_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1half__t.html" target="_self">half_t</a></td><td class="desc">IEEE half-precision floating-point type </td></tr>
+<tr id="row_0_30_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1HostTensor.html" target="_self">HostTensor</a></td><td class="desc">Host tensor </td></tr>
+<tr id="row_0_31_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1IdentityTensorLayout.html" target="_self">IdentityTensorLayout</a></td><td class="desc"></td></tr>
+<tr id="row_0_32_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1integer__subbyte.html" target="_self">integer_subbyte</a></td><td class="desc">4-bit signed integer type </td></tr>
+<tr id="row_0_33_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType.html" target="_self">IntegerType</a></td><td class="desc">Defines integers based on size and whether they are signed </td></tr>
+<tr id="row_0_34_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01false_01_4.html" target="_self">IntegerType&lt; 1, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_35_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01true_01_4.html" target="_self">IntegerType&lt; 1, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_36_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01false_01_4.html" target="_self">IntegerType&lt; 16, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_37_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01true_01_4.html" target="_self">IntegerType&lt; 16, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_38_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01false_01_4.html" target="_self">IntegerType&lt; 32, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_39_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01true_01_4.html" target="_self">IntegerType&lt; 32, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_40_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01false_01_4.html" target="_self">IntegerType&lt; 4, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_41_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01true_01_4.html" target="_self">IntegerType&lt; 4, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_42_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01false_01_4.html" target="_self">IntegerType&lt; 64, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_43_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01true_01_4.html" target="_self">IntegerType&lt; 64, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_44_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01false_01_4.html" target="_self">IntegerType&lt; 8, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_45_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01true_01_4.html" target="_self">IntegerType&lt; 8, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_46_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1is__pow2.html" target="_self">is_pow2</a></td><td class="desc"></td></tr>
+<tr id="row_0_47_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1KernelLaunchConfiguration.html" target="_self">KernelLaunchConfiguration</a></td><td class="desc">Structure containing the basic launch configuration of a CUDA kernel </td></tr>
+<tr id="row_0_48_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down.html" target="_self">log2_down</a></td><td class="desc"></td></tr>
+<tr id="row_0_49_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_down&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_50_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up.html" target="_self">log2_up</a></td><td class="desc"></td></tr>
+<tr id="row_0_51_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_up&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_52_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixCoord.html" target="_self">MatrixCoord</a></td><td class="desc"></td></tr>
+<tr id="row_0_53_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixShape.html" target="_self">MatrixShape</a></td><td class="desc">Describes the size of a matrix tile </td></tr>
+<tr id="row_0_54_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Max.html" target="_self">Max</a></td><td class="desc"></td></tr>
+<tr id="row_0_55_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum.html" target="_self">maximum</a></td><td class="desc"></td></tr>
+<tr id="row_0_56_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">maximum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_57_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01float_01_4.html" target="_self">maximum&lt; float &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_58_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Min.html" target="_self">Min</a></td><td class="desc"></td></tr>
+<tr id="row_0_59_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum.html" target="_self">minimum</a></td><td class="desc"></td></tr>
+<tr id="row_0_60_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minimum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_61_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01float_01_4.html" target="_self">minimum&lt; float &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_62_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus.html" target="_self">minus</a></td><td class="desc"></td></tr>
+<tr id="row_0_63_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_64_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_65_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies.html" target="_self">multiplies</a></td><td class="desc"></td></tr>
+<tr id="row_0_66_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_67_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_68_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add.html" target="_self">multiply_add</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_69_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01half__t_00_01N_01_4_00_01Array_3_01half__t_00_01N_01adaeadb27c0e4439444709c0eb30963.html" target="_self">multiply_add&lt; Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_70_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiply_add&lt; Array&lt; T, N &gt;, Array&lt; T, N &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_71_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_72_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01T_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, T, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_73_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01T_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; T, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
+<tr id="row_0_74_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate.html" target="_self">negate</a></td><td class="desc"></td></tr>
+<tr id="row_0_75_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_76_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_77_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter.html" target="_self">NumericArrayConverter</a></td><td class="desc">Conversion operator for Array </td></tr>
+<tr id="row_0_78_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_012_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, 2, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;float, 2&gt; &lt;= Array&lt;half_t, 2&gt;, round to nearest </td></tr>
+<tr id="row_0_79_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;float&gt; &lt;= Array&lt;half&gt; </td></tr>
+<tr id="row_0_80_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_012_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, 2, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Partial specialization for Array&lt;half, 2&gt; &lt;= Array&lt;float, 2&gt;, round to nearest </td></tr>
+<tr id="row_0_81_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;half&gt; &lt;= Array&lt;float&gt; </td></tr>
+<tr id="row_0_82_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter.html" target="_self">NumericConverter</a></td><td class="desc"></td></tr>
+<tr id="row_0_83_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01float_00_01half__t_00_01Round_01_4.html" target="_self">NumericConverter&lt; float, half_t, Round &gt;</a></td><td class="desc">Partial specialization for float &lt;= <a class="el" href="structcutlass_1_1half__t.html" title="IEEE half-precision floating-point type. ">half_t</a> </td></tr>
+<tr id="row_0_84_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Specialization for round-to-nearest </td></tr>
+<tr id="row_0_85_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__toward__zero_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_toward_zero &gt;</a></td><td class="desc">Specialization for round-toward-zero </td></tr>
+<tr id="row_0_86_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01int8__t_00_01float_00_01Round_01_4.html" target="_self">NumericConverter&lt; int8_t, float, Round &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_87_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01T_00_01T_00_01Round_01_4.html" target="_self">NumericConverter&lt; T, T, Round &gt;</a></td><td class="desc">Partial specialization for the identity conversion T &lt;= T </td></tr>
+<tr id="row_0_88_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverterClamp.html" target="_self">NumericConverterClamp</a></td><td class="desc"></td></tr>
+<tr id="row_0_89_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus.html" target="_self">plus</a></td><td class="desc"></td></tr>
+<tr id="row_0_90_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_91_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_92_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_92_" class="arrow" onclick="toggleFolder('0_92_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector.html" target="_self">PredicateVector</a></td><td class="desc">Statically sized array of bits implementing </td></tr>
+<tr id="row_0_92_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1ConstIterator.html" target="_self">ConstIterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read access to predicates </td></tr>
+<tr id="row_0_92_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" target="_self">Iterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read and write access to predicates </td></tr>
+<tr id="row_0_92_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector_1_1TrivialIterator.html" target="_self">TrivialIterator</a></td><td class="desc"><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" title="An iterator implementing Predicate Iterator Concept enabling sequential read and write access to pred...">Iterator</a> that always returns true </td></tr>
+<tr id="row_0_93_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType.html" target="_self">RealType</a></td><td class="desc">Used to determine the real-valued underlying type of a numeric type T </td></tr>
+<tr id="row_0_94_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType_3_01complex_3_01T_01_4_01_4.html" target="_self">RealType&lt; complex&lt; T &gt; &gt;</a></td><td class="desc">Partial specialization for complex-valued type </td></tr>
+<tr id="row_0_95_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory.html" target="_self">ReferenceFactory</a></td><td class="desc"></td></tr>
+<tr id="row_0_96_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01false_01_4.html" target="_self">ReferenceFactory&lt; Element, false &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_97_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01true_01_4.html" target="_self">ReferenceFactory&lt; Element, true &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_98_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ScalarIO.html" target="_self">ScalarIO</a></td><td class="desc">Helper to enable formatted printing of CUTLASS scalar types to an ostream </td></tr>
+<tr id="row_0_99_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Semaphore.html" target="_self">Semaphore</a></td><td class="desc">CTA-wide semaphore for inter-CTA synchronization </td></tr>
+<tr id="row_0_100_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits.html" target="_self">sizeof_bits</a></td><td class="desc">Defines the size of an element in bits </td></tr>
+<tr id="row_0_101_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01Array_3_01T_00_01N_00_01RegisterSized_01_4_01_4.html" target="_self">sizeof_bits&lt; Array&lt; T, N, RegisterSized &gt; &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for Array&lt;T, N, RegisterSized&gt; </td></tr>
+<tr id="row_0_102_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01bin1__t_01_4.html" target="_self">sizeof_bits&lt; bin1_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for bin1_t </td></tr>
+<tr id="row_0_103_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01int4b__t_01_4.html" target="_self">sizeof_bits&lt; int4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for int4b_t </td></tr>
+<tr id="row_0_104_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint1b__t_01_4.html" target="_self">sizeof_bits&lt; uint1b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint1b_t </td></tr>
+<tr id="row_0_105_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint4b__t_01_4.html" target="_self">sizeof_bits&lt; uint4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint4b_t </td></tr>
+<tr id="row_0_106_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sqrt__est.html" target="_self">sqrt_est</a></td><td class="desc"></td></tr>
+<tr id="row_0_107_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1SubbyteReference.html" target="_self">SubbyteReference</a></td><td class="desc"></td></tr>
+<tr id="row_0_108_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Tensor4DCoord.html" target="_self">Tensor4DCoord</a></td><td class="desc">Defines a canonical 4D coordinate used by tensor operations </td></tr>
+<tr id="row_0_109_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorRef.html" target="_self">TensorRef</a></td><td class="desc"></td></tr>
+<tr id="row_0_110_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorView.html" target="_self">TensorView</a></td><td class="desc"></td></tr>
+<tr id="row_0_111_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits.html" target="_self">TypeTraits</a></td><td class="desc"></td></tr>
+<tr id="row_0_112_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_112_" class="arrow" onclick="toggleFolder('0_112_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; double &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_112_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1integer__type.html" target="_self">integer_type</a></td><td class="desc"></td></tr>
+<tr id="row_0_112_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1unsigned__type.html" target="_self">unsigned_type</a></td><td class="desc"></td></tr>
+<tr id="row_0_113_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01float_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; float &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_114_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_115_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half__t_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half_t &gt; &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_116_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01double_01_4.html" target="_self">TypeTraits&lt; double &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_117_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01float_01_4.html" target="_self">TypeTraits&lt; float &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_118_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01half__t_01_4.html" target="_self">TypeTraits&lt; half_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_119_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int_01_4.html" target="_self">TypeTraits&lt; int &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_120_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int64__t_01_4.html" target="_self">TypeTraits&lt; int64_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_121_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int8__t_01_4.html" target="_self">TypeTraits&lt; int8_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_122_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint64__t_01_4.html" target="_self">TypeTraits&lt; uint64_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_123_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint8__t_01_4.html" target="_self">TypeTraits&lt; uint8_t &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_124_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01unsigned_01_4.html" target="_self">TypeTraits&lt; unsigned &gt;</a></td><td class="desc"></td></tr>
+<tr id="row_0_125_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1xor__add.html" target="_self">xor_add</a></td><td class="desc">Fused XOR-add </td></tr>
+<tr id="row_1_"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_1_" class="arrow" onclick="toggleFolder('1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><b>std</b></td><td class="desc">STL namespace </td></tr>
+<tr id="row_1_0_" class="even" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structstd_1_1numeric__limits_3_01cutlass_1_1half__t_01_4.html" target="_self">numeric_limits&lt; cutlass::half_t &gt;</a></td><td class="desc">Numeric limits </td></tr>
+<tr id="row_2_" class="even"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugType.html" target="_self">DebugType</a></td><td class="desc"></td></tr>
+<tr id="row_3_"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugValue.html" target="_self">DebugValue</a></td><td class="desc"></td></tr>
+</table>
+</div><!-- directory -->
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/arch_2mma_8h.html
+++ b/docs/arch_2mma_8h.html
@ -0,0 +1,156 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: mma.h File Reference</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li><a href="annotated.html"><span>Classes</span></a></li>
+      <li class="current"><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li><a href="files.html"><span>File&#160;List</span></a></li>
+      <li><a href="globals.html"><span>File&#160;Members</span></a></li>
+    </ul>
+  </div>
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div id="nav-path" class="navpath">
+  <ul>
+<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li>  </ul>
+</div>
+</div><!-- top -->
+<div class="header">
+  <div class="summary">
+<a href="#nested-classes">Classes</a> &#124;
+<a href="#namespaces">Namespaces</a>  </div>
+  <div class="headertitle">
+<div class="title">arch/mma.h File Reference</div>  </div>
+</div><!--header-->
+<div class="contents">
+
+<p>Templates exposing architecture support for multiply-add operations.  
+<a href="#details">More...</a></p>
+<div class="textblock"><code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="numeric__types_8h_source.html">cutlass/numeric_types.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="arch_2mma__sm50_8h_source.html">cutlass/arch/mma_sm50.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="arch_2mma__sm60_8h_source.html">cutlass/arch/mma_sm60.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="arch_2mma__sm61_8h_source.html">cutlass/arch/mma_sm61.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="mma__sm70_8h_source.html">cutlass/arch/mma_sm70.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="mma__sm75_8h_source.html">cutlass/arch/mma_sm75.h</a>&quot;</code><br />
+</div><div class="textblock"><div class="dynheader">
+Include dependency graph for arch/mma.h:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma_8h__incl.png" border="0" usemap="#mma_8h" alt=""/></div>
+<map name="mma_8h" id="mma_8h">
+</map>
+</div>
+</div><div class="textblock"><div class="dynheader">
+This graph shows which files directly or indirectly include this file:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma_8h__dep__incl.png" border="0" usemap="#mma_8hdep" alt=""/></div>
+<map name="mma_8hdep" id="mma_8hdep">
+</map>
+</div>
+</div>
+<p><a href="arch_2mma_8h_source.html">Go to the source code of this file.</a></p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
+Classes</h2></td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma.html">cutlass::arch::Mma&lt; Shape_, kThreads_, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table><table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
+Namespaces</h2></td></tr>
+<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/arch_2mma_8h__dep__incl.md5
+++ b/docs/arch_2mma_8h__dep__incl.md5
@ -0,0 +1 @@
+7d16b59e6ba0442b8a275a213d5da3a6
--- a/docs/arch_2mma_8h__incl.md5
+++ b/docs/arch_2mma_8h__incl.md5
@ -0,0 +1 @@
+d1fff3f9d55a262110aa6a456caa91e0
--- a/docs/arch_2mma_8h_source.html
+++ b/docs/arch_2mma_8h_source.html
--- a/docs/arch_2mma__sm50_8h.html
+++ b/docs/arch_2mma__sm50_8h.html
@ -0,0 +1,176 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: mma_sm50.h File Reference</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li><a href="annotated.html"><span>Classes</span></a></li>
+      <li class="current"><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li><a href="files.html"><span>File&#160;List</span></a></li>
+      <li><a href="globals.html"><span>File&#160;Members</span></a></li>
+    </ul>
+  </div>
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div id="nav-path" class="navpath">
+  <ul>
+<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li>  </ul>
+</div>
+</div><!-- top -->
+<div class="header">
+  <div class="summary">
+<a href="#nested-classes">Classes</a> &#124;
+<a href="#namespaces">Namespaces</a>  </div>
+  <div class="headertitle">
+<div class="title">arch/mma_sm50.h File Reference</div>  </div>
+</div><!--header-->
+<div class="contents">
+
+<p>Matrix multiply.  
+<a href="#details">More...</a></p>
+<div class="textblock"><code>#include &quot;<a class="el" href="arch_2mma_8h_source.html">cutlass/arch/mma.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="complex_8h_source.html">cutlass/complex.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
+</div><div class="textblock"><div class="dynheader">
+Include dependency graph for arch/mma_sm50.h:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm50_8h__incl.png" border="0" usemap="#mma__sm50_8h" alt=""/></div>
+<map name="mma__sm50_8h" id="mma__sm50_8h">
+</map>
+</div>
+</div><div class="textblock"><div class="dynheader">
+This graph shows which files directly or indirectly include this file:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm50_8h__dep__incl.png" border="0" usemap="#mma__sm50_8hdep" alt=""/></div>
+<map name="mma__sm50_8hdep" id="mma__sm50_8hdep">
+</map>
+</div>
+</div>
+<p><a href="arch_2mma__sm50_8h_source.html">Go to the source code of this file.</a></p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
+Classes</h2></td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table><table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
+Namespaces</h2></td></tr>
+<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/arch_2mma__sm50_8h__dep__incl.md5
+++ b/docs/arch_2mma__sm50_8h__dep__incl.md5
@ -0,0 +1 @@
+988e6466c703c4e63c9a889b8c3c54b5
--- a/docs/arch_2mma__sm50_8h__incl.md5
+++ b/docs/arch_2mma__sm50_8h__incl.md5
@ -0,0 +1 @@
+03f1613fdffbd6e7575de0d2967d08bf
--- a/docs/arch_2mma__sm50_8h_source.html
+++ b/docs/arch_2mma__sm50_8h_source.html
--- a/docs/arch_2mma__sm60_8h.html
+++ b/docs/arch_2mma__sm60_8h.html
@ -0,0 +1,157 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: mma_sm60.h File Reference</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li><a href="annotated.html"><span>Classes</span></a></li>
+      <li class="current"><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li><a href="files.html"><span>File&#160;List</span></a></li>
+      <li><a href="globals.html"><span>File&#160;Members</span></a></li>
+    </ul>
+  </div>
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div id="nav-path" class="navpath">
+  <ul>
+<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li>  </ul>
+</div>
+</div><!-- top -->
+<div class="header">
+  <div class="summary">
+<a href="#nested-classes">Classes</a> &#124;
+<a href="#namespaces">Namespaces</a>  </div>
+  <div class="headertitle">
+<div class="title">arch/mma_sm60.h File Reference</div>  </div>
+</div><!--header-->
+<div class="contents">
+
+<p>Matrix multiply.  
+<a href="#details">More...</a></p>
+<div class="textblock"><code>#include &lt;cuda_fp16.h&gt;</code><br />
+<code>#include &quot;<a class="el" href="arch_2mma_8h_source.html">cutlass/arch/mma.h</a>&quot;</code><br />
+<code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
+</div><div class="textblock"><div class="dynheader">
+Include dependency graph for arch/mma_sm60.h:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm60_8h__incl.png" border="0" usemap="#mma__sm60_8h" alt=""/></div>
+<map name="mma__sm60_8h" id="mma__sm60_8h">
+</map>
+</div>
+</div><div class="textblock"><div class="dynheader">
+This graph shows which files directly or indirectly include this file:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm60_8h__dep__incl.png" border="0" usemap="#mma__sm60_8hdep" alt=""/></div>
+<map name="mma__sm60_8hdep" id="mma__sm60_8hdep">
+</map>
+</div>
+</div>
+<p><a href="arch_2mma__sm60_8h_source.html">Go to the source code of this file.</a></p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
+Classes</h2></td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 2, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table><table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
+Namespaces</h2></td></tr>
+<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/arch_2mma__sm60_8h__dep__incl.md5
+++ b/docs/arch_2mma__sm60_8h__dep__incl.md5
@ -0,0 +1 @@
+ba69b14e3936946092854211499ae9fa
--- a/docs/arch_2mma__sm60_8h__incl.md5
+++ b/docs/arch_2mma__sm60_8h__incl.md5
@ -0,0 +1 @@
+e820099c55f2397639bb210d76ec4c05
--- a/docs/arch_2mma__sm60_8h_source.html
+++ b/docs/arch_2mma__sm60_8h_source.html
--- a/docs/arch_2mma__sm61_8h.html
+++ b/docs/arch_2mma__sm61_8h.html
@ -0,0 +1,149 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.11"/>
+<title>CUTLASS: mma_sm61.h File Reference</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+  $(document).ready(function() { init_search(); });
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
+  <td id="projectalign" style="padding-left: 0.5em;">
+   <div id="projectname">CUTLASS
+   </div>
+   <div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
+  </td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.11 -->
+<script type="text/javascript">
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+</script>
+  <div id="navrow1" class="tabs">
+    <ul class="tablist">
+      <li><a href="index.html"><span>Main&#160;Page</span></a></li>
+      <li><a href="modules.html"><span>Modules</span></a></li>
+      <li><a href="namespaces.html"><span>Namespaces</span></a></li>
+      <li><a href="annotated.html"><span>Classes</span></a></li>
+      <li class="current"><a href="files.html"><span>Files</span></a></li>
+      <li>
+        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+      </li>
+    </ul>
+  </div>
+  <div id="navrow2" class="tabs2">
+    <ul class="tablist">
+      <li><a href="files.html"><span>File&#160;List</span></a></li>
+      <li><a href="globals.html"><span>File&#160;Members</span></a></li>
+    </ul>
+  </div>
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div id="nav-path" class="navpath">
+  <ul>
+<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li>  </ul>
+</div>
+</div><!-- top -->
+<div class="header">
+  <div class="summary">
+<a href="#nested-classes">Classes</a> &#124;
+<a href="#namespaces">Namespaces</a>  </div>
+  <div class="headertitle">
+<div class="title">arch/mma_sm61.h File Reference</div>  </div>
+</div><!--header-->
+<div class="contents">
+
+<p>Matrix multiply.  
+<a href="#details">More...</a></p>
+<div class="textblock"><code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
+</div><div class="textblock"><div class="dynheader">
+Include dependency graph for arch/mma_sm61.h:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm61_8h__incl.png" border="0" usemap="#mma__sm61_8h" alt=""/></div>
+<map name="mma__sm61_8h" id="mma__sm61_8h">
+</map>
+</div>
+</div><div class="textblock"><div class="dynheader">
+This graph shows which files directly or indirectly include this file:</div>
+<div class="dyncontent">
+<div class="center"><img src="arch_2mma__sm61_8h__dep__incl.png" border="0" usemap="#mma__sm61_8hdep" alt=""/></div>
+<map name="mma__sm61_8hdep" id="mma__sm61_8hdep">
+</map>
+</div>
+</div>
+<p><a href="arch_2mma__sm61_8h_source.html">Go to the source code of this file.</a></p>
+<table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
+Classes</h2></td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 4 &gt;, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 2 &gt;, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
+<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation.  <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html#details">More...</a><br /></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table><table class="memberdecls">
+<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
+Namespaces</h2></td></tr>
+<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
+<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
+</table>
+</div><!-- contents -->
+<!-- start footer part -->
+<hr class="footer"/><address class="footer"><small>
+Generated by &#160;<a href="http://www.doxygen.org/index.html">
+<img class="footer" src="doxygen.png" alt="doxygen"/>
+</a> 1.8.11
+</small></address>
+</body>
+</html>
--- a/docs/arch_2mma__sm61_8h__dep__incl.md5
+++ b/docs/arch_2mma__sm61_8h__dep__incl.md5
@ -0,0 +1 @@
+1faaf1631d5f0e44d6cc6c7121e6972e
--- a/docs/arch_2mma__sm61_8h__incl.md5
+++ b/docs/arch_2mma__sm61_8h__incl.md5
@ -0,0 +1 @@
+8cce8aef2d98c4082d68734b538253c7