Compare commits

...

693 Commits
v1.0.0 ... main

Author SHA1 Message Date
8afb19d904 update CITATION.cff 2025-10-28 23:42:37 -04:00
b2ca083d2b Fixed compilation error when using StreamK scheduler + PDL. (#2686) 2025-10-21 23:11:14 -04:00
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage users to install the DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
e6e2cc29f5 fix (#2684) 2025-10-15 14:46:38 -04:00
c6aeb9179c Update pyproject.toml
update version to 4.2.1
2025-09-24 01:18:51 -04:00
95a5ff14c0 Update CHANGELOG.md
format change
2025-09-23 17:33:00 -04:00
fb8b43ef05 Merge pull request #2669 from NVIDIA/421_update
4.2.1 update
2025-09-23 14:02:29 -07:00
f874df19ac 4.2.1 update 2025-09-23 13:45:13 -07:00
7a6d4ee099 v4.2.1 update. (#2666) 2025-09-23 13:25:43 -04:00
GTO 2b8dff1f90 Fix bfloat16 epsilon (#2607)
* Fix bfloat16 epsilon

* just use constants

---------

Co-authored-by: Konstantin <konstantin@MacBook-Air.local>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-09-21 23:43:59 -04:00
fd0312ddf6 Remove duplicate function calls (#1584) 2025-09-21 23:16:59 -04:00
64579189ec Feature/add bottom causal mask (#2480)
* Rebase to latest

* update

* upd

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Update fmha_fusion.hpp

* Update fmha_fusion.hpp

fixed flipped logic for isQBegin

* Update fmha_fusion.hpp

* Avoid use of booleans

The current expression is confusing

* fmt

* Update fmha_fusion.hpp

Reproduce error/fix with: 
./77_blackwell_fmha_fp16 --verify --b=1 --q=1013 --k=1024 --h=1 --h_k=1 --mask=causal --causal-type=qend

* add test, format

---------

Co-authored-by: Richard Cai <ricai@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-18 17:11:23 -04:00
b234a8c024 Rename python/cutlass to python/cutlass_cppgen (#2652) 2025-09-18 14:26:57 -04:00
74825181f2 Remove old-version dsl examples. (#2644) 2025-09-17 22:23:30 -04:00
8825e8be4f Add required changes for github pipeline. (#2648) 2025-09-17 22:22:45 -04:00
wbn 7817e47154 Fixed a typo in the pipeline description docs. (#2623) 2025-09-15 22:32:27 -04:00
25ccb875b8 Fix: a calculation error in the example of dividing out in the 02_layout_algebra doc (#2635) 2025-09-15 22:31:33 -04:00
29c1ad704a Fix doc cute 03_tensor.md link typo (#2627)
* Update 03_tensor.md fix link typo

change path to relative path

* Update 03_tensor.md

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-15 22:26:43 -04:00
57e3cfb47a doc change for 4.2 (#2639)
* doc change

* fix broken links

* ragged gemm doc update

* move around texts about moe gemm
2025-09-15 22:02:45 -04:00
e7e0adddac Update version.h
change version number to 4.2
2025-09-15 12:40:58 -04:00
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
56f0718a97 ex77 backwards GQA (#2556)
* bwd GQA init

* Update examples/77_blackwell_fmha/77_blackwell_fmha_bwd.cu

* ref kernel type conversion fix

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-09-09 12:53:28 -04:00
76c96b0be3 Fix incorrect shapes in copy_atom doc comments. (#2575) 2025-09-04 16:57:24 -07:00
d98e7bf7ce Fix comment in mma_atom.hpp (#2579) 2025-09-04 16:56:39 -07:00
b6ccf34aef Fix Copy_Atom type mismatch in sgemm_sm80.cu (#2582) 2025-09-04 16:56:17 -07:00
2288c0c901 Fix bugs in matrix.h (#2598) 2025-09-04 16:55:11 -07:00
b2dd65dc86 more robust imports in heuristics.py and heuristics_provider.py (#2596) 2025-08-28 22:32:55 -04:00
496654bf2c Fix sm100 gemm wrong static constexpr that breaks compilation on Windows (#2167)
* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* Fix a sm100 gemm wrong defined static constexpr that breaks compilation on Windows

* More Windows fixes

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>

* Revert "More Windows fixes"

This reverts commit 2e8cfc1382.

---------

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
2025-08-28 22:13:00 -04:00
9ca7e877b2 fix gqa issue for blackwell fmha.py (#2599) 2025-08-28 11:15:20 -04:00
a49a78ffef v4.2 release. (#2587)
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set on the command line.

* v4.2 release.
2025-08-22 18:11:24 -04:00
11cad1f67b fix a typo. (#2561) 2025-08-19 22:23:09 -04:00
931359cec1 Fix typo in functional.h (#2571) 2025-08-19 22:22:31 -04:00
42e7c546c4 Add movmatrix support (movmatrix.sync.aligned.m8n8.trans.b16) (#2562) 2025-08-19 22:22:02 -04:00
ec18e8043b Make swizzle in pycute work (#2553) 2025-08-19 22:21:00 -04:00
5b76420d6a [DOC] Add more exposition to composition example (#2536)
* Add more exposition to composition example

* Apply suggestions from code review

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-08-11 22:20:36 -04:00
19772cd63e Fix typo in smem_allocator.py (#2517) 2025-08-10 22:44:22 -04:00
052afcd314 fix typo (#2529) 2025-08-10 22:44:02 -04:00
86cf63e2d4 NIT: Grammar (#2537) 2025-08-10 22:42:45 -04:00
a267d47f9b Update batched_gemm.cu (#2538) 2025-08-10 22:42:21 -04:00
9e6ab77d27 Fix a copy error in the SM70 main loop when loading data from smem to rmem (#2540) 2025-08-10 22:42:01 -04:00
d0eada85a3 Support both CUDA 12 and 13 cccl header locations (#2543) 2025-08-10 22:41:25 -04:00
23139309e9 Fix incorrect K dim in CuTe MMA Atom doc. (#2544) 2025-08-10 22:40:56 -04:00
6dd13d4278 Facebook:This commit makes its files safe for use with -Wimplicit-fallthrough. (#2324) 2025-07-31 20:55:19 -04:00
3b054767b3 Fix typo (#2514) 2025-07-30 22:14:54 -04:00
6fb5e667c1 [Doc fix] incorrect compute cap. for Blackwell RTX (#2511)
Blackwell RTX is compute capability 12.0 (SM120) but incorrectly listed
as SM100 in the README.
2025-07-30 22:14:13 -04:00
6c891db9f6 Fix epilogue::thread::Convert cannot be used with cute::collective::DefaultEpilogue. (#2333) 2025-07-30 22:12:53 -04:00
da47886e34 Fix example bug (#2351) 2025-07-30 22:12:33 -04:00
26b7450023 support fp16 accumulator for sm89 fp8 mma (#2378)
* add support for sm89 in cute and the unit tests

* support fp16 accumulator for sm89 fp8 mma

* format code
2025-07-30 22:12:08 -04:00
a39cf6b511 Fix example in CuTe tutorials (#2416) 2025-07-30 22:11:47 -04:00
f09045d660 Corrected minor nit in mma_traits.hpp (#2447)
* Corrected minor nit in mma_traits.hpp

The entry and descriptions were jumbled up.

* Update mma_traits.hpp

* Update mma_traits.hpp
2025-07-30 22:11:23 -04:00
84a27b3926 fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu GridDim miscalculated (#2492)
* fix: examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu Launch dimGrid error

* feat: add cta tiler

* Update examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu

use cluster_layout_vmnk instead of cta_tiler

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>

* feat: remove cta_tiler

---------

Co-authored-by: qinghongzeng <qinghongzeng@deeproute.ai>
Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-07-30 22:11:04 -04:00
e093b4f691 Fix tutorial comment in sgemm_1.cu: use tCrC instead of tCsA in axpby explanation (#2448) 2025-07-30 22:09:55 -04:00
664c4f7b3e Update CUTLASS version to 4.1
Update CUTLASS version to 4.1.
2025-07-26 20:11:04 -04:00
0e026982ce Example 77 add blackwell fmha bwd for MLA shape (#2466)
* Update examples/77_blackwell_fmha/device/fmha_device_bwd.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

* bug fix & use existing value rather than pass one more argument to support different dim in bwd_convert

* Fix causal mask cnt when IsQBegin==false

* bug fix in causal mask backward

* code sync

---------

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-07-24 18:41:11 -04:00
9a9a579714 Merge pull request #2489 from NVIDIA/update_workflow_script
Support "CuTe DSL" auto-labeling in workflow
2025-07-23 15:33:43 +08:00
51d730b8be Support "CuTe DSL" auto-labeling in workflow 2025-07-23 00:28:01 -07:00
6c0c8b7484 1. Update bug/feature report template to add component selection. (#2485)
2. Add workflow to apply component label automatically
2025-07-22 12:38:03 -04:00
e51efbfe18 Update CHANGELOG.md 2025-07-21 22:09:56 -04:00
fd6cfe1ed0 v4.1 release update v2. (#2481) 2025-07-21 22:03:55 -04:00
9baa06dd57 Add Blackwell MLA forward (shape: d=192, dv=128) implementation in example_77 (#2472) 2025-07-18 01:27:48 -04:00
ebe98c549a cache procedural_name in GemmOperation (#2317) 2025-07-16 22:25:02 -04:00
9892624b66 Fix typos in the text (#2417) 2025-07-16 21:51:12 -04:00
a1aaf2300a v4.1 release 2025-07-03 08:07:53 -04:00
b995f93317 4.0 doc change (#2425) 2025-06-27 09:35:06 -04:00
889ff20648 v4.0 update v2. (#2420)
* Ex77 forward kernel fix.
2025-06-25 12:56:25 -04:00
dc4817921e v4.0 update. (#2398)
* Ex77 fix.
2025-06-12 09:10:29 -04:00
5c6bca0441 Update requirements.txt (#2390)
Remove the dev suffix in the wheel version
2025-06-10 02:31:49 -04:00
c2ad7c5b20 fix link in readme (#2379) 2025-06-07 07:38:38 -04:00
cc23f6d1e9 fix link (#2377) 2025-06-07 06:00:39 -04:00
5a287538c2 "Update CHANGELOG for 4.0 tagging" (#2374) 2025-06-06 10:07:36 -04:00
8bdbfca682 v4.0 update. (#2371) 2025-06-06 02:39:20 -04:00
2e2af190bd Revert "[ex77] fix mla split; add fwd lse; add bwd varlen (#2366)" (#2370)
This reverts commit f12b1d75c9.
2025-06-05 23:14:57 -04:00
f12b1d75c9 [ex77] fix mla split; add fwd lse; add bwd varlen (#2366) 2025-06-05 18:39:46 -04:00
b244379d9b Merge pull request #2359 from NVIDIA/oss_ci
Initial Workflow Definition for blossom-ci support on CUTLASS GitHub
2025-06-03 14:04:35 -07:00
9d165a3b8e Handle get_masked_trip_count for small length in fmha example (#2292)
* handle get_masked_trip_count for small length

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

* Update examples/77_blackwell_fmha/collective/fmha_fusion.hpp

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>

---------

Co-authored-by: Vijay Thakkar <vijaythakkar@me.com>
2025-05-30 22:51:18 -04:00
b9b110a9ea Correct divmod order in example 77 (blackwell fmha) (#2291)
* correct divmod naming

* order bidh/bidb
2025-05-30 22:50:40 -04:00
8206e7a0f5 Pre-compile in CuteDsl/ampere/elementwise_apply.py (#2340) 2025-05-28 10:24:39 -04:00
6316b6f867 Fix typos (#2311)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-23 08:30:10 -04:00
9354bfd7c1 Keep the documentation consistent with the sgemm_1.cu code. (#2285)
* Keep the documentation consistent with the sgemm_1.cu code.

* fix typo

---------

Co-authored-by: zky <zky@126.com>
2025-05-19 22:53:15 -04:00
5e9b8e2a25 fix docx (#2290)
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn>
2025-05-19 22:52:37 -04:00
1ec230c4bf Fix typo (#2299)
Needs == for pip to parse the file
2025-05-15 09:38:42 -04:00
f89cd95b16 Update elementwise_add.ipynb (#2298) 2025-05-15 09:38:27 -04:00
f115c3f854 Release v4.0.0 (#2294) 2025-05-13 15:55:29 -04:00
ad7b2f5e84 3.9.2 doc/version (#2279)
* 3.9.2 doc/version

* whitespace
2025-05-04 00:00:15 -04:00
40f124ef27 [CUTLASS] Add GNA to PUBLICATIONS.md (#2276)
Adds "Generalized Neighborhood Attention" to list of publications using
CUTLASS.

https://arxiv.org/abs/2504.16922

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2025-05-02 16:57:19 -04:00
89f6bf2739 Fix group scale gemm when K==128 (#2275)
Co-authored-by: Jiazhen Han <jiazhenh@nvidia.com>
2025-05-02 15:41:18 -04:00
f535c33634 3.9.1 doc/version change (#2273) 2025-05-01 00:27:00 -04:00
e3cb8a773a Import cuda, cudart, nvrtc lazily (#2251)
* Lazy cuda import

* More lazy cuda import

* More lazy cuda imports

* minor fixes

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-30 23:10:33 -04:00
c4bdfe821c Lazy scipy import (#2250) 2025-04-30 16:10:00 -04:00
b3ce7e12b7 Make cc a positional argument (#2249) 2025-04-30 16:09:25 -04:00
fe75ead92e Import pydot lazily (#2248) 2025-04-30 16:08:17 -04:00
35136f5564 Fix wrong detection of python version for use_rmm. (#2224) 2025-04-30 15:29:33 -04:00
e5b810bed1 Use cudaMemcpyAsync in gemm grouped with kRequiresPrecomputation schedule. (#2256)
Co-authored-by: Yuhang Qi <qiyuhang@bytedance.com>
2025-04-30 15:28:05 -04:00
2b78c2fe31 cherry-pick feature/hopper-blockwise-generalization-optimization (#2270) 2025-04-29 16:47:22 -04:00
697126019e fix blackwell grouped groupwise hang (#2267) 2025-04-29 11:54:20 -04:00
e94e888df3 Update CHANGELOG.md 2025-04-24 21:51:34 -04:00
be73ad20a5 Update CHANGELOG.md for 3.9 2025-04-24 16:54:06 -04:00
f02a7c2976 Update README.md for 3.9 2025-04-24 16:51:45 -04:00
331a1f5b3f cutlass 3.9 update (#2255)
* cutlass 3.9 update

* rebase

* fixes out of shared memory for blockwise Blackwell

* doc format

* fix issue 2253

* disable host ref by default

* fix sm120 smem capacity

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-24 15:42:40 -04:00
8e345c5c5b fix_missing_stdint (#2199)
* Update config.hpp

* Update config.hpp

* Update config.hpp
2025-04-23 22:21:22 -04:00
81a43e6d92 Set EpiTile correctly when TileN is not divisible by 32 (#2220)
If TileN is not divisible by 32 (e.g., 208), EpiTile would default to
128 x 32, which does not compile since EpiTileN is required to divide TileN
2025-04-21 00:02:51 -04:00
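For context, the divisibility constraint this commit works around can be written as a compile-time check; the tile values come from the commit message, while the snippet itself is illustrative:

```
// Illustrative only: the constraint behind #2220 as a static_assert.
// TileN = 208 with the default EpiTileN = 32 fails; 16 divides 208.
constexpr int TileN    = 208;
constexpr int EpiTileN = 16;   // must evenly divide TileN
static_assert(TileN % EpiTileN == 0, "EpiTileN must divide TileN");
```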
ade6376fa0 [SM90] Change register allocation for TileN=208 to avoid spills (#2219)
With the usual register allocation (producer 40, consumer 232), compiling Gemm
with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows lots of
register spilling (e.g. ~3000 bytes of spill). For this case we can change
the register allocation to producer 24, consumer 240, which avoids spills.
2025-04-21 00:02:30 -04:00
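A minimal sketch of the mechanism behind this change, assuming the `cutlass::arch` register-reconfiguration helpers in `reg_reconfig.h` (the actual kernel call sites are not shown in the log):

```
// Sketch: SM90 warp-specialized kernels rebalance registers per warpgroup.
// The commit shifts the split from producer 40 / consumer 232 to 24 / 240.
#include "cutlass/arch/reg_reconfig.h"

__global__ void kernel_sketch() {
  int warp_group_idx = threadIdx.x / 128;        // 128 threads per warpgroup
  if (warp_group_idx == 0) {
    cutlass::arch::warpgroup_reg_dealloc<24>();  // producer shrinks to 24
    // ... issue TMA loads ...
  } else {
    cutlass::arch::warpgroup_reg_alloc<240>();   // consumers grow to 240
    // ... run the MMA mainloop ...
  }
}
```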
bb4dd682dd Fix broken links and alt text in cluster launch control docs (#2234)
* Fix broken links in cluster launch control docs

* Improve titles and alt text
2025-04-21 00:01:12 -04:00
5e497243f7 fix: fig link in cute docs (#2216) 2025-04-10 14:51:41 -04:00
b3f3c7758c Update tile_iterator.cu (#2204)
Some typos in comments
2025-04-10 14:49:58 -04:00
9e1b649827 fix-left-inverse-for-nvcc114 (#2196) 2025-04-10 14:48:46 -04:00
5120b21cc3 suppress compilation warnings (#2195) 2025-04-10 14:48:01 -04:00
dd76dec4ef [Doc] Make C++ code more plausible (#2156)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:35:46 -04:00
19cc2a5feb add support for sm89 in cute and the unit tests (#2177)
* add support for sm89 in cute and the unit tests

* rebase v3.9 and format code

* minor fix

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 14:16:36 -04:00
09df6ac464 [Doc]fix typo (#2174)
Co-authored-by: wenju.li <wenju.li@deepctr.cn>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-04-10 12:46:53 -04:00
df8a550d39 Update mma_atom.hpp (#2159)
remove useless code
2025-04-03 11:42:10 -04:00
79fc51f4b8 v3.9 update (#2213)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-03 02:10:16 -04:00
6f4921858b v3.9 update (#2203)
* v3.9 update

* voidD

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-04-02 15:11:18 -04:00
62750a2b75 v3.9 (#2185)
* v3.8 update x

* fix blackwell gg

* doc change

* doc change

* doc change

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:52:23 -04:00
8c4d1dc47d Treat negative zero as equivalent to positive zero in sm90_sparse_gemm_compressor.hpp (#2110)
* Treat negative zero as zero in the sparse gemm compressor

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* format

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* Apply patch

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* sm90_sparse_gemm_compressor.hpp

* test/unit/transform/CMakeLists.txt

* test/unit/transform/device/sm90_sparse_gemm_compressor_legacy.hpp

* include/cutlass/numeric_types.h

---------

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2025-03-21 01:44:17 -04:00
3fe62887d8 adding blackwell (#2143) 2025-03-17 22:20:40 -04:00
bd03b22f64 fix typo (#2136)
Co-authored-by: XiaoDong <xiaod@nvidia.com>
2025-03-17 22:19:43 -04:00
6c6b78550e Fix SM90 beta=1 hang and stream-K launch errors (#2172)
* Fix stream-K occupancy calculation

* Fix beta=1 hang
2025-03-13 14:07:37 -04:00
06e560d98a Blockwise/Groupwise kernel improvement and programmatic dependent launch enablement (#2161)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-03-10 14:36:11 -04:00
df18f5e4f5 Improvements for: Groupwise scaling along M for FP8 gemm (#2095)
* fix blockwise fp8 kernels

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* wip, < 128 not working

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix < 128

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* reduce diff

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* review comments

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* support partial n blocks

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* fix build errors

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-27 22:39:29 -05:00
ca4fdbea70 Blockwise and Groupwise GEMM for Blackwell and Improvements for Hopper (#2139)
- Blockwise and Groupwise GEMM improvements for Hopper.
- Blockwise and Groupwise GEMM for Blackwell.
- Blockwise Grouped GEMM for Hopper.
- Static ScalePromotionInterval for Hopper FP8 GEMMs.

Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2025-02-26 12:44:58 -05:00
eefa171318 [EVT] Fix Row/Col broadcast with array arguments (#2120)
* Use constexpr in if to prevent invalid comparison.

* Move constexpr check into else scope.
2025-02-21 17:47:30 -05:00
afa1772203 truncate name for cutlass profiler (#2124)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-21 00:16:56 -05:00
9b3772dfa6 Hopper Grouped GEMM support for FP8 Accum (#2123)
* Add support for fp8accum, with profiler extension

* Update .gitignore

* contri

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-20 21:55:26 -05:00
b84e9802d8 update 3.8 v2 (#2112)
* update 3.8 v2

* update 3.8

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-19 22:03:14 -05:00
e9627ce55b Always use cudaGetDriverEntryPoint with CUDA 12 (#2086)
`cudaGetDriverEntryPointByVersion` was added to drivers in 12.5, but the driver version is not known at compile time.
In particular, we can build with nvcc 12.8 against a 12.2 driver, and this was causing the following error:

```
undefined symbol: cudaGetDriverEntryPointByVersion,
```
2025-02-11 13:04:25 -05:00
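As a hedged sketch of the portable pattern this commit falls back to (the symbol name here is chosen only for illustration): `cudaGetDriverEntryPoint` resolves the driver symbol at run time, so no compile-time driver-version guess is needed:

```
// Minimal sketch, not CUTLASS's actual call site.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  void* pfn = nullptr;  // will hold a driver-API function pointer
  cudaDriverEntryPointQueryResult status;
  cudaError_t err = cudaGetDriverEntryPoint("cuGetErrorString", &pfn,
                                            cudaEnableDefault, &status);
  std::printf("resolved: %s\n",
              (err == cudaSuccess && pfn != nullptr) ? "yes" : "no");
  return 0;
}
```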
ad6e1ec19c Add ParetoQ to PUBLICATIONS.md (#2089) 2025-02-10 16:47:02 -05:00
0642d46dd4 Update 0x_gemm_tutorial.md (#2090) 2025-02-10 16:46:43 -05:00
833f6990e0 v3.8.0 update (#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-02-06 21:33:40 -05:00
affd1b693d [EVT] Add support for Row/Col broadcast PtrArray (#2033)
* Add group support to EVT row/col broadcast.

* small modifications

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-02-02 12:10:07 -05:00
6f55278121 bugfix generic-k code in top-k with softmax (#1993)
* bugfix generic-k code in top-k with softmax

* Update include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

* Update examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>

---------

Co-authored-by: Ali Hassani <68103095+alihassanijr@users.noreply.github.com>
2025-01-31 19:05:35 -05:00
3c28697b9f Groupwise scaling along M for FP8 gemm (#2037)
* FP8 groupwise scaling along M

* small updates

---------

Co-authored-by: zl <zl@deepseek.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-31 13:51:28 -05:00
bdd641790a Update README.md 2025-01-28 18:08:13 -05:00
cc19d4d22b fix a readme broken link (#2069) 2025-01-28 18:03:34 -05:00
47daa33c61 fix cuda 12.6 issues (#2066) 2025-01-28 17:28:29 -05:00
389e493055 CUTLASS 3.8 Release (#2059)
* CUTLASS 3.8 Release

* update

* Update README.md

* Revert "Update README.md"

This reverts commit b353e36fe8.

* update

* update

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-25 02:44:06 -05:00
9eb01fa0b0 update 3.7 docs (#2051)
* update docs

* update docs

* update docs

* update docs

* update docs

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-23 15:13:50 -05:00
b78588d163 CUTLASS 3.7 (#2045)
* CUTLASS 3.7

* clean up changelog

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-18 09:53:07 -05:00
902dff3663 fix assertion in integer_subbytes.h (#1961) 2025-01-09 22:47:58 -05:00
ef5620dd1d Blockwise Scaling for FP8 (#1932)
* F8 Blockwise Scaling

* two more NumProducerThreadEvents

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-09 11:22:09 -05:00
375e284e6a Add Line Break (#2020) 2025-01-08 23:46:59 -05:00
52b35e90ce Fix Typos (#2021)
* Fix Typo

* Fix Typo
2025-01-08 23:46:28 -05:00
24f991e879 Fix typo in library_defaults.py (#2024) 2025-01-08 15:44:11 -05:00
51b25e7b58 Add vector-types back to platform.h (#2026) 2025-01-08 15:31:59 -05:00
ZZK 7de6a59784 Add half->int8 saturate conversion to promise valid range (#1983)
* Add half->int8 saturate conversion to promise valid range

* add gpu only macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2025-01-08 09:01:07 -05:00
c506e16788 fix mem fence (#2030)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2025-01-07 19:02:26 -05:00
7494a180a4 fix bug: arch/mma_sm60.h Mma<2,2,1> calculate wrong (#1989) 2025-01-06 22:05:12 -05:00
cffd5d32b7 Update 0x_gemm_tutorial.md (#1982)
Shouldn't this be BLK_M, BLK_**K**, k
2025-01-06 22:04:35 -05:00
bf9da7b76c Update CHANGELOG.md 2024-12-25 17:11:15 -05:00
3d261a5974 3.6.0 update (#2005)
* 3.6.0 update

* doc and swap stuff

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-12-25 01:34:40 -05:00
e1cd8c7866 Fix Typo (#1962) 2024-12-10 22:07:37 -05:00
33c584364e Fix CuTe README Typo (#1951) 2024-12-10 22:05:40 -05:00
2b6cfd34d1 fix a typo that fails the compiling when ElementScale is not the same as MmaType (#1977) 2024-12-10 15:54:44 -05:00
4c42f73fda Improve mixed dtype GEMM (#1972)
* update

* fix a typo
2024-12-06 13:33:22 -05:00
80243e0b8c add {uint4, uint2, int2} => {fp16, bf16} conversion (#1966) 2024-12-03 14:03:43 -05:00
b0e09d7cd3 Fix cutlass python library with cuda 12.6.2.post1 (#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`

Previously we had this error:
```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                       ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```

* Update sm90_utils.py

* Update generator.py

* Update python/cutlass_library/generator.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

* Update python/cutlass_library/sm90_utils.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

---------

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
2024-11-18 09:06:32 -05:00
8aa95dbb88 Fix the racing condition of mixed-input gemm when writing the registers (#1931)
* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
d656afbd2a fix undefined in device code error (#1880) 2024-11-06 14:56:54 -05:00
32e3c38aef remove restriction of stride == kernel in nhwc_pooling (#1896) 2024-11-06 14:54:53 -05:00
9004ed2d1b Update publications (#1912) 2024-11-06 14:54:15 -05:00
19f51596e8 feat: support kFactor 8 used in mma tensor op tile iterator (#1512) 2024-10-29 11:56:59 -04:00
e8a8b69365 Refactor some GroupedGEMM logic (#1899) 2024-10-25 20:14:01 -04:00
08a49953a0 Add a print for the uint{x}b_t type. (#1871) 2024-10-24 14:39:22 -04:00
a424ca6cf9 fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation

* add "print" template for subbyte_reference<T>
2024-10-24 14:38:35 -04:00
be692b48b0 remove redundant hardcoded packing configs in mixed dtype gemm (#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-10-23 14:24:09 -04:00
12626bcfe4 Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
2024-10-23 12:56:36 -04:00
f02913c34e Include of regular_tile_iterator.h fixed for NVRTC (#1765)
* Include of regular_tile_iterator.h fixed for NVRTC

* More include fixed for NVRTC
2024-10-23 12:55:59 -04:00
03e3bffaec Adjusting code indentation (#1639) 2024-10-23 12:55:02 -04:00
e5f3caf145 Fix README (#1658)
* Fix README

* Improve README

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:52:43 -04:00
83ae20c740 added mapping for bf16 to torch::kBFloat16 (#1843)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:48:31 -04:00
b0c09ed077 fix by adding public (#1753) 2024-10-23 12:45:58 -04:00
ea69cc2849 fix typo (#1853) 2024-10-23 12:45:28 -04:00
f3a3bfcbf2 add maximum support (#1833) 2024-10-23 12:44:56 -04:00
d65266a868 Add all supported GMMA shapes (#1890) 2024-10-22 18:13:36 -04:00
5b50a8faaf Add GMMA shape m64n40k16 (#1864) 2024-10-21 20:41:47 -04:00
08101d9d0c Improve sm90 mixed dtype kernel (#1883) 2024-10-17 20:06:38 -04:00
755194a7bd add is_last_tile 2024-10-17 12:11:02 -07:00
53668799b2 Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803) 2024-10-14 19:46:20 -04:00
cc3c29a81a CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
0837a2a00a Fix typo in comment (#1787) 2024-10-07 12:39:59 -04:00
477a677317 Fix typos in test/unit/conv/cache_testbed_output.h (#1652)
Co-authored-by: Alexander Zinoviev <azinoviev@tesla.com>
2024-10-07 12:39:11 -04:00
b27c49e84a Fix cute doc (#1529) 2024-10-07 12:38:32 -04:00
e2b0789927 Add some can implement rules of hopper convolution. (#1835) 2024-09-25 11:28:10 -04:00
44dae8b90e Adjust profiler space for SM89 (#1553) 2024-09-19 11:40:30 -04:00
2991ce18d3 Add print_svg for mma (#1733)
* add print_svg for mma

* correct the code indentation
2024-09-18 10:37:24 -04:00
1ebda1ccef Fix MMA promotion interval assertions (#1641) 2024-09-16 12:38:42 -04:00
9f68995de5 add publication: ‘EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree’ (#1526)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-09-16 11:55:09 -04:00
3a8c01a18b Prefix a member template name with the template keyword. (#1796)
Fixes an LLVM build error.
2024-09-11 13:33:56 -04:00
dbdae514e0 Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795) 2024-09-11 00:07:31 -04:00
21d0534167 fix assertion (#1790) 2024-09-09 14:05:27 -04:00
323c8170bf Support ComputeFn where output type differs from input type (#1771)
This is useful, e.g., for a function that takes two float inputs and turns them into a complex number
2024-09-05 23:25:03 -04:00
82f5075946 set_slice3x3 -> set_slice_3x3 (#1784) 2024-09-05 23:24:10 -04:00
06e337758d Remove extraneous comma in declaration (#1776) 2024-09-05 17:14:15 -04:00
7369adcaca Add Sm90LinCombPerColBias (#1774)
Co-authored-by: Jiayu Sun <jiayus@s4124-0071.nvidia.com>
2024-09-04 15:11:24 -04:00
6c3044136b Update barrier.h (#1782) 2024-09-04 14:52:11 -04:00
e1976daacc Add support for mixed 4-bit/8-bit data types GEMM (#1413)
* Add support for mixed 4-bit/8-bit data types GEMM

* fix ( and )

---------

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-08-29 23:11:06 -04:00
f7b19de32c minor fix for a double quote in CMakeLists.txt (#1727) 2024-08-19 22:21:42 -04:00
4dbf5dbed2 Use CUDA runtime API to retrieve function pointer to driver API (#1700)
* Query pfn to driver api

* use default for older toolkits

---------

Co-authored-by: shunfans <shunfans@nvidia.com>
2024-08-19 13:26:09 -04:00
f93a69134e Merge pull request #1714 from NVIDIA/u128_div
fix uint128
2024-08-16 07:14:59 -05:00
3f084f7f3c Add couple configs into generator.py for mixed input MM (#1350)
* Add couple configs into generator.py for mixed input MM

* change one unit test name; reenable 128x32 in the profiler

* Added U8/BF16 tests.

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-08-16 00:59:29 -04:00
b0296bf682 fix uint128 2024-08-15 21:06:01 -07:00
865be73a97 Merge pull request #1713 from NVIDIA/351_sparse_update
update 3.5.1 readme/changelog
2024-08-15 11:44:49 -05:00
8d8cfdf375 update 3.5.1 readme/changelog 2024-08-14 21:12:44 -07:00
eqy fb170439e8 Update half.h (#1709) 2024-08-14 14:59:59 -04:00
4e5a8f6853 3.5.1 plots and updated readme (#1708)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-12 18:55:55 -04:00
7192f4ab23 Add CLayout_64x208 (#1680)
Without this, I get a compilation error when the extended shapes are enabled
2024-08-08 14:00:24 -04:00
2049c6c5a2 5476 cutlass 3x gemm kernels (#1695)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-08 13:56:23 -04:00
e22ba590cd support data type w2 used in cutlass_library (#1517) 2024-08-06 11:15:18 -04:00
19b4c5e065 Fix isnan namespace qualification in cutlass/functional.h (#1679)
* Fix unrelated MSVC build warnings

* Fix use of isnan in functional.h

Correct namespace qualification of isnan in functional.h
so that it invokes cutlass::isnan for half_t, instead of
converting half_t to float and invoking std::isnan (on host,
or ::isnan on device).
2024-08-05 14:28:13 -04:00
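A hedged illustration of the pitfall being fixed (the real change is inside functional.h; this standalone snippet only shows why qualification matters):

```
// Sketch: with an unqualified call, overload resolution can pick
// std::isnan(float), silently converting half_t to float first.
// Qualifying the call selects the cutlass::half_t overload directly.
#include "cutlass/cutlass.h"
#include "cutlass/half.h"

CUTLASS_HOST_DEVICE
bool has_nan(cutlass::half_t x) {
  return cutlass::isnan(x);   // no half_t -> float round trip
}
```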
06b21349bc 1x1x1 cluster launch (#1673) 2024-08-01 12:20:28 -04:00
eee0cab26c Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665)
Adds 128x256 tile shapes to FP16/BF16 and FP8 generators.
Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators.

NOTE: it is important to set the kernel filter (--kernels /
CUTLASS_LIBRARY_KERNELS) to a non-empty string and skip pruning to get
all of the new configurations.

If profiling exhaustively, they can be set to `*`.

Number of CUTLASS 3.X GEMMs before this commit: 2868
Number of CUTLASS 3.X GEMMs after this commit: 4016

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2024-07-31 20:22:29 -04:00
36cbfcf483 Add extended wgmma shapes for all data types (#1666) 2024-07-31 18:33:14 -04:00
1f2b590da6 Skip void-C kernels in the profiler when beta is nonzero (#1661)
* Skip void-C kernels in the profiler when beta is nonzero

The CUTLASS profiler will only skip disposition for void-C kernels when beta
is nonzero, when it makes more sense to skip running them in the first
place.

Not all users are aware of void-C kernels (as far as I know they weren't a
thing in 2.X), and not everyone remembers to filter out void-C kernels
when running the profiler with a nonzero beta.

The easiest solution (and as far as I can tell the correct way of handling this)
is to have `can_implement` return `false` when beta is nonzero (or
whatever argument indicates an epilogue source) but we have a void-C
kernel.

Profiler already includes functionality to skip running kernels that
fail `can_implement`.

* Move checks to collectives instead

---------

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2024-07-31 18:11:58 -04:00
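The shape of the check described above, as a hedged sketch (names are illustrative, not the actual collective code):

```
// Sketch: reject a void-C kernel when the arguments request an epilogue
// source (beta != 0), so the profiler skips it via can_implement.
#include <type_traits>

template <class ElementC, class Arguments>
bool can_implement_sketch(Arguments const& args) {
  if (std::is_void_v<ElementC> && args.beta != 0) {
    return false;   // no C tensor to read, cannot honor nonzero beta
  }
  return true;
}
```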
8b2a0408bd Profiler docs and argument update for raster order (#1667) 2024-07-31 16:40:10 -04:00
eqy fbd116c0e5 fix build on SM 5.2 (#1664) 2024-07-31 09:54:57 -04:00
5b283c872c Add more GMMA shapes (#1630)
* Add more GMMA shapes

* Add more shapes for BF16
2024-07-29 19:09:51 -04:00
be60a0b272 CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1

* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
56b46e2d13 Fix grouped gemm invalid memory access to problem shapes (#1543) 2024-07-10 11:55:22 -04:00
52fb43f30f fix mbarrier invalidate (#1494) 2024-07-10 11:35:26 -04:00
843adf0408 Fix SMEM index for C in CuTe examples (#1477) 2024-07-10 11:14:15 -04:00
e48c7618e4 [bug] fix device thread gemm.h constructor (#1473) 2024-07-10 11:12:36 -04:00
c5239d8312 Add Faster Neighborhood Attention to pubs (#1471) 2024-07-10 11:09:13 -04:00
d6580c3dc0 Support use of external/system GTest installation (#1469)
* Support use of system/external GTest installation

* Create working directory for tests explicitly
2024-07-10 11:07:57 -04:00
81b06ee0e0 Fix B operand variable name and comments (#1458) 2024-07-10 11:06:29 -04:00
dbfced05e7 Fix typos in convolution tests (#1433) 2024-07-10 11:00:52 -04:00
2448bb56e6 Update gemm_api_3x.md (#1386)
Fixed what it seems to be an obvious typo.
2024-07-10 10:59:02 -04:00
637b159063 Fix C++17 version detection in helper_macros.hpp (#1479)
* It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus

* Fixed typo.

* Oops, another typo.

* Changed incorrect logic, ifndef to ifdef

* Define CUTLASS_CPLUSPLUS for language version testing

Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>

---------

Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>
2024-05-28 11:00:51 -04:00
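A sketch of the detection idiom this commit lands on, assuming the CUTLASS_CPLUSPLUS macro mentioned in the log (MSVC keeps __cplusplus at 199711L unless /Zc:__cplusplus is set, so _MSVC_LANG is the reliable probe there):

```
// Illustrative version of the helper_macros.hpp check.
#if defined(_MSVC_LANG)
#  define CUTLASS_CPLUSPLUS _MSVC_LANG
#else
#  define CUTLASS_CPLUSPLUS __cplusplus
#endif

#if CUTLASS_CPLUSPLUS < 201703L
#  error "CUTLASS requires C++17 or newer"
#endif
```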
033d9efd2d [Documentation] Fixes the confusion between concatenated vs. composed layout in CuTe documentation (#1498)
* Update 02_layout_algebra.md

* Update 02_layout_algebra.md
2024-05-02 15:35:12 -04:00
Sin acc3ee18a1 Fix typos in cute docs (#1486)
* fix typos in 02_layout_algebra.md

* fix typos in 03_tensor.md
2024-05-02 15:34:36 -04:00
5c447dd84f Update packed_stride.hpp to add CUTLASS_HOST_DEVICE decorator to new functions (#1495) 2024-04-19 12:07:57 -04:00
7d49e6c7e2 Updates for CUTLASS 3.5.0 (#1468) 2024-04-11 21:33:40 -04:00
a40e08e9d5 Update 02_layout_algebra.md (#1451)
change line 348 to reflect correct layout.
2024-04-10 10:57:57 -04:00
lzw 8e7d9f483d add missing header for size_t in numeric_types.h (#1420)
* add missing header for size_t in `numeric_types.h`

* make nvrtc happy

* add missing header for int types in `cutlass/arch/memory.h`

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-04-09 14:15:48 -04:00
19f3cc33f1 Fix uint128 operator add (#1400)
* fix uint128 operator add for 64-bit hilo implementation

* add uint128 test for operator add

* make clang happy

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-04-02 13:32:18 -04:00
f9ece1b42c Python Gemm tile_descriptions fix (#1439)
* fix python gemm tile descriptions

* fix formatting

* fix math_operation filtering

* fix formatting
2024-03-30 09:00:46 -04:00
28cbacbf64 fix stride compilation warning (#1415) 2024-03-29 23:50:33 -04:00
8f7d2789b8 [NFC] improve doc: fix typo in mma doc (#1417) 2024-03-27 14:07:20 -04:00
c4e3e122e2 group gemm set stride L = cute::Int<0> (#1416) 2024-03-20 17:31:14 -04:00
629f4653c3 CUTLASS 3.5.0 (#1411) 2024-03-19 17:51:04 -04:00
ffa34e7075 (NFC) improve doc: Add missing verb to sentence (#1377)
Co-authored-by: lorenzo chelini <lchelini@nvidia.com>
2024-03-04 15:30:10 -05:00
a8f2c80db0 fix tile_size(TiledCopy<Args...> const&) error (#1357) 2024-02-24 00:33:01 -05:00
bbe579a9e3 Updates for CUTLASS 3.4.1 (#1346)
* Updates for CUTLASS 3.4.1

* minor epi change
2024-02-15 15:48:34 -05:00
47a3ebbea9 Add a missing platform include (#1328) 2024-02-03 01:30:32 -05:00
57e01e1a6b Fix missing include file (#1318) 2024-02-03 01:29:32 -05:00
6e3df975a2 Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325) 2024-01-31 21:41:30 -05:00
8825fbf1ef fix unrecognized print format specifier for int8/uint8 (#1303)
* fix unrecognized print format specifier for int8/uint8

* use c++ static_cast instead of c cast style
2024-01-29 21:22:40 -05:00
092f14db05 fix tile_size_mnk compilation warning (#1294) 2024-01-29 21:21:15 -05:00
9385141f19 Update PUBLICATIONS.md
ptq paper from goog
2024-01-19 14:17:55 -05:00
b4b5b11070 Update PUBLICATIONS.md
add odyssey llm paper from metuan
2024-01-18 10:30:21 -05:00
139b93db61 update publications (#1308) 2024-01-17 14:06:46 -05:00
ca37d632c9 Remove sparse GEMM with row broadcasted bias vector (#1302)
This reverts commit d3e72719b4.

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
2024-01-17 14:06:27 -05:00
362abbf274 Support ElementD to be void for tma (#1153)
* Support void D with AuxStore

* refine get_element_aux
2024-01-16 18:15:42 -05:00
751eb9a885 Update license year (#1306) 2024-01-16 14:37:22 -05:00
2f589ffa76 Updates for 3.4 release. (#1305) 2024-01-16 13:42:51 -05:00
acba5beee5 Fix flops calculation and tensor b stride calculation in the example 36 (#1278)
* Fix flops calculation and tensor b stride calculation in the example 36

* Fix datatype

* Update gather_scatter_fusion.cu
2024-01-08 17:27:30 -05:00
74d1f3e63a Fix cute::array<T, 0> iterator (#1273) 2024-01-08 17:10:09 -05:00
8ac2edc810 expose stream API in python kernel call interfaces (#1287)
* expose stream API in python kernel call interfaces

* add stream to ReductionArguments; document stream arg

* add stream argument to GemmGroupedArguments
2024-01-05 08:27:45 -05:00
d4be5ab5d7 Allow per-column bias in EpilogueTensorBroadcast (#1275)
* Allow per-column bias in EpilogueTensorBroadcast

EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.

It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.

* Add unit test for EpilogueTensorBroadcast with per-col bias

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
2024-01-04 12:48:31 -05:00
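A minimal sketch of the "conditional stride" idea described above, using CuTe stride types; the alias name is made up for illustration:

```
// Sketch: pick the bias broadcast stride at compile time so one epilogue
// supports per-row ([1,0], the original default) and per-column ([0,1]) bias.
#include <type_traits>
#include <cute/layout.hpp>

template <bool PerColumnBias = false>
using BiasStrideSketch =
    std::conditional_t<PerColumnBias,
                       cute::Stride<cute::_0, cute::_1>,   // per-column bias
                       cute::Stride<cute::_1, cute::_0>>;  // per-row (default)
```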
c9591a694d fix typo (#1279) 2024-01-04 12:41:39 -05:00
5c756eb774 Add support for sparse GEMM with visitor epilogue (#1189)
* Add support for sparse GEMM with visitor epilogue

* Refactor changes at the kernel level
2024-01-04 12:38:11 -05:00
8236f30675 CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0

* Update CHANGELOG.md

---------

Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
b7508e3379 Fix inline ptx escaping for predicates. (#1264)
* Fix inline ptx escaping for predicates.

Prevents `error: invalid % escape in inline assembly string` when compiling with clang.

* More double-quoting.
2023-12-14 11:16:15 -05:00
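To illustrate the escaping rule the fix applies (a standalone example, not the patched CUTLASS code): in an inline-asm template, `%` introduces an operand, so literal PTX register names such as `%p0` must be written with a doubled percent:

```
// Sketch: "%%p0" survives template substitution in both nvcc and clang.
__device__ int select_sketch(int a, int b, int pred) {
  int d;
  asm("{\n"
      "  .reg .pred %%p0;\n"
      "  setp.ne.b32 %%p0, %3, 0;\n"
      "  selp.s32 %0, %1, %2, %%p0;\n"
      "}\n"
      : "=r"(d)
      : "r"(a), "r"(b), "r"(pred));
  return d;
}
```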
f60786b536 Remove undefined behavior from default constructor of PredicatedTileAccessIteratorParams. (#1258)
Currently, the default constructor of
`PredicatedTileAccessIteratorParams` will invoke undefined behavior in
its invocation of the `initialize` function. Specifically, it will
attempt to read from the uninitialized variables
`desc.element_size_bits` and `desc.advance_rank`. This commit changes
the default constructors of both `*Params` and `*Desc` to
zero-initialize all uninitialized members.
2023-12-11 23:01:53 -05:00
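The remedy in miniature, as a hedged sketch (field names borrowed from the commit message):

```
// Sketch: default member initializers make the default constructor safe;
// initialize() can no longer read indeterminate values.
struct DescSketch {
  int element_size_bits = 0;  // previously left uninitialized
  int advance_rank      = 0;  // previously left uninitialized
};
```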
30ec1a4649 Use size_t index to iterate up to std::vector::size() (#1251)
Fixes a different signedness compare warning.
2023-12-09 08:44:31 -05:00
e1483d5fa0 Collection of changes to fix clang build. (#1200)
* Remove unused variables

* Qualify calls to make_fragment_? from templated base class.

Fixes clang build error.

* Add missing `#include <cstdio>`

* Various changes to fix clang compile errors.

* More changes to fix clang build.

Remaining issues:

- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.

* Fix remaining clang build issues.

* Qualify `cute::rank()` calls.

* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.

* Double-escape special registers in inline asm.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
f4a0216601 Fix bug in single source GEMM with residual + streamk (#1249)
Followup to #1224.

A change in the stream-k threadblock swizzle ctor since 3.3 breaks
single source GEMM with fused epilogue and stream-k. Multi-source was
already corrected.

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-07 11:12:02 -05:00
f188f9b709 Fix typo in quickstart.md (#1257) 2023-12-07 09:49:52 -05:00
9c9b51d35c Update PUBLICATIONS.md 2023-12-07 00:02:36 -05:00
a75b4ac483 Fix Stream-K reduce bug in epilogue with broadcast (#1224)
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-05 15:35:41 -05:00
e9e30c2304 Updates and Bug fixes to CUTLASS 3.3 (#1232) 2023-12-05 09:50:49 -05:00
4a1709e17e Fixed illegal PTX syntax (#1225) 2023-12-01 12:29:48 -05:00
bef1fbcbe6 Add missing #include <cstdio> (#1197)
* Add missing `#include <cstdio>`

* move to non nvrtc part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-01 11:58:53 -05:00
2375a07d01 Qualify calls to make_fragment_? from templated base class. (#1196)
Fixes clang build error.
2023-12-01 09:52:57 -05:00
60c8251b72 Remove unused variables (#1195) 2023-12-01 09:52:19 -05:00
10b850f9c7 Fix some sign conversion warnings (#1172)
* Fix sign conversion warnings

* Fix type conversion warnings

* Fix sign conversion warnings

* Change smem_size_ to constexpr

* clang warnings

* undo cast change

* one miss change

* missing part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-11-30 00:28:40 -05:00
99c4eebe3b Explicitly cast blockIdx to uint3 (#1192)
This works around a clang issue where blockIdx is of a different type.
2023-11-30 00:26:23 -05:00
a759e85f5f Add subclass declarations to generated files. (#1193) 2023-11-30 00:25:40 -05:00
56fc3df03b Adding missing typename (#1191)
Fixes clang build failures.
2023-11-29 00:20:20 -05:00
eb01d5449d fix cp.async L2 prefetch typo (#1187) 2023-11-28 16:58:04 -05:00
8098336d51 Updates to Python interface for PyPI packaging (#1209)
* Updates

* Updates to notebooks
2023-11-28 13:52:12 -05:00
b5d8a5d9cc Allow SM90 pingpong kernel to use custom tile schedulers (#1194)
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2023-11-15 13:45:17 -05:00
6e60b9b17c enable L2::128B prefetch for cp.async by default (#1177) 2023-11-13 13:30:13 -05:00
1ab6cc7b68 Fix std::abs overloading for bfloat16_t (#1179) 2023-11-13 13:29:45 -05:00
5ae8133cfa Doc only change changelog 3.3 (#1180) 2023-11-13 13:29:22 -05:00
39c6a83f23 fix missing return warning (#1173) 2023-11-03 22:42:59 -04:00
1d7f2a207e Fix several broken links (#1168)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-03 00:01:25 -04:00
557be3ab0e Fix several typos (#1169)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
c008b4aea8 CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
922fb5108b clean the format (#1140) 2023-10-24 22:59:06 -04:00
7a7796afae Fix is_zero (#1147)
* Fix is_zero

* Use constexpr

* Add CUTLASS_PRAGMA_UNROLL to loops

* Avoid if branches in is_zero
2023-10-23 12:09:37 -04:00
fb10fa5308 Fix broken pipeline link in docs (#1143) 2023-10-18 12:55:46 -04:00
5e1a0a5adb fix alignmentC for h16816_s8xf16 (#1146)
* fix alignmentC for h16816_s8xf16

* manish's change
2023-10-17 15:15:39 -04:00
757275f279 Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132)
* Adding more tiles in the cutlass_library for mixed-input support.

* fix rebase issue

* more tiles to upcast a
2023-10-13 11:33:15 -04:00
fa8dfe631f fix missing return warning for repeat and axpby (#1124) 2023-10-12 00:05:45 -04:00
112590114d Add config.yml issue template with Discord link. (#1135) 2023-10-10 12:13:04 -04:00
ff02da2667 Fx parallel split-k (#1116) 2023-10-06 12:02:40 -04:00
4082fed85a Add missing int64 and uint64 overloads for conj (#1127)
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
2023-10-05 20:01:44 -04:00
5f13dcad78 set kIsHeavy member variables (#1012)
* set kIsHeavy member variables

* correct kIsHeavy value for Tanh

* set kIsHeavy=false for HardSwish

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-10-04 12:38:36 -04:00
61a38f83dc Add #include <limits> to platform.h (#1121)
Closes #1118
2023-10-02 21:41:25 -04:00
ff61a49dd1 Allow changing epsilon parameter in RMS norm kernel (#1112) 2023-10-02 20:40:28 -04:00
26986bbc60 Fix type typo in rmsnorm (#1119)
Initially the variable `h4` is `half4`, but its last two fields are not used. Based on the semantics and the context, I believe it should be `half2`.
2023-10-02 20:40:04 -04:00
7d8317a63e Support for Mixed Input TensorOp (#1084)
* Passing warp-level mixed input F16*(S8/U8) tests

* passing device-level mixed input F16*(S8/U8) tests

* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)

* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)

* Speedup reference compilation (REVERT THIS COMMIT)

* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)

* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs

* BF16 * S8 (142 TFLOPs)

* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16])

* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast

* Add device-level test and profiler support for upcast on operand A

* Move shfl before the cvt and reduce #shfls by 1/2

* fix smem_usage calculation for mixed_input types

* uncomment the stuff (getting ready for merge)

* profiler changes and mixed-input reference

* mixed input reference are in a new file

* use platform instead of std

* comments and typo only

* Use CreateGemmOperator and delete CreateMixedInputGemmOperator

* copyright for new files

* rebase follow-up
2023-09-27 11:18:30 -04:00
5cd735c48e Fix Parallel Split-K on Gemm Operation Profiler (#1109)
* Debug and fix for parallel split-k in profiler

* restore debug files and remove prints
2023-09-26 17:28:00 -04:00
67ae8e0603 Change the position of the minus sign in line 1549 of array.h (#1091)
When I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving operator "-" from line 1549 to line 1548 solves this error
2023-09-26 17:26:39 -04:00
14f69bddc8 [fix] fix comparison operator for integer_subbyte (#1090) 2023-09-26 17:26:12 -04:00
90d3b0fb18 CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
e0aaa3c3b3 fix GmmaDescriptor print format string error (#1102) 2023-09-19 23:27:58 -04:00
8783c41851 Replace 0x1f with 0xffffffff in __shfl_sync (#1097)
This fixes compatibility with H100 and resolves #1094
2023-09-18 19:58:19 -04:00
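The fix in a nutshell (standalone example): the first argument of `__shfl_sync` is a 32-bit lane mask, so "all lanes" is `0xffffffff`; `0x1f` looks like a lane count but only names lanes 0-4:

```
// Sketch: broadcast lane 0's value to every lane of the warp.
__device__ float warp_broadcast(float v) {
  return __shfl_sync(0xffffffffu, v, /*srcLane=*/0);
}
```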
6407bcdf0a fix matrix B indices (#1089) 2023-09-12 14:04:18 -04:00
a77b2c9cb8 style(examples): typo (#1080)
* Update ampere_tensorop_conv2dfprop.cu

learning CUTLASS; this PR fixes a typo.

* Update ampere_gemm_operand_reduction_fusion.cu
2023-09-11 10:13:22 -04:00
34bbadd3ff standarize fp8 generator (#1078) 2023-09-07 14:36:33 -04:00
88c0d7c726 make only visible on device (#1071) 2023-09-07 13:00:46 -04:00
e01b9b5029 Shard gemm reference templates into multiple TUs for parallel compilation (#1043)
* Split apart gemm reference templates into multiple TUs for parallel compilation

* remove old files

* better balancing of ref kernels across TUs

* remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size

* remove auto fp8 kernels

* remove some redundant kernels
2023-08-30 16:46:30 -04:00
34fd98056b fix cinttypes issue with STDC_FORMAT_MACROS (#1068)
* fix cinttypes issue with STDC_FORMAT_MACROS

* Update mma_sm90_desc.hpp

* Update mma_sm90_desc.hpp

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-08-29 14:59:33 -04:00
3a8f57a3c8 Add simple hash and eq methods for gemm_operations. (#1053) 2023-08-27 20:41:57 -04:00
6673df0e48 fix typos (#1059) 2023-08-27 00:49:26 -04:00
7618e9bfd8 Fix numeric conversion warning (#1021)
* fix numeric conversion unused var

* update

---------

Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>
2023-08-27 00:42:44 -04:00
a88c41cf8d Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
27de343535 Add one Publication which is inspired by cutlass (#1022) 2023-08-22 10:00:17 -04:00
2a9fa23e06 Avoid cute::print compiler warnings with -Wformat-security (#1041)
Fixes issue #1040.
2023-08-18 14:38:27 -04:00
2e56cfabee fix typo (#1047) 2023-08-18 14:08:26 -04:00
3930f709ce Fix typo in 0x_gemm_tutorial.md (#1035) 2023-08-17 10:52:20 -04:00
7e5ee8b7bf [doc] fix: fix typos in the comment (#1049) 2023-08-16 11:39:25 -04:00
2d9a557427 torch.bfloat16 support in cutlass python (#1037)
* torch.bfloat16 support in cutlass python

* Update datatypes.py
2023-08-16 11:38:53 -04:00
4575443d44 CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
a0d787b746 Fix one publication (#1019) 2023-07-28 11:40:17 -04:00
d20f3a9542 spelling (#1007)
logicial -> logical
2023-07-20 14:41:11 -04:00
8e85580859 fix layout bug (#1006) 2023-07-19 14:26:01 -04:00
146d314057 Update fMHA kernels (#992)
* Update fMHA kernels

Upstream recent changes to fMHA that we did in xFormers.
Previous version in CUTLASS: facebookresearch/xformers@b6be33a
Updating to: facebookresearch/xformers@55a4798

* minor changes

* make var work

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-07-12 22:30:46 -04:00
f679663224 Add RMS norm (#979) 2023-07-10 21:31:27 -04:00
e066ced33b fix epilogue iterator error (#995)
* fix epilogue iterator error

* fix epilogue iterator error

---------

Co-authored-by: maxiao <maxiao@cowarobot.com>
2023-07-10 21:30:31 -04:00
9b923dd4c4 fix minor typos (#984) 2023-07-05 09:23:01 -04:00
f6d42f2dd0 add library_dirs (#977) 2023-06-14 12:09:12 -04:00
473a67073e Fix Int8 and TF32 generator (#976) 2023-06-12 12:32:52 -04:00
87349d3496 Add grouped b2b GEMM (#970) 2023-06-05 17:16:57 -04:00
fde824af21 Update Hopper performance plot for CUTLASS 3.1 + CTK 12.1 (#967) 2023-06-01 14:52:40 -04:00
7dbf423763 Add conversion from ElementBias to ElementCompute (#961) 2023-05-26 23:08:36 -04:00
6f47420213 Update README.md 2023-05-24 12:40:31 -04:00
4638250469 Update CHANGELOG.md 2023-05-24 12:39:42 -04:00
7859fe322a Update PUBLICATIONS.md 2023-05-24 12:36:12 -04:00
d3e72719b4 Add support for sparse GEMM with row broadcasted bias vector (#951) 2023-05-24 10:25:05 -04:00
b4ab501767 Adds CUDA path for x86-64 (#957) 2023-05-24 10:21:25 -04:00
f079619f5e More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
13f413493a Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds Gemm Streamk with Fused Epilogue kernel level struct.
  * Mostly based on Gemm with Fused Epilogue,
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. Now we're getting an even larger MSE, but it's on both
basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition on source needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
6fbc0d3380 Update layout.md 2023-05-17 20:12:58 -04:00
b97404837e Adding 128x256 tile for 16b input datatype WGMMA gemm (#950) 2023-05-17 17:13:23 -04:00
e2953d47c5 Update gemm_api.md 2023-05-12 15:37:31 -04:00
wll 19c4a4815e replace division with multiplication in GELU (#942) 2023-05-12 10:57:18 -04:00
fcfbd23e26 Fix host compilation of cute::cast_smem_ptr_to_uint. (#940)
* Remove references to device-only intrinsics when compiling for host.

Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.

Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.

* Annotate cute::cast_smem_ptr_to_uint as device-only.

The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-10 00:06:54 -04:00
b250faccd3 Make operator() const-correct and add missing static functions. (#936)
* Make operator() const-correct and add missing static functions.

Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.

* minor changes

* format

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-09 16:33:01 -04:00
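A hedged sketch of the converter shape the commit standardizes on (types simplified; the real specializations are templated):

```
// Sketch: const-qualified operator() forwarding to a static convert(),
// so converters work on const objects and as stateless utilities alike.
struct ConverterSketch {
  using result_type = float;
  using source_type = int;

  static result_type convert(source_type const& s) {
    return static_cast<result_type>(s);
  }
  result_type operator()(source_type const& s) const {  // const-correct
    return convert(s);
  }
};
```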
24c8b7d8a2 Fix CuTe compilation with clang (#939)
- clang 1.14 complains about a missing function in a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2023-05-09 09:51:45 -04:00
7c04f95415 Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
6f8596ce3f Add missing #include directive to get access to cutlass::epilogue:🧵:ScaleType. (#925)
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue:🧵:ScaleType`
already available when it is `#include`d.
2023-04-28 20:02:41 -04:00
fe2f491dd7 Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927) 2023-04-28 13:23:23 -04:00
df02482f1d Add missing schedules argument in SM90 fp16 op generation (#920) 2023-04-26 16:44:49 -04:00
180c5629bf Add missing checks for NVRTC in CuTe (#921) 2023-04-25 12:52:43 -04:00
e36912f961 Fix for dangling references in the MHA example (#918) 2023-04-19 21:35:46 -04:00
9a83bd3381 CUTLASS 3.1 Python interface documentation (#917)
* Add 12.1 Dockerfile

* Add 3.1 docs
2023-04-18 15:11:35 -04:00
54bebe417d Fix some typos in CuTe tutorials (#912) 2023-04-17 16:00:51 -04:00
43cfbe0086 Allow L2 prefetch for clang compiler (#914) 2023-04-15 01:23:22 -04:00
4a68cf748e added support of b2b bmm (#849)
* added support of b2b bmm

* fixed arguments and params structures

* added batch_count argument

* removed SplitKSerial and added new test case with b2b bmm

* fixed support of kBatched and added new test case with batch stride

* added batch support for bias and scale

* make test

* small changes

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-14 23:20:02 -04:00
d572cc1aab CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
9b8166e3f0 fMHA: Add backward pass (#844)
* fMHA: Add backward pass

* Better checks for strides/alignments

* Remove fb-internal URL

* torch.Tensor.untyped_storage requires pytorch 2.0+

* minor changes

* make test

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-06 20:44:58 -04:00
e2d439ee7e Add tile_n=32 and tile_k=32 kernels in generator.py (#858) 2023-04-06 10:00:52 -04:00
0435979f59 Remove const from 3.x GemmUniversalAdapter::operator() (#905) 2023-04-03 20:30:51 -04:00
2ba1ef10be Increase max dynamic SMEM size in GemmSoftmax (#903) 2023-04-03 10:01:12 -04:00
0964bdb64c update gemm and conv2d cmdline --help output (#878) 2023-04-01 11:38:13 -04:00
ecbd24566c Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754)
* Enable shared memory intrinsics and ldmatrix PTX on Clang.

This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used is a significant latency regression on Clang.

* refine the macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-31 21:42:24 -04:00
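A hedged sketch of the kind of preprocessor gate involved (the macro name is hypothetical): Clang's CUDA mode defines `__clang__` and `__CUDA__`, so the check must accept Clang's device pass as well as NVCC's rather than keying on NVCC alone:

```cpp
// Accept both NVCC's device pass and Clang's CUDA device pass; both define
// __CUDA_ARCH__ while compiling device code.
#if defined(__CUDA_ARCH__) && \
    (defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)))
#  define SMEM_INTRINSICS_ACTIVATED 1
#else
#  define SMEM_INTRINSICS_ACTIVATED 0
#endif
```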
660a05f581 fix split_k_mode and add reduction kernel for f16 input/accum/output (#896) 2023-03-30 15:31:08 -04:00
bc36122c3f [layout] Fix AffineRank2ColumnMajor::packed() (#879)
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-29 11:59:48 -04:00
15d9d31f1f CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
1eef5c3cf1 add guards for __CUDA_ARCH__ >= 530 (#891)
* add guards for sm>=70

* drop guard to 530
2023-03-28 17:47:10 -04:00
87070b6d51 add a CUTLASS publication (#893)
* add bytetransformer

* update arxiv link

* re-order
2023-03-28 17:06:57 -04:00
77549ae6c8 Update PUBLICATIONS.md
msft moe paper
2023-03-25 21:17:05 -04:00
42290f5d1c Fix for dangling pointers (#885) 2023-03-25 01:15:14 -04:00
209faf7b94 remove spurious comma (#871) 2023-03-20 17:25:27 -04:00
6116706c96 Set batch_strides on Params::update (#883) 2023-03-20 17:07:47 -04:00
2670b973dd Fix sign-compare warning in reorder_array (#869)
`std::vector<T>::size_type` is an unsigned type, so let's iterate with an unsigned type as well.

Discovered while trying to enable PyTorch building without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
2023-03-20 17:07:24 -04:00
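A minimal sketch of the fix's shape (the function name is illustrative): index with the container's own unsigned `size_type` so the loop condition compares unsigned against unsigned:

```cpp
#include <vector>

void touch_all(std::vector<int>& v) {
  // With `int i`, the condition `i < v.size()` is a signed/unsigned
  // comparison and trips -Wsign-compare; size_type matches the container.
  for (std::vector<int>::size_type i = 0; i < v.size(); ++i) {
    v[i] += 1;
  }
}
```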
af332d4aa9 Add missing comma in cutlass/arch/mma_sm90.h (#862) 2023-03-14 12:04:28 -04:00
86cae03cea expose StoreT parameter for potential speedup (#838)
* expose StoreT parameter for potential speedup

* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-10 12:58:17 -05:00
29801e348a Hide streams and typeinfo from nvrtc (#853)
* Hide streams and typeinfo from nvrtc

* Use __CUDACC_RTC__ instead of CUDA_ARCH for guard
2023-03-09 23:24:47 -05:00
7e370c9637 Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
c4f6b8c6bc Updates for 3.0 (#857)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-03-09 15:27:40 -05:00
a68e2f95f0 Reduce verbosity in manifest.py (#845) 2023-03-07 11:53:01 -05:00
a31b43b3f3 Re-enable aarch64 support lost in 277bd6e537 (#846) 2023-03-02 11:17:21 -05:00
f396cdd15c ex24[gemm_grouped]: Allow to change layout/dtype (#841)
* ex24[gemm_grouped]: Allow to change layout/dtype

* Address suggestion from @jackkosaian

---------

Co-authored-by: danthe3rd <danthe3rd>
2023-03-01 07:13:51 -05:00
92ebbf1dc4 Fix typos (#839) 2023-02-27 11:17:58 -05:00
65688c2a87 streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
f303889ed9 fMHA: Sync FW with xFormers (#828)
* fMHA: Add support for bias+dropout in FW

* Remove 'getMaximumSharedMemoryPerBlockKb'

* fix comments

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-22 23:25:31 -05:00
9cdbe33570 Add fixed_channel and few_channel mode to int8 in generator (#829) 2023-02-21 21:15:39 -05:00
95f673ecf7 Update base_grouped.h (#832) 2023-02-21 14:48:30 -05:00
91b8de8d32 streamk fix (#830)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-20 11:03:16 -05:00
d8359c804b Changes to iterators to support s8 gemm with f16 outputs (#812)
* Changes to iterators to support s8 gemm with f16 outputs

* should work

---------

Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-16 18:37:51 -05:00
34bed24af3 Update helper.h
copyright banner
2023-02-16 16:50:04 -05:00
ZZK
a101ac283f Fix some typos (#791)
* fix typo

* fix a deadlink to code
2023-02-16 15:56:55 -05:00
9fb38ac048 fix alignmentC=8 for imma N=128 (#822)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-15 12:06:00 -05:00
8f5c242426 Update dual_gemm_common.h
fix the copyright of a new file.
2023-02-13 15:35:33 -05:00
3c995c7606 Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790)
* Fix MHA kernel

Summary:

ATT

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Extend DualGemm to support batched mode (#5)

Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.

* Decouple LayoutB0 and LayoutB1 in DualGemm

The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously.

In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.

* Remove comment as no longer relevant

* Revert Fix MHA kernel

---------

Co-authored-by: mikeiovine <mikeiovine@fb.com>
2023-02-13 15:27:13 -05:00
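The zero-stride broadcast mentioned above is worth a concrete illustration. A minimal sketch (the element type and function name are illustrative) of presenting a column vector as a column-major matrix whose every column aliases the same data:

```cpp
#include "cutlass/layout/matrix.h"
#include "cutlass/tensor_ref.h"

// Column-major offset(row, col) = row + ldm * col, so ldm == 0 makes every
// column read the same vector; this is the {B1.device_data(), 0} idiom from
// the commit text above.
cutlass::TensorRef<float, cutlass::layout::ColumnMajor>
make_broadcast_ref(float* b1_ptr) {
  return {b1_ptr, cutlass::layout::ColumnMajor(0)};
}
```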
ce8597dc14 Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
2e10404d26 xFormer updates to fMHA FW (#773)
* xFormer updates to fMHA FW

* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'

* Add missing files

* Remove xFormers specific code

* Update fused_multihead_attention_fixed_seqlen.cu

* rebase and solve conflicts

* remove white space

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-08 23:00:10 -05:00
5ff5209ed5 Add acc2smem in epilogue/threadblock/epilogue.h (#806) 2023-02-06 22:04:16 -05:00
5921043981 Re-enable all alignments for int accumulators (#807) 2023-02-06 22:01:15 -05:00
add4ba622f Fix 8.4 + CUDA 11.4 build (#789)
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.

Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).

Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.

CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND

Fixes https://github.com/NVIDIA/cutlass/issues/788

Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
2023-01-27 09:18:59 -05:00
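A hedged sketch of the work-around's shape (the logic is illustrative, not CuTe's actual `product`): keep the fold expression on healthy compilers, and fall back to a plain pack expansion when the macro is defined. The GCC 8.x issue arises when folds like this appear inside generic lambdas:

```cpp
template <class... Ts>
constexpr long product_sketch(Ts... ts) {
#if defined(CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND)
  // Fallback: expand the pack into an array initializer instead of folding.
  long result = 1;
  long expand[] = {1L, (result *= static_cast<long>(ts))...};
  (void)expand;
  return result;
#else
  // Binary left fold; GCC 8.x mishandles such folds inside generic lambdas.
  return (1L * ... * static_cast<long>(ts));
#endif
}
```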
277bd6e537 CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
66d9cddc83 New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
d49bef88f9 Enable aarch64 support (#779) 2023-01-20 15:51:58 -05:00
8b42e751c6 streamk paper link (#765)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 22:10:43 -05:00
eb7f99d3dd @hwu36 Adding the individual arXiv link for Stream-K paper. (#764)
* Stream-K individual paper entry.

* arXiv links updated.
2023-01-10 20:39:06 -05:00
764b840d6f streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
a1046d49c1 Adds missing semicolon (#759) 2023-01-09 21:50:46 -05:00
1cd994b4cf Update PUBLICATIONS.md
@neoblizz @dumerrill 

thesis covering streamk
2023-01-08 00:42:19 -05:00
7bdba07310 Add definitions for tag structs. (#752)
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
2023-01-06 09:46:52 -05:00
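A minimal illustration of why the change matters (the struct names here are hypothetical): `typeid` requires a complete type, so a forward declaration alone is not enough:

```cpp
#include <typeinfo>

struct OpClassTagDeclared;    // declaration only: typeid(OpClassTagDeclared) is ill-formed
struct OpClassTagDefined {};  // empty definition: now a complete type

char const* tag_name() {
  return typeid(OpClassTagDefined).name();  // OK once the type is complete
}
```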
c54ede3a9e Add const overloads for iterator functions. (#753)
This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.
2023-01-06 09:46:34 -05:00
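A minimal sketch of the missing overloads (the container name is hypothetical): without the `const` versions, `begin()`/`end()` cannot be called through the constant reference that GMock's container matchers take:

```cpp
template <typename T, int N>
struct ArraySketch {
  T storage[N];

  T*       begin()       { return storage; }
  T const* begin() const { return storage; }      // added const overload
  T*       end()         { return storage + N; }
  T const* end()   const { return storage + N; }  // added const overload
};
```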
ff6e733fe1 restore the old epilogue for everything except streamk (#749)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-04 11:02:55 -05:00
5989b7e1d7 Update PUBLICATIONS.md
Add coconet paper to the publication list.  @abhijangda
2023-01-04 09:18:38 -05:00
1e64f153b3 improve streamk load balance (#743)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-25 13:56:33 -05:00
78b30d3191 Update README.md 2022-12-21 11:58:19 -05:00
59de82688b Update README.md 2022-12-21 11:57:55 -05:00
b85865d1ad Add missing #include directives (#741)
This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone:

```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                               ^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
          ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                              ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                             ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                                               ~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
    OutputTileIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
  using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
        ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
    SharedLoadIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
      ^
```
2022-12-21 11:40:20 -05:00
3f2bb17722 minor changes (#730)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-10 14:44:53 -05:00
38193d76e3 Updates for stream-k (#728)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-12-08 23:48:10 -05:00
1d7772f218 Add missing #include directive (#727) 2022-12-08 18:58:31 -05:00
df81d847d7 Make Python interface work for non-SM80 targets (#726)
* Make Python interface work for non-SM80 targets

* Remove line in README
2022-12-07 21:53:33 -05:00
d6117ca362 Relax stream K gemm alignment constraints (#717)
* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

* Revert "Relax stream K gemm alignment constraints"

This reverts commit 31e80a250e.

* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-07 11:17:49 -05:00
9c0518608e Fix typos in conv problem sizes (#720)
* Fix typos in conv problem sizes

* Typos
2022-12-05 15:54:58 -05:00
9f1f37aa21 misc (#719)
* misc

* minor

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-05 12:07:20 -05:00
84213b0b8e fix: make arch.h self contained (#714) 2022-12-01 19:25:48 -05:00
8567b87d65 Update quickstart.md (#704)
* Update quickstart.md

* Update doxygen_mainpage.md

* Update doxygen_mainpage.md

* Update terminology.md
2022-11-29 21:43:03 -05:00
c975e2ccbb release 2.11 (#703) 2022-11-19 09:02:15 -05:00
3c90f6aea6 add #pragma once for header file in example 42 (#698) 2022-11-15 22:50:24 -05:00
06eb90cc0d Fix Identity and Sigmoid activation (#659)
* support Identity activation

* fix Sigmoid activation operator() with CUTLASS_HOST_DEVICE
2022-11-09 14:42:23 -05:00
168ea8b0e1 ensure singleton::get constructs its instance thread-safely (#658)
* ensure singleton::get constructs its instance thread-safely

* fix singleton return reference

Co-authored-by: xuweiqi <xuweiqi117@gmail.com>
2022-11-08 21:44:32 -05:00
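A minimal sketch of the two fixes above combined (the class name is illustrative): a function-local static gives thread-safe one-time construction (guaranteed since C++11), and returning a reference avoids handing back a copy:

```cpp
struct SingletonSketch {
  static SingletonSketch& get() {
    static SingletonSketch instance;  // initialized exactly once, thread-safe
    return instance;                  // reference, not a copy
  }

 private:
  SingletonSketch() = default;
};
```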
012c62c748 bug fixes and enhancements to gemm reductionK fusion (#682)
* add two missing files

* fix a bunch of bugs in the gemm-reduceK fusion and add a device interface

* small changes

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-11-03 11:07:50 -04:00
FZC
cc85b64cf6 fix typo (#677) 2022-11-01 14:07:33 -04:00
1b4e24470a Example 43 - DualGemm (#670)
* Ex50 wip

* IS_PROFILING mode

* MultiStage2 - but is slower

* Add SwiGLU

* Support SplitKSerial reduction
Support not storing D0/D1
Cleanup code

* Option to disable bias

* Renumber example

* Fix build

* Remove references to pb_size_0 / pb_size_1

* Add support for bf16 inputs with float accum

* small changes

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-10-26 14:04:42 -04:00
8c1bf9b784 Bump CUTLASS Python container version (#672)
* Update example 40 README

* Update CUTLASS Python README
2022-10-22 21:09:39 -04:00
7d0dd6706e Remove excessive includes from examples/41_multi_head_attention (#669)
The rationale behind this change is explained in #563
2022-10-21 22:23:15 -04:00
9b47403b2d Add missing CUTLASS_HOST_DEVICE (#671) 2022-10-21 22:20:38 -04:00
4db6a6140e ex42: Fused MHA imported from xFormers (#662)
* ex42: Fused MHA imported from xFormers

* Remove std:: references

* Support K>128 in the example

* Support causal option

* Support different head size for V, and different seqlength for KV

* Update FLOPS counter

* Remove bit_cast

* fix build: Replace M_LOG2E

* Add doc

* Revert "Remove bit_cast"

This reverts commit 9662fa86bb.

* Explicit casts to int32_t for windows build

Co-authored-by: danthe3rd <danthe3rd>
2022-10-17 10:49:33 -04:00
3bf95e90c2 Update labeler.yml 2022-10-13 08:03:28 -04:00
75fed7493e Update labeler.yml 2022-10-13 08:01:21 -04:00
98b73fc95d Update labeler.yml 2022-10-13 07:55:33 -04:00
4990e3686d Update labeler.yml 2022-10-13 07:52:38 -04:00
4b7365388c Update labeler.yml 2022-10-13 07:32:55 -04:00
0d8405588d Update labeler.yml 2022-10-12 15:32:38 -04:00
cb539dab78 Correct typos in comments (#639)
* Correct typos in comments

Correct comments in code on the type of generated distribution. Improve the Gaussian RNG to take advantage of the Box-Muller method.

* Inline Box Muller

Added inline function for the Box Muller algorithm and updated code comments to be more concise

* Update tensor_fill.h

* Update tensor_fill.h

* small changes to pass tests

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-09-30 22:51:30 -04:00
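For reference, the Box-Muller method mentioned above turns two independent uniform samples in (0, 1] into two independent standard Gaussians. A standalone sketch:

```cpp
#include <cmath>
#include <utility>

std::pair<double, double> box_muller(double u1, double u2) {
  constexpr double kTwoPi = 6.283185307179586;
  double r = std::sqrt(-2.0 * std::log(u1));  // radius from the first uniform
  double theta = kTwoPi * u2;                 // angle from the second uniform
  return {r * std::cos(theta), r * std::sin(theta)};
}
```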
dadc881a96 Bug fix for gemm broadcast (#650)
* gemm_universal_with_broadcast, +2 sources.

* Revert "gemm_universal_with_broadcast, +2 sources."

This reverts commit fb063251f2.

* gemm broadcast bug fix
2022-09-30 10:00:38 -04:00
f3eea3a4d7 Create labeler.yml 2022-09-29 15:08:44 -04:00
cd37e82492 change unused class member to local var (#646) 2022-09-28 23:52:35 -04:00
48a9ea223a Fix release version in the citation (#638) 2022-09-22 10:58:45 -04:00
7a458f00a6 fix(permute.h): incorrect comment in Tensor5DPermute20314 (#637)
* fix(permute.h): incorrect comment in `Tensor5DPermute20314`

* typo in usage in example 39
2022-09-22 09:21:13 -04:00
97bff52e8c add two missing files (#636)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-09-21 15:42:42 -04:00
9f2e3faa69 fix call of GELU_Taylor in LinearCombinationGeneric (#634) 2022-09-20 21:00:55 -04:00
a821280dc7 Gemm broadcast (#632)
* gemm_universal_with_broadcast, +2 sources.

* Revert "gemm_universal_with_broadcast, +2 sources."

This reverts commit fb063251f2.

* gemm_universal_with_broadcast separated version.

* Update copyright banner.

* update banner
2022-09-20 10:37:12 -04:00
f73374a1eb fix: comment typo in example 23 (#633) 2022-09-19 09:54:14 -04:00
faab7536fc add comment (#628) 2022-09-17 21:40:30 -04:00
fc9ebc645b CUTLASS 2.10 bug fixes and minor updates. (#626) 2022-09-15 16:20:33 -04:00
2cc2c7ba1f Add set_k_partition function (#624)
A member function set_k_partition is required for the instantiation of cutlass::gemm::kernel::Gemm, even though SplitKSerial is false.
2022-09-13 22:34:20 -04:00
50ceed7154 Minor README fix (#623)
* minor fix

* Minor fix
2022-09-12 22:40:25 -04:00
e773429f7e CUTLASS 2.10 updates (#622)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-12 21:26:30 -04:00
beae168f90 fix broken link (#620)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2022-09-06 16:32:44 -04:00
f29d8f7ca9 Include vector in base_grouped.h (#618) 2022-09-06 13:21:23 -04:00
b1d3f9b2fd upstream internal updates (#616)
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2022-09-04 23:05:09 -04:00
b72cbf957d CUTLASS 2.10 (#615)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-03 18:48:46 -04:00
ca23ff7924 Fixed typo in class name (#608) 2022-08-29 20:51:52 -04:00
1c3d400b14 Added value_type trait to complex to make it an easier drop-in replacement for std::complex. (#607) 2022-08-28 01:12:40 -04:00
abafbf2afd Missing comma in trmm header (#604) 2022-08-25 16:07:33 -04:00
536b20763e Fixed typo in profiler README (#603) 2022-08-24 21:55:13 -04:00
497b499d9d Add residual support for shmem staging iterator used in back-to-back GEMM fusion. This allows support of problem_size_0_n that is not a multiple of 32. (#590)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-08-15 11:19:24 -04:00
e66bfcb1f8 Fix for #596 (typo in example 03) (#597)
* [examples] Fix typos in SYRK and TRMM examples

* Fix typo in example 03
2022-08-09 09:58:36 -04:00
1617685a77 fix: fix types in example 06 (#587) 2022-07-29 12:46:06 -04:00
25ebf15d02 Ensure all arch::Mma specializations have ElementC set (#576)
Co-authored-by: danthe3rd <danthe3rd@users.noreply.github.com>
2022-07-22 23:53:03 -04:00
5d05808072 fix gather example (#574) 2022-07-19 16:18:17 -04:00
0b8cacd6f1 Remove redundant <fstream> includes (#563)
* Remove redundant <fstream> includes

* Fix fstream in examples/

* Fix <fstream> in test/

* Use consistent order for <fstream> (always after <iostream>)

* Remove an unneeded include in a file where std::ofstream usage is commented out

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-07-19 15:23:54 -04:00
e7a61c761a fix race condition when h < stride_h or w < stride_w (#562)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-07-12 16:37:08 -04:00
fb379eaa5b epilogue leaky relu support ScaleType (#564)
Co-authored-by: xuweiqi <xuweiqi117@gmail.com>
2022-07-11 17:30:55 -04:00
8a766804ad Fix doc in testbed_gemm_with_broadcast (#559) 2022-07-07 09:56:16 -04:00
1eb6355182 [activation] tanh (#550)
Co-authored-by: Bing Xu <bingxu@fb.com>
2022-07-02 08:00:45 -04:00
04a9777b87 Softmax (#546)
* add test layernorm g-mem version

* Delete include/configure directory

* Delete examples/test_layernorm directory

* Update gemm_with_softmax.h

* Update gemm_softmax.cu

* Update linear_combination.h

* Update fast_math.h

* remove redundant vars

Co-authored-by: yujia.zhai <yujia.zhai@bytedance.com>
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2022-07-02 01:19:18 -04:00
e45e773436 Update linear_combination_generic.h (#472)
add `skip_elementwise_` to support serial splitk in linear_combination_generic.h
2022-06-28 07:29:38 -04:00
dae6b6893b Update CHANGELOG.md 2022-06-27 23:30:49 -04:00
ba18ea9c32 Update README.md 2022-06-27 23:25:26 -04:00
9ab9110168 add leaky relu (#542)
Authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-06-26 10:07:50 -04:00
e5d4669f16 Update CHANGELOG.md (#543) 2022-06-25 13:23:49 -04:00
94f01f19d5 Add implicit gemm perf
plot from @manishucsd, presented in gtc'22 cutlass talk
2022-06-23 22:47:11 -04:00
fa56763c25 Fix occupancy calculation for grouped GEMM (#532) 2022-06-18 19:53:59 -04:00
25e26a6e51 fix bugs in linear_combination_generic.h: missing include of cutlass/epilogue/thread/scale_type.h (#531) 2022-06-17 23:35:14 -04:00
f248e9bdb4 Create CITATION.cff
Add initial CITATION.cff
2022-06-07 21:25:16 -04:00
dceefe4f64 Increment stride correctly in warp iterator. (#516)
Co-authored-by: peisun1115 <peis@google.com>
2022-06-06 12:33:36 -04:00
c3881d097e Fix a comment about LDSM layout. (#514)
Co-authored-by: peisun1115 <peis@google.com>
2022-06-04 23:04:00 -04:00
a29dfb1c63 Fix a bug to increment stride tile correctly (#503)
* Fix a bug to increment stride tile correctly

* Update regular_tile_access_iterator_tensor_op.h

Co-authored-by: peisun1115 <peis@google.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2022-06-03 22:54:52 -04:00
0abaac84ea [examples] Fix typos in SYRK and TRMM examples (#507) 2022-06-03 22:52:41 -04:00
858c735856 Update gather_scatter_fusion.cu
Correct the reference code in gather/scatter example to put bias add in the correct place.
2022-05-18 13:15:25 -04:00
d6f58b2d14 Update functionality.md 2022-05-11 09:34:24 -04:00
c4cf0dad82 Fix init-self compiler warnings (#493)
Fix a few bugs caused by trying to initialize a class member
with itself. These can turn into errors if you compile
with `-Winit-self`.
2022-05-11 00:35:28 -04:00
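A minimal illustration of the bug class (the names here are hypothetical): the member ends up initialized from its own indeterminate value, which `-Winit-self` flags:

```cpp
struct ParamsSketch {
  int value_;

  // Correct: initialize the member from the constructor parameter.
  explicit ParamsSketch(int value) : value_(value) {}

  // Buggy form this commit removes (sketch):
  //   explicit ParamsSketch(int value) : value_(value_) {}  // self-init
};
```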
57551902d0 Update functionality.md
add some explanations to the functionality table.
2022-05-11 00:01:19 -04:00
1604ebaf10 Update generator.py
stop generating analytical conv kernels to reduce kernel number
2022-05-08 21:47:15 -04:00
6023038bae add verification of the reduction tensor (#489)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-05-06 10:24:51 -07:00
ddd8f9cf41 update float < int32_t * 4 (#488)
Co-authored-by: 赵俊涛 <zhaojuntao@zhaojuntaos-MacBook-Pro.local>
2022-05-04 13:36:05 -04:00
ec2b4fd85d b2b bias vector support (#482)
* b2b bias vector support

* add files

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-04-30 04:16:15 -07:00
86ce09aed1 2.9 fixes for nvrtc (#480)
* Use platform::is_same instead of std::is_same

* Don't hide cuComplex include from nvrtc

* Typo fixed

* Remove comment rename
2022-04-29 09:06:52 -04:00
21c1fa3849 add .github (#479)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-04-28 12:36:59 -07:00
8c339ac039 Fix compilation in clang (#478)
- adds missing commas
- adjusts misaligned usage of CUTLASS_DEVICE between
  template declaration and specializations

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-04-28 14:22:06 -04:00
e49f690fd7 Update linear_combination_generic.h 2022-04-28 14:04:53 -04:00
96dad61a75 Update CHANGELOG.md 2022-04-28 10:52:10 -04:00
cc2ea4c3fc Update README.md 2022-04-28 10:50:11 -04:00
a0de301283 Used relative paths for includes (#477) 2022-04-27 12:04:23 -07:00
319a389f42 Update CMakeLists.txt (#473)
* Update CMakeLists.txt

Add 128bit int support if using nvc++ to solve #310 

@jeffhammond, would you please give it a try?

* Update CMakeLists.txt

correct copy paste error
2022-04-27 07:02:26 -07:00
71def2f084 Use platform:: instead of std::abs and std::conditional (#452)
* Fixed template struct/class mismatch

* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation

* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation

* Revert absolute_value() usage
2022-04-25 14:40:22 -04:00
70f3ba57f5 Fix typo in shared memory layout description (#471) 2022-04-24 18:32:13 -04:00
dd77fadc70 Remove redundant offset def and init in shared_load_iterator.h (#456)
Signed-off-by: Fujun Han <fujun.han@iluvatar.ai>
2022-04-24 16:31:00 -04:00
be4578d517 Fixed template struct/class mismatch (#453) 2022-04-24 16:30:21 -04:00
d7b499deff Fix CUDA_PERROR_EXIT and print failing expression (#446)
`CUDA_PERROR_EXIT` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` expression. Consider:

```
if (condition)
    CUDA_PERROR_EXIT(cudaFree(x))
else
    free(x);
```

The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. If the author does use a semicolon, the code will not compile, which is awkward.

The change adds a `do while` around the `if`, which always requires a semicolon.

This PR also adds the text of the failing expression to the printed error message.
2022-04-24 16:29:43 -04:00
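A sketch of the resulting macro shape (the exact CUTLASS definition may differ): the `do { ... } while (0)` wrapper makes the macro a single statement that demands a trailing semicolon, and `#expr` stringizes the failing expression for the error message:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_PERROR_EXIT_SKETCH(expr)                          \
  do {                                                         \
    cudaError_t error_ = (expr);                               \
    if (error_ != cudaSuccess) {                               \
      std::fprintf(stderr, "'%s' failed: %s\n", #expr,         \
                   cudaGetErrorString(error_));                \
      std::exit(EXIT_FAILURE);                                 \
    }                                                          \
  } while (0)
```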
310ed81ac3 fix description in example 12. (#444)
Co-authored-by: Exusial <Exusial>
2022-04-24 16:29:06 -04:00
4c0d6e1eb4 [BUGFIX]: Force-unrolling a loop whose trip count is not a compile-time constant is dangerous (#441)

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2022-04-24 16:28:32 -04:00
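A sketch of the safe pattern implied above (names and the bound are illustrative): unroll over a compile-time constant bound and guard the variable trip count at runtime, instead of forcing unrolling of a variable-length loop:

```cpp
__device__ float sum_first_n(float const* x, int n) {
  constexpr int kMaxIters = 8;  // compile-time bound: safe to unroll
  float sum = 0.0f;
#pragma unroll
  for (int i = 0; i < kMaxIters; ++i) {
    if (i < n) {  // runtime guard replaces the variable bound
      sum += x[i];
    }
  }
  return sum;
}
```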
167ac54c65 Fix link to Python example (#469) 2022-04-23 15:37:38 -04:00
12f4108ac2 CUTLASS 2.9 (#468) 2022-04-23 15:02:38 -04:00
dd571f0edb [style] fix code indentation (#449)
* [docs] fix typo in media/docs/layout.md

* [docs] fix comment error

* fix typo in include/cutlass/arch/simd_61.h

* fix stride comment errors in TensorLayout

* fix indentation
2022-04-03 21:13:17 -04:00
6d0d265047 Update PUBLICATIONS.md (#447) 2022-04-03 21:03:28 -04:00
f11fa975a5 Update PUBLICATIONS.md
@tsuki
2022-03-23 21:04:43 -04:00
0e71d9b450 Transposed conv2d and wgrad split k examples (#413)
* add split k wgrad example

* wgrad done

* begin transposed conv2d example

* update transposed conv2d example and add ref check

* update doc for conv2d transpose example

* add license

* add wgrad doc

* more clarification on GEMM output type

* typo fix

* clean up indent

* address comments

* rename example numbers to 34 and 35

* GEMM -> Implicit GEMM

* Revert "rename example numbers to 34 and 35"

This reverts commit 551a808c22.

* transposed_conv2d is 34

* add compiler and device version check to exit gracefully

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-03-23 14:52:54 -04:00
eb0d4c9213 [library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412)
Otherwise GemmUniversalOperation::get_host_workspace_size() will fail with a segmentation fault.
2022-03-22 12:36:34 -04:00
bc45e2c023 fixed datatype error of numeric_limits for uint1b_t (#419)
Co-authored-by: Haojin Yang <haojin.yang@.hpi.uni-potsdam.de>
2022-03-22 12:30:30 -04:00
095cbba57c Example 23 - Passing correct alpha and beta values with --parallel-split-k (#424)
When split-k is enabled, we should set alpha to 1 and beta to 0 for the
split-k gemm kernel.

The fix was from hwu36. I only fixed some minor typos along with his fix.
2022-03-22 12:27:34 -04:00
8f1fe7a132 Fix separate compilation -dc (#433)
* Fix separate compilation `-dc`

- when cutlass is included in multiple compilation units
  compiled with `-dc` OOB_NAN_F16x8 device constant is
  instantiated multiple times causing
  Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error
  This PR makes this variable a local constant as it is not
  modified during runtime

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Fix

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Revert test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-03-22 12:21:18 -04:00
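A sketch of the linkage issue and the fix (0x7eff7eff packs two half-precision NaNs; the real constant is cutlass::arch::OOB_NAN_F16x8 per the commit text, and the exact CUTLASS change may differ): a namespace-scope `__device__` variable is emitted in every translation unit and collides at device-link time under `-dc`, while a function-local constant produces no external symbol:

```cpp
#include <cstdint>

// Before (sketch): external-linkage symbol, one definition per -dc unit.
//   __device__ std::uint32_t OOB_NAN_F16x2 = 0x7eff7effu;  // collides at link
//
// After (sketch): a local constant leaves no symbol for the linker to merge.
__device__ inline std::uint32_t oob_nan_f16x2() {
  std::uint32_t const kOobNanF16x2 = 0x7eff7effu;  // two f16 NaNs
  return kOobNanF16x2;
}
```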
3ab1eacf09 Fix typo in profiler examples (#437) 2022-03-21 12:00:13 -04:00
cd39c75e25 Fix typo in docs, code comments (#429)
* [docs] fix typo in media/docs/layout.md

* [docs] fix comment error

* fix typo in include/cutlass/arch/simd_61.h

* fix stride comment errors in TensorLayout
2022-03-15 21:54:36 -04:00
b2e1e97cb1 Update PUBLICATIONS.md
ACM Trans on Graphics from nv research.
2022-03-01 22:37:18 -05:00
96a11a1ef3 Removed trivial copy constructors on parameter classes to enable devi… (#366)
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels

* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code

* std => platform

* fix affine2

* really fix affine2

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-28 21:34:02 -05:00
e96f00586c Make cutlass::gemm::device::GemmArray usable (#295)
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray

* Add a reference to GemmArray to the docs

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-02-17 20:01:05 -05:00
3cfa5db2a2 Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407)
* Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

As title

* Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

change the missing one

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2022-02-16 09:53:21 -05:00
1db6971a8d Remove unused gemm_k_iterations in GemmKernel::Params (#406)
Otherwise we get "gemm_k_iterations is uninitialized" warnings.
2022-02-16 09:52:45 -05:00
b954127297 Update PUBLICATIONS.md
@jackkosaian
2022-02-14 16:54:32 -05:00
d0d941efc7 [hardswish] correct implementation (#403)
* [hardswish] correct implementation

* seems working

* hardswish fp32/fp16x2 optimization

* [relu] half2 support

* add relu0; add multiply_add_relu0;

* cleanup

Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-09 14:28:53 -05:00
8a951b2940 Enable convolution with fused epilogue for Volta Tensor Cores (#402)
* Enabled convolution with epilogue fusion for Volta Tensor Cores.

* Compilation fixes

* Disabled testing Volta on Ampere architectures.
2022-01-30 23:24:50 -05:00
1e4703cbab Support parallel split K mode for profiling (#277)
* Support parallel split K mode for profiling

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* Parallel Split K support

  1. find gemm kernel by preference key
  2. switch m and n for the reduction kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* parallel splitk for fp16 gemm

* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-01-27 10:37:37 -05:00
c3353add63 Merge pull request #388 from depaulmillz/fix/headersonly
Fix utils include not being installed in header only
2022-01-26 14:22:51 -06:00
ac8825b941 Minor fix to change from LIBRARY_INIT to LIBRARY 2022-01-26 15:17:46 -05:00
8fd94806e5 Update PUBLICATIONS.md
add mlsys 2022 paper.
2022-01-17 00:08:18 -05:00
d7c9cbf0b9 Fix typo in scripts/library.py (wrong data size for u8) (#393) 2022-01-07 13:29:56 -05:00
c2ee13a0fe Add epilogue functor for residual block fusion (#391)
* Add epilogue functor for residual block fusion

* Do not run split-k tests when ActivationOp is not Identity

* explain TestSplitK param

* return early
2021-12-29 22:53:40 -05:00
f78994bb40 add the missing pieces (#392)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-25 04:29:54 -08:00
dceabd4c5a Support half precision sigmoid activation (#378)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* move the math to fast_math.h

* fixed compile

* .raw() -> .to_half()

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-22 14:45:06 -05:00
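The vectorized variant rests on a standard identity: sigmoid(x) = 0.5 * (1 + tanh(x / 2)), so a fast `tanh` immediately yields a fast sigmoid. A scalar sketch of the math:

```cpp
#include <cmath>

float sigmoid_via_tanh(float x) {
  // Equivalent to 1 / (1 + exp(-x)); lets a fast tanh do the heavy lifting.
  return 0.5f * (1.0f + std::tanh(0.5f * x));
}
```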
86fa1dc30b Fix utils include not being installed in header only 2021-12-21 12:10:26 -05:00
288af365db Added missing synchronization to avoid WAR hazards between tiles. (#386) 2021-12-20 08:34:08 -08:00
0dc3ba60b3 Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* refactored sigmoid using the new interface

* refactored gelu

* add silu activation

* add hardswish

* remove sigmoid for now

* add description to silu and hardswish, and other doc update

* Do not ignore Round

* use constant N

* Set isHeavy = true in sigmoid and silu epilogue
2021-12-18 14:58:15 -05:00
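For reference, the two newly added activations reduce to short closed forms (scalar sketches; the epilogue versions operate elementwise on fragments):

```cpp
#include <algorithm>
#include <cmath>

float silu(float x) {
  return x / (1.0f + std::exp(-x));  // SiLU (swish): x * sigmoid(x)
}

float hardswish(float x) {
  return x * std::min(std::max(x + 3.0f, 0.0f), 6.0f) / 6.0f;  // x * relu6(x+3)/6
}
```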
ec4f7e5194 Updates to fused epilogue (#383)
* Enhancements and fixes to fused GEMM and Convolution epilogue.
* Need to explicitly list cudart as unit test library dependency.
2021-12-17 16:04:43 -05:00
4e666e1dfd Updated README and added issue templates. (#382) 2021-12-17 09:26:20 -05:00
3799e12f25 Merge pull request #381 from Peter9606/update-makefile-version
Update project version to 2.8.0 in CMakeLists.txt
2021-12-16 21:54:57 -05:00
fc3bc85db8 Update project version to 2.8.0 in CMakeLists.txt
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-12-17 02:23:31 +00:00
49c0a58d50 Set theme jekyll-theme-minimal 2021-12-15 14:51:24 -05:00
5fe09c2d67 Updated GEMM performance plot with CUTLASS 2.8 compiled with CUDA 11.5 Toolkit (#375)
Updated GEMM performance plot with CUTLASS 2.8 compiled using CUDA 11.5 Toolkit.

GPUs under test:

    NVIDIA A100
    NVIDIA A2
    NVIDIA TitanV
    NVIDIA GeForce 2080 Ti
2021-12-06 14:21:33 -05:00
6b69c79ac3 Fixed contributor formatting. (#365) 2021-11-22 11:30:53 -08:00
62e438f450 Listed Matthew Nicely as the CUTLASS product manager. (#364) 2021-11-19 17:51:21 -08:00
808c25337a CUTLASS 2.8 (#363)
CUTLASS 2.8
2021-11-19 13:26:35 -08:00
6fc5008803 Update quickstart.md
fix a broken link
2021-11-11 09:53:46 -05:00
a3bcc6981d Merge pull request #331 from reed-lau/feature/fix-wmma-shape-typo
fix wmma shape typo
2021-09-28 10:20:29 -04:00
3b28642801 fix wmma shape typo 2021-09-28 19:04:09 +08:00
538592dea4 example 23 gemm operand reduction fusion (#325) 2021-09-20 13:34:47 -07:00
2e07c4cc2f CUTLASS 2.7 (#318)
CUTLASS 2.7

Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
Thank you for filing these issues!

authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
2021-09-20 11:02:22 -07:00
9ac255863f Merge pull request #246 from mengchihe/master
support unaligned input for conv2d fprop stage=2; fix for issue #242
2021-09-08 11:40:53 -04:00
59e2aa505a refine the implementation 2021-09-08 13:14:08 +00:00
4e8af93da1 Merge remote-tracking branch 'origin/master' into small_alignment 2021-09-07 20:39:38 +00:00
6c2f8f2fb8 CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning
* cutlass 2.6 update

* remove debug prints

* cutlass 2.6.1 (minor update)

* Updated CHANGELOG.

* Minor edit to readme to indicate patch version.

* Minor edit to readme.

Co-authored-by:  Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-03 10:26:15 -07:00
598e35401c Merge remote-tracking branch 'origin/master' into small_alignment 2021-08-16 07:49:08 -07:00
a01feb93d9 Merge pull request #308 from dongxiao92/patch-1
fix typo in doc
2021-08-08 11:54:42 -07:00
d36f331b44 fix typo in doc
fix typo
2021-08-08 16:44:22 +08:00
69abafb85a Merge pull request #306 from NVIDIA/fix-profiler-cmd-doc
Fix profiler cmd doc
2021-07-30 14:36:54 -04:00
68a078fbbf cleanup 2021-07-30 11:27:21 -07:00
10709dbb64 clean profiler cmd and doc 2021-07-30 11:02:17 -07:00
1227351079 Merge pull request #305 from NVIDIA/fix_epilogue_spill
fix epilogue register spill
2021-07-29 14:30:11 -07:00
a77c658439 fix epilogue register spill 2021-07-29 14:25:48 -07:00
4516b833ce Merge pull request #303 from Peter9606/doc_typo
Doc typo
2021-07-28 20:49:06 -04:00
64dd1e1915 Doc typo
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-07-29 08:45:59 +08:00
1ac4559d12 Cutlass 2.6 Update 1 (#301)
* cutlass 2.6 update

* remove debug prints
2021-07-27 17:58:30 -07:00
e5d51840e8 CUTLASS 2.6 (#298)
CUTLASS 2.6
2021-07-23 00:40:53 -04:00
6c29fe20ba Merge pull request #285 from tjingrant/patch-1
Typo Fixes
2021-07-05 22:51:19 -04:00
e3c56b0d6b Update predicated_tile_iterator.h 2021-07-05 12:11:53 -04:00
4647c57243 Update predicated_tile_iterator.h 2021-07-05 12:06:41 -04:00
856d4db3fb Update basic_gemm.cu
fix the matrix malloc size
2021-06-15 09:08:36 -04:00
6a1064093f Merge pull request #274 from mani-ananth/master
Some pending Bug fixes
2021-06-02 13:17:39 -04:00
c5f1ef4dff update contributors 2021-06-02 10:11:42 -07:00
47ebfccbec bug fixes 2021-06-02 10:08:25 -07:00
ad9486684f Merge pull request #272 from BernardoCovas/master
Bug in reference conv3d
2021-05-28 17:18:27 -04:00
1d8372a8e2 fix typo in reference conv3d 2021-05-28 21:06:59 +01:00
9cb7d63424 Merge pull request #266 from mani-ananth/master
Fixes for public issue #265
2021-05-19 15:15:22 -04:00
da2f110906 Fixes for public issue #265 2021-05-19 10:16:52 -07:00
b68113f5be Merge pull request #264 from zheng95z/patch-3
Adds `NoBetaScaling` for `LinearCombination`
2021-05-17 10:03:30 -04:00
a68d7cd6f1 Adds NoBetaScaling for LinearCombination 2021-05-12 22:23:55 +08:00
38e8b29f56 Merge pull request #259 from hzfan/ignore_pr
Add gitignore
2021-05-10 20:07:53 -04:00
ee7349c94f fix 2021-05-10 16:39:04 +08:00
8cdd4293d4 add gitignore 2021-05-10 16:37:59 +08:00
f58b843951 Merge pull request #239 from KeDengMS/kedeng/gelu
Fixes to Gelu for half and fusion
2021-05-08 12:51:42 -04:00
5fc142296f Merge pull request #237 from Peter9606/issue_236_typo
Typo fix issue#236
2021-05-08 07:51:19 -04:00
233d69aa6d Merge pull request #235 from Peter9606/issue_233_tranpose_update
tranpose.h update based on issue#233
2021-05-07 07:14:30 -04:00
9840d25269 Merge pull request #256 from zheng95z/patch-2
Fixes some typos in utilities.md
2021-05-06 11:02:49 -04:00
b878c96421 Fixes some typos in utilities.md 2021-05-06 22:37:37 +08:00
8f8a80cad5 Merge pull request #251 from zheng95z/patch-1
add a missing 'device_memory::' before a function
2021-04-25 22:09:44 -04:00
a8f6f8eb07 add a missing 'device_memory::' before a function 2021-04-25 20:05:39 +08:00
f4b0a33633 add unit test for non int4 load 2021-04-23 14:33:46 +08:00
7c783adf53 Merge pull request #247 from xue-fc/patch-1
fix a wrong description
2021-04-22 09:27:40 -04:00
4000df9567 fix a wrong description 2021-04-22 20:28:28 +08:00
bb35a3ba6f support setting load granularity for conv2d fprop 2021-04-22 15:20:57 +08:00
7ec3a87f22 support unaligned input for conv2d fprop stage=2; fix for issue #242 2021-04-21 14:40:05 +08:00
0b74c8f473 Address CR 2021-04-19 23:36:06 +00:00
83036ed646 More clean up 2021-04-18 04:29:20 +00:00
b7e43f5eb9 Clean up 2021-04-18 04:24:25 +00:00
5c62d892fa Add test 2021-04-18 04:09:34 +00:00
41a31b404b Fixes to Gelu for half and fusion 2021-04-17 22:10:19 +00:00
7320aee17d Typo fix issue#236
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-15 15:08:35 +08:00
2142a05d9d tranpose.h update based on issue#233
1. Add a '#pragma once' preprocessor directive
2. Replace prmt PTX with the __byte_perm intrinsic

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-14 19:58:00 +08:00
c77a524459 Merge pull request #230 from mani-ananth/master
Fix for issue #221
2021-04-09 14:45:55 -04:00
fac6680f31 Merge branch 'master' of github.com:NVIDIA/cutlass 2021-04-09 11:36:31 -07:00
08993707da fixing functional bug in fused epilogue 2021-04-09 11:36:03 -07:00
c805593ebe Merge pull request #228 from mani-ananth/master
Fix for issue#224 and issue#225
2021-04-08 10:08:13 -04:00
26556d7206 fix a broken sparse gemm example. found by the community. 2021-04-07 13:32:55 -07:00
4839b6cb61 add 2stage fprop 3d into default file 2021-04-07 13:29:32 -07:00
d97214987a Merge pull request #220 from Peter9606/wrong-stride-array-definition
Bugfix: typo; make reduction device cases pass
2021-04-02 08:43:52 -04:00
b0bbc6d548 Merge pull request #219 from mani-ananth/master
Fix for issue #211
2021-04-02 08:42:09 -04:00
7074047a54 Bugfix: typo; make reduction device cases pass
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-02 09:35:23 +08:00
75a4737cfe Fix for public issue #211
- Add a slice-K tile size to the profiler
- fix num warps calculations in implicit gemm header
2021-04-01 14:42:00 -07:00
8a3e4b8d02 Merge pull request #214 from Peter9606/separate-stream-error
Bugfix: memsetAsync uses wrong default stream
2021-03-24 12:09:01 -04:00
6a6b4028bd Revert wrong fix of params.update in GemmUniversalBase
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-23 23:20:40 +08:00
92393b2676 Bugfix: memsetAsync uses wrong default stream
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-23 21:11:42 +08:00
50bf00e5f2 Merge pull request #193 from Peter9606/public_shape_type_from_Mma_HFMA2
HFMA2 Convolutions for SM60 onwards
2021-03-05 21:38:59 -05:00
4cd004ead1 fix test name to 'optimized' and instantiate large tile sizes to speed up unit tests 2021-03-05 13:32:36 -08:00
6c4539e372 Make arch tag of test cases more precisely target SM60
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-05 10:53:26 +08:00
a3639ab1a0 Append fp16 test case to verify Mma_HFMA2
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-04 18:17:57 +08:00
169181f30f Make Shape public from Mma_HFMA2.
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-04 11:05:16 +08:00
0f1056390d Create PUBLICATIONS.md (#189) 2021-03-03 11:17:40 -08:00
34a42e5620 Update generator.py (#192) 2021-03-02 12:21:48 -08:00
8f09b82b12 Merge pull request #187 from NVIDIA/cutlass_2.5
CUTLASS 2.5.0
2021-02-26 23:56:04 -06:00
200a5a5146 Enabled reduction unit tests. 2021-02-26 15:46:57 -05:00
746b7b3247 Enabled tensor reduction kernels. 2021-02-26 15:32:19 -05:00
abdf16a4d9 Updated release notes. 2021-02-26 13:55:04 -05:00
0e13748649 CUTLASS 2.5 2021-02-26 09:58:26 -05:00
ccb697bac7 cutlass 2.4 documentation only update 2020-11-23 06:59:45 -06:00
e6bcdc60cf fix broken links (#148) 2020-11-19 21:46:54 -08:00
6615010cd0 CUTLASS 2.4 (Implicit GEMM convolution) (#147)
CUTLASS 2.4 (Implicit GEMM Convolution)

Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2020-11-19 21:25:25 -08:00
c2b80ad4e4 Merge pull request #135 from NVIDIA/cutlass_2.3_final
CUTLASS 2.3.0
2020-09-25 13:25:26 -05:00
37a8f9e598 CUTLASS 2.3.0 final. 2020-09-25 10:34:46 -07:00
c53f3339bb CUTLASS 2.3 initial commit (#134)
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.
2020-09-23 14:00:58 -07:00
4dac7490e6 Typos (#107)
* Update splitk_gemm.cu

* Update gemm_bias_relu.cu

* Update mma_sm75.h
2020-07-13 14:25:52 -07:00
fd7e058d0c Added examples to enable the unity build (#102)
* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
2020-06-17 07:09:18 -07:00
1ab1027954 Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100)
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
2020-06-15 10:47:01 -07:00
86931fef85 CUTLASS 2.2 (#96)
Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
2020-06-08 16:17:35 -07:00
e33d90b361 update tools/library/CMakeLists to require python 3.6 according to #70 (#82)
#70 only updates the documentation. This commit reflects the Python version bump in the CMake configuration as well.
2020-04-08 10:54:36 -07:00
96dab34ad9 CUTLASS 2.1 (#83)
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
2020-04-07 13:51:25 -07:00
7c0cd26d13 Need Python 3.6 to use enum.auto() (#70) 2019-11-22 09:39:12 -08:00
45ecbc885b Removed redundant conjugation operations from matrix_traits. (#65) 2019-11-20 11:27:13 -08:00
8aca98f9a7 Improved formatting, clarity, and content of several documents. (#64)
* Improved formatting, clarity, and content of several documents.
2019-11-20 10:42:15 -08:00
f4d9c8f755 Clang GPU compilation requires explicit CUDACC version flags (#63) 2019-11-20 09:52:11 -08:00
fb335f6a5f CUTLASS 2.0 (#62)
CUTLASS 2.0

Substantially refactored for

- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
2019-11-19 16:55:34 -08:00
b5cab177a9 Performance enhancement for Volta Tensor Cores TN layout (#53)
* Fixed performance defect with indirect access to pointer array for Volta TensorCores TN arrangement.

* Updated patch version and changelog.

* Updated patch version and changelog.

* Added link to changelog in readme.

* Fixed markdown link
2019-07-10 10:54:12 -07:00
eb41735933 Merge pull request #47 from Artem-B/cutlass-1.3-clang
Make CUTLASS compileable with Clang.
2019-05-13 10:52:45 -07:00
fb8b3a98b7 Addressed code review comments. 2019-05-10 10:24:52 -07:00
d9d357877f Added missing file (#48) 2019-05-09 14:07:52 -07:00
e18292db46 Make CUTLASS compileable with Clang.
Requires a recent clang build (r359248 or newer).

Enable compilation with clang with these options:
cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=/path/to/clang++
2019-05-02 11:00:22 -07:00
fe3438a3c1 cutlass 1.3.1 (#46)
CUTLASS 1.3.1 patch resolves failing tests with NVRTC.
2019-04-19 16:54:52 -07:00
877bdcace6 Cutlass 1.3 Release (#42)
CUTLASS 1.3 Release
- Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.
2019-03-20 10:49:17 -07:00
19a9d64e3c Removed patch version from README.
Removed patch version from README.
2018-12-19 15:20:43 -08:00
80e6f7c860 Merge pull request #38 from NVIDIA/resolve_maxwell
Resolved issue for incorrect SGEMM on Maxwell architecture.
2018-12-19 15:17:41 -08:00
822b0952cd Resolved issue for incorrect SGEMM on Maxwell architecture. 2018-12-19 15:07:16 -08:00
ed2ed4d667 Merge pull request #33 from NVIDIA/cutlass_1.2
CUTLASS 1.2
2018-10-26 14:59:50 -07:00
4db423c40f Minor edit to CHANGELOG. 2018-10-26 14:58:31 -07:00
b2bc0d3b79 Updating Doxygen docs 2018-10-26 14:54:58 -07:00
74df0331f2 CUTLASS 1.2 2018-10-26 14:38:46 -07:00
2332df492e Merge pull request #30 from NVIDIA/fix_utilities_example
Fixed cutlass_utilities example.
2018-09-29 15:09:18 -07:00
cfe4b933ef CUDA 9 lacks host-side conversions from float=>half. Instead, we must reinterpret_cast<> from cutlass::half_t => half. 2018-09-29 15:04:20 -07:00
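A sketch of the workaround described in that commit: `cutlass::half_t` can convert from `float` on the host and shares the 16-bit representation of the built-in `half`, so the bits can be reinterpreted (header path per current CUTLASS; the historical layout may have differed):

```cpp
#include <cuda_fp16.h>
#include "cutlass/numeric_types.h"

half to_builtin_half(float x) {
  cutlass::half_t h(x);                     // host-side float -> half_t
  return reinterpret_cast<half const&>(h);  // same 16-bit layout
}
```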
6877595a5e Merge pull request #28 from NVIDIA/cutlass_1.1
Fixed typo
2018-09-28 12:59:49 -07:00
69e3709da4 Fixed typo
Fixed typo
2018-09-28 12:59:20 -07:00
d419094c28 Merge pull request #26 from NVIDIA/cutlass_1.1
Clarification to README
2018-09-21 11:44:47 -07:00
1a7ac522f8 Clarification to README 2018-09-20 11:04:03 -07:00
bf6eec53eb Merge pull request #25 from NVIDIA/cutlass_1.1
Updated CUTLASS.md
2018-09-19 21:33:04 -07:00
206e38dac5 Updated copyright of CUTLASS.md 2018-09-19 21:31:12 -07:00
d85f6a1cec Merge pull request #24 from NVIDIA/cutlass_1.1
Cutlass 1.1
2018-09-19 21:16:53 -07:00
0826572c4c Reduced range of random values to avoid bit-level inconsistencies for large matrices. 2018-09-19 21:11:48 -07:00
77d1e0ca81 Updated README and CHANGELOG. 2018-09-19 20:42:51 -07:00
d7137f9c0a Updated doxygen 2018-09-19 14:02:08 -07:00
461f417b9d Checkpointing CUTLASS 1.1 release. 2018-09-18 16:58:03 -07:00
cf0301e00f Merge pull request #15 from NVIDIA/release_1.0.1_edits
Minor edits to README and changelog pursuant to the CUTLASS 1.0.1 patch.
2018-06-26 13:59:01 -07:00
b9bb0d1a49 Edits to README and changelog pursuant to the CUTLASS 1.0.1 patch. 2018-06-26 13:57:39 -07:00
e1c4ba501b Merge pull request #13 from NVIDIA/cutlass_v1.0.1
Cutlass v1.0.1
2018-06-12 08:25:56 -07:00
c566e83e6d Updated changelog. 2018-06-11 14:54:07 -07:00
374882be53 Replaced GoogleTest copy with submodule. Added updates to support intra-threadblock reductions. Added tests for same. 2018-06-11 11:47:15 -07:00
2c496c3e9e Replaced GoogleTest copy with Git submodule. 2018-06-11 11:32:41 -07:00
9fd55460c6 Merge pull request #10 from NVIDIA/cutlass_v1.0_rel
Minor updates to usage and README.
2018-05-18 12:27:31 -07:00
480732c2e8 Minor updates to usage and readme. 2018-05-17 15:10:55 -07:00
68aaee8773 Merge pull request #9 from NVIDIA/cutlass_v1.0_rel
Updated URL to Doxygen and modified usage statement
2018-05-17 11:12:37 -07:00
acb90e962a Updated url to Doxygen and modified usage statement in performance test program. 2018-05-17 11:11:05 -07:00
96bc3f227f Merge pull request #8 from NVIDIA/cutlass_v1.0_rel
Configured Github Pages
2018-05-16 15:26:55 -07:00
25ff282403 Moved Doxygen documents. 2018-05-16 15:25:24 -07:00
9d5726a568 Set theme jekyll-theme-minimal 2018-05-16 13:49:06 -07:00
8,300 changed files with 1,975,996 additions and 300,392 deletions

.github/ISSUE_TEMPLATE/bug_report.yml vendored Normal file

@@ -0,0 +1,38 @@
name: Bug Report
description: Create a bug report to help us improve CUTLASS
title: "[BUG] "
labels: ["? - Needs Triage", "bug"]
assignees: []
body:
  - type: dropdown
    id: component
    attributes:
      label: Which component has the problem?
      options:
        - CuTe DSL
        - CUTLASS C++
    validations:
      required: true
  - type: textarea
    id: bug-report
    attributes:
      label: Bug Report
      description: Please fill out all sections below
      value: |
        **Describe the bug**
        A clear and concise description of what the bug is.

        **Steps/Code to reproduce bug**
        Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

        **Expected behavior**
        A clear and concise description of what you expected to happen.

        **Environment details (please complete the following information):**
        - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]

        **Additional context**
        Add any other context about the problem here.
    validations:
      required: true

.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
  - name: CUTLASS Discord
    url: https://discord.gg/nvidiadeveloper
    about: Come chat about using and contributing to CUTLASS!


@@ -0,0 +1,35 @@
---
name: Documentation request
about: Report incorrect or needed documentation to improve CUTLASS
title: "[DOC]"
labels: "? - Needs Triage, documentation"
assignees: ''
---

## Report incorrect documentation

**Location of incorrect documentation**
Provide links and line numbers if applicable.

**Describe the problems or issues found in the documentation**
A clear and concise description of what you found to be incorrect.

**Steps taken to verify documentation is incorrect**
List any steps you have taken:

**Suggested fix for documentation**
Detail proposed changes to fix the documentation if you have any.

---

## Report needed documentation

**Report needed documentation**
A clear and concise description of what documentation you believe is needed and why.

**Describe the documentation you'd like**
A clear and concise description of what you want to happen.

**Steps taken to search for needed documentation**
List any steps you have taken:


@@ -0,0 +1,35 @@
name: Feature Request
description: Suggest an idea for CUTLASS
title: "[FEA] "
labels: ["? - Needs Triage", "feature request"]
assignees: []
body:
  - type: dropdown
    id: component
    attributes:
      label: Which component requires the feature?
      options:
        - CuTe DSL
        - CUTLASS C++
    validations:
      required: true
  - type: textarea
    id: feature-request
    attributes:
      label: Feature Request
      description: Please fill out all sections below
      value: |
        **Is your feature request related to a problem? Please describe.**
        A clear and concise description of what the problem is. Ex. I wish I could use CUTLASS to do [...]

        **Describe the solution you'd like**
        A clear and concise description of what you want to happen.

        **Describe alternatives you've considered**
        A clear and concise description of any alternative solutions or features you've considered.

        **Additional context**
        Add any other context, code examples, or references to existing implementations about the feature request here.
    validations:
      required: true


@ -0,0 +1,10 @@
---
name: Submit question
about: Ask a general question about CUTLASS
title: "[QST]"
labels: "? - Needs Triage, question"
assignees: ''
---
**What is your question?**

.github/workflows/auto-label-issues.yml vendored Normal file

@ -0,0 +1,51 @@
name: Auto Label Issues
on:
  issues:
    types: [opened]
jobs:
  add-labels:
    runs-on: ubuntu-latest
    permissions:
      issues: write
    steps:
      - name: Add component label
        uses: actions/github-script@v7
        with:
          script: |
            const issue = context.payload.issue;
            const body = issue.body || '';

            // Parse the issue body to find the component selection
            // GitHub renders dropdown selections as "### {label}\n\n{selection}"
            // Check for both bug report and feature request dropdown labels
            const bugComponentMatch = body.match(/### Which component has the problem\?\s*\n\s*\n\s*(.+?)(?:\n|$)/);
            const featureComponentMatch = body.match(/### Which component requires the feature\?\s*\n\s*\n\s*(.+?)(?:\n|$)/);
            const componentMatch = bugComponentMatch || featureComponentMatch;

            if (componentMatch) {
              const component = componentMatch[1].trim();
              let label = '';

              // Map component selections to labels
              switch (component) {
                case 'CuTe DSL':
                  label = 'CuTe DSL';
                  break;
                case 'CUTLASS C++':
                  label = 'CUTLASS C++';
                  break;
              }

              if (label) {
                await github.rest.issues.addLabels({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: issue.number,
                  labels: [label]
                });
                console.log(`Added label: ${label}`);
              }
            }

.github/workflows/blossom-ci.yml vendored Normal file

@ -0,0 +1,112 @@
#################################################################################################
#
# Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
#################################################################################################
# A workflow to trigger ci on hybrid infra (github + self hosted runner)
name: Blossom-CI
on:
  issue_comment:
    types: [created]
  workflow_dispatch:
    inputs:
      platform:
        description: 'runs-on argument'
        required: false
      args:
        description: 'argument'
        required: false

jobs:
  Authorization:
    name: Authorization
    runs-on: blossom
    outputs:
      args: ${{ env.args }}
    # This job only runs for pull request comments
    if: |
      (startsWith(github.event.comment.body, '/bot run') ||
      startsWith(github.event.comment.body, '/bot kill')) && contains(
      fromJson('["nv-fastkernels-cicd", "zekunf-nv", "hwu36", "IonThruster", "thakkarV", "d-k-b", "mihir-awatramani", "fengxie", "vickiw973", "Junkai-Wu", "brandon-yujie-sun", "lijingticy22", "hongw-nv", "vikgupta-nv", "IwakuraRein", "depaulmillz", "jackkosaian", "itramble", "ccecka", "sxtyzhangzk", "hbarclay", "yzhaiustc", "x86vk", "sklevtsov-nvidia", "ANIKET-SHIVAM", "Shreya-gaur", "azhurkevich", "serifyesil", "richardmcai", "lsyyy666", "Ethan-Yan27", "XiaoSong9905", "shdetect", "keithzzzzz"]'),
      github.actor)
    steps:
      - name: Check if comment is issued by authorized person
        run: blossom-ci
        env:
          OPERATION: 'AUTH'
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_KEY_DATA: ${{ secrets.BLOSSOM_KEY }}

  Vulnerability-scan:
    name: Vulnerability scan
    needs: [Authorization]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          repository: ${{ fromJson(needs.Authorization.outputs.args).repo }}
          ref: ${{ fromJson(needs.Authorization.outputs.args).ref }}
          lfs: 'true'
      - name: Run blossom action
        uses: NVIDIA/blossom-action@main
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_KEY_DATA: ${{ secrets.BLOSSOM_KEY }}
        with:
          args1: ${{ fromJson(needs.Authorization.outputs.args).args1 }}
          args2: ${{ fromJson(needs.Authorization.outputs.args).args2 }}
          args3: ${{ fromJson(needs.Authorization.outputs.args).args3 }}

  Job-trigger:
    name: Start ci job
    needs: [Vulnerability-scan]
    runs-on: blossom
    steps:
      - name: Start ci job
        run: blossom-ci
        env:
          OPERATION: 'START-CI-JOB'
          CI_SERVER: ${{ secrets.CI_SERVER }}
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  Upload-Log:
    name: Upload log
    runs-on: blossom
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Jenkins log for pull request ${{ fromJson(github.event.inputs.args).pr }} (click here)
        run: blossom-ci
        env:
          OPERATION: 'POST-PROCESSING'
          CI_SERVER: ${{ secrets.CI_SERVER }}
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/labeler.yml vendored Normal file

@ -0,0 +1,11 @@
name: "Pull Request Labeler"
on:
- pull_request_target
jobs:
triage:
runs-on: ubuntu-latest
steps:
- uses: actions/labeler@main
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"


@ -0,0 +1,35 @@
name: Auto Assign New Issues to Triage Project
on:
  issues:
    types: [opened]
env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
  assign_one_project:
    runs-on: ubuntu-latest
    name: Assign New Issues to Triage Project
    steps:
      - name: Process bug issues
        uses: docker://takanabe/github-actions-automate-projects:v0.0.1
        if: contains(github.event.issue.labels.*.name, 'bug') && contains(github.event.issue.labels.*.name, '? - Needs Triage')
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
          GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
      - name: Process feature issues
        uses: docker://takanabe/github-actions-automate-projects:v0.0.1
        if: contains(github.event.issue.labels.*.name, 'feature request') && contains(github.event.issue.labels.*.name, '? - Needs Triage')
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
          GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'
      - name: Process other issues
        uses: docker://takanabe/github-actions-automate-projects:v0.0.1
        if: contains(github.event.issue.labels.*.name, '? - Needs Triage') && (!contains(github.event.issue.labels.*.name, 'bug') && !contains(github.event.issue.labels.*.name, 'feature request'))
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_PROJECT_URL: https://github.com/NVIDIA/cutlass
          GITHUB_PROJECT_COLUMN_NAME: 'Needs prioritizing'

.github/workflows/stale.yml vendored Normal file

@ -0,0 +1,57 @@
name: Mark inactive issues and pull requests
on:
  schedule:
    - cron: "0 * * * *"
jobs:
  mark-inactive-30d:
    runs-on: ubuntu-latest
    steps:
      - name: Mark 30 day inactive issues and pull requests
        uses: actions/stale@v3
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-issue-message: >
            This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days.
            Please close this issue if no further response or action is needed.
            Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
            This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
          stale-issue-label: "inactive-30d"
          exempt-issue-labels: "0 - Blocked,0 - Backlog,good first issue"
          days-before-issue-stale: 30
          days-before-issue-close: -1
          stale-pr-message: >
            This PR has been labeled `inactive-30d` due to no recent activity in the past 30 days.
            Please close this PR if it is no longer required.
            Otherwise, please respond with a comment indicating any updates.
            This PR will be labeled `inactive-90d` if there is no activity in the next 60 days.
          stale-pr-label: "inactive-30d"
          exempt-pr-labels: "0 - Blocked,0 - Backlog,good first issue"
          days-before-pr-stale: 30
          days-before-pr-close: -1
          operations-per-run: 50
  mark-inactive-90d:
    runs-on: ubuntu-latest
    steps:
      - name: Mark 90 day inactive issues and pull requests
        uses: actions/stale@v3
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-issue-message: >
            This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days.
            Please close this issue if no further response or action is needed.
            Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
          stale-issue-label: "inactive-90d"
          exempt-issue-labels: "0 - Blocked,0 - Backlog,good first issue"
          days-before-issue-stale: 90
          days-before-issue-close: -1
          stale-pr-message: >
            This PR has been labeled `inactive-90d` due to no recent activity in the past 90 days.
            Please close this PR if it is no longer required.
            Otherwise, please respond with a comment indicating any updates.
          stale-pr-label: "inactive-90d"
          exempt-pr-labels: "0 - Blocked,0 - Backlog,good first issue"
          days-before-pr-stale: 90
          days-before-pr-close: -1
          operations-per-run: 50

.gitignore vendored Normal file

@ -0,0 +1,4 @@
# PyCache files
__pycache__/
cutlass_library.egg-info/
/build*

CHANGELOG.md Normal file

@ -0,0 +1,864 @@
# Changelog
# CUTLASS 4.x
## [4.3.0](https://github.com/NVIDIA/cutlass/tree/main) (2025-10-20)
### CuTe DSL
* Debuggability improvements:
- Supported source location tracking for DSL APIs
- Supported dumping PTX and CUBIN code
* More examples and notebooks to get started with CuTe DSL:
- [Kernel launch with Programmatic Dependent Launch](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/programmatic_dependent_launch.py)
- Improved performance of the [elementwise kernel](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/elementwise_apply.py):
+ Generalize the code to handle a list of input tensors
+ Generalize the TV layout computation to handle different data types
- Demonstrate the new Pipeline APIs in [Blackwell SM100 persistent dense GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py):
+ New Pipeline API `PipelineProducer` and `PipelineConsumer` to simplify code (no more explicit pipeline state management)
- Separate epilogue code for non-TMA and TMA implementations
+ Note that the updates simplify the code, but existing APIs still work and are supported
- [Basic Blackwell SM100 GEMM with decent performance](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/tutorial_gemm/fp16_gemm_0.py)
+ Simple tutorial achieves 84% SOL performance with MNK 8K
- Reworked [elementwise add notebook](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/elementwise_add.ipynb) with a more detailed explanation of the TV layout
+ Updated implementation to handle general data type and multiple inputs
+ Updated explanation for TV layout in simpler language
+ Added visualization of TV Layout with 3rd party utils
- [Benchmark and autotune demonstration](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/benchmark_autotune.ipynb)
* More examples of authoring peak-performance kernels:
- [Blackwell SM100 mixed-input GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mixed_input_gemm.py)
- [Blackwell SM100 persistent blockwise dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/blockwise_gemm.py)
- [Blackwell SM100 persistent blockwise contiguous grouped dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/contiguous_grouped_gemm.py)
- [Blackwell SM100 persistent blockwise masked grouped dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/masked_grouped_gemm.py)
- [Blackwell SM100 fmha bwd](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/fmha_bwd.py)
- [Blackwell SM100 mla](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mla.py)
- [Hopper SM90 persistent dense GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/hopper/dense_gemm_persistent.py)
- [Blackwell GeForce batched dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell_geforce/dense_gemm.py)
- [Ampere HSTU Attention](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/hstu_attention.py)
* API updates:
- Please refer to [DSL API changelog](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
* Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in [Blackwell SM100 persistent dense blockscaled GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py).
- Fixed ``TensorSSA.reduce`` to support a static value as the initial value
- Updated the docstrings for the following APIs to be more concise and easier to understand:
- ``make_layout_tv``
- ``is_static``
- ``PipelineAsync``
- ``SmemAllocator``
- Fixed documentation for ``pipeline``, ``utils`` and ``cute.math``
### CUTLASS C++
* Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
- Add softmax skip correction.
- Fix a shared memory allocation bug: a kernel must explicitly opt in to maximum dynamic shared memory once its allocation exceeds 48KB.
- Fix a deadlock caused by an early-returning warp.
* Add a Ragged Contiguous Grouped GEMM kernel in [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm/).
- This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
* Optimize grouped GEMM kernels by enabling async TMA descriptor updates.
* Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: [fprop_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/fprop/sm100_conv3d_fprop_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu), [dgrad_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/dgrad/sm100_conv3d_dgrad_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu), [wgrad_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/wgrad/sm100_conv2d_wgrad_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu).
* Add profiler support for Blackwell SM100 and SM120 blockscaled sparse kernels.
* Fix some kernel issues:
- Fix a race condition in Blackwell SM103 kernels by adding a missing elect-one guard to prefetch barrier initialization.
- Allow users to directly specify the number of stages for Hopper SM90 mixed-input GEMM.
- Remove warnings caused by the CUDA vector type alignment settings in CUDA 13.
- Remove problematic `cutlass::int8_t` and replace it with `int8_t`.
* Fix some profiler issues:
- Add some missing reference kernels.
- Account for scale factors A and B in the `bytes_with_problem_shape` function of the block-scaled profiler.
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
* Optimal code generation with CUDA toolkit versions 13.0U1.
## [4.2.1](https://github.com/NVIDIA/cutlass/releases/tag/v4.2.1) (2025-09-22)
### CuTe DSL
* Bug fixes and improvements
- Fixed an issue when running DSL code with cuda-python 13.0
- Fixed an issue when running inductor with DSL code
- Fixed an issue with unexpected logging when running DSL code in FlashInfer
- Fixed the issue reported in https://github.com/NVIDIA/cutlass/issues/2647
- Fixed an issue with conditionally defined variables outside of dynamic control flow
### CUTLASS C++
* Bypass EVT for no-smem blockwise kernels on Blackwell.
* Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
## [4.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0) (2025-09-15)
### CuTe DSL
* More Python versions are now supported for both x86-64 and aarch64, including
- Python 3.10, 3.11, 3.12, and 3.13
* Added new example and updated notebook to get started with CuTe DSL
- [Call kernels with dlpack bypassed](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/call_bypass_dlpack.py)
- Updates on [TensorSSA demonstration](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/tensorssa.ipynb)
+ Added a section introducing broadcasting
* API updates
- Please refer to [DSL API changelog](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
* Bug fixes and improvements
- Fixed ``cute.print_tensor`` for coordinate tensors
- Fixed `cute.print` for tuples of layouts
- Fixed a frozen object not being properly updated after being fully assigned in dynamic control flow
- Fixed a compilation failure caused by assigning a tuple/list element in dynamic control flow
- Improved the error message when the CUDA context is not initialized
- Improved the docstrings of congruent and weakly_congruent
### CUTLASS C++
* Support for Blackwell SM103 kernels for B300 GPUs.
- Collective mainloop code: [Blockscaled datatypes with support for dense GEMM mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm103_blockscaled_mma_warpspecialized.hpp)
- New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- Kernel code: [Blockscaled datatypes with support for dense GEMM kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm103_blockscaled_gemm_tma_warpspecialized.hpp).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- [Blockscaled ultra fp4 dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/89_sm103_fp4_ultra_gemm/).
- [Blockscaled ultra fp4 dense grouped GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/90_sm103_fp4_ultra_grouped_gemm).
* Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
- Unit test files with prefix name of `sm103_` under [GEMM device unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/).
* Support for Blackwell SM121 kernels for DGX Spark GPUs.
- These share most of their code with the Blackwell SM120 kernels.
* Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
- For details, please refer to the [heuristics doc](https://github.com/NVIDIA/cutlass/tree/main/media/docs/cpp/heuristics.md).
* Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
- Add fused reduction kernel support for cutlass MLA.
- Add softmax skip correction.
- Support for GQA in FMHA backward kernel.
- Fix an issue where `get_unmasked_trip_count` may return a negative value.
- Fix an issue where mbarriers are initialized with a zero arrival count.
- Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
- Remove TMA padding for forward kernel inputs.
* Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm/). It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
* Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
- On Blackwell SM120, a blockwise gemm kernel is added: [example 87](https://github.com/NVIDIA/cutlass/tree/main/examples/87_blackwell_geforce_gemm_blockwise/).
- On Hopper, add K-major scale factor support for SM90 blockwise kernels.
- On Hopper, relax the restriction that the k dimension of the problem size has to be a multiple of the k dimension of the tile size.
- On Hopper, the grouped version supports the case where k = 0.
* Support for Blackwell SM100 fp4 gemv kernels.
- Kernel code: [Gemv kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/gemv_blockscaled.h).
- Example code: [example 91](https://github.com/NVIDIA/cutlass/tree/main/examples/91_fp4_gemv/)
* Support for Blackwell SM100 legacy mixed input GEMM kernels.
- Collective mainloop code: [Mixed input mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp).
- Kernel code: [Mixed input kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized_mixed_input_transform.hpp).
- Example code: [example 86](https://github.com/NVIDIA/cutlass/tree/main/examples/86_blackwell_mixed_dtype_gemm/).
* Support for Blackwell SM100 cpasync kernel.
- Collective mainloop code: [cpasync mainloop](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_cpasync_warpspecialized.hpp).
- Kernel code: [cpasync kernel](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_cpasync_warpspecialized.hpp).
* Support Blackwell SM120 mixed input blockscaled grouped GEMM.
* Instantiate more Blackwell kernels in the profiler.
- Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
- To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty. The profiler will combine `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
- For details, please check the [Profiler Doc](https://github.com/NVIDIA/cutlass/tree/main/media/docs/cpp/profiler.md).
* Fix some profiler issues:
- Modify the default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
- Fix some no output and timeout issues.
- Fix Pingpong Blockwise Hopper library generation.
* From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
- For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
* Rename the legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
- Restructured the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
- Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
- Added some support for running SM100 kernels via the Python interface.
* CuTe changes:
- Fix inaccurate GridDim calculation under [CuTe tutorial](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/blackwell/).
- Add [movmatrix](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-movmatrix) support.
- Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
- Support fp16 accumulator for sm89 fp8 mma.
- Shorten `nullspace` implementation.
- Isolate and comment on risky `cosize` changes.
- Important documentation correction: `E<0,1> == 1@0@1`.
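To make the corrected identity concrete, here is a minimal sketch (our example, assuming only CuTe's public header; the printed form follows the correction above):

```cpp
#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  // E<0,1> is CuTe's scaled-basis element for hierarchical mode (0,1):
  // a unit basis one level deep, in mode 0, sub-mode 1. Per the corrected
  // documentation it is 1@0@1, not a flat basis element.
  auto e = E<0,1>{};
  print(e); print("\n");   // expected output: 1@0@1
  return 0;
}
```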
* Fix some kernel issues:
- Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
- Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
* Add following unit tests:
- [fp16 accumulator for sm89 fp8 mma](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/ampere/cooperative_gemm.cu)
- [movmatrix test](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/turing/movm.cu)
- [fp16 narrow mma n](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f16_f16_void_f32_narrow_mma_n.cu) and [fp8 narrow mma n](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_tensorop_gemm/f8_f8_void_bf16_narrow_mma_n.cu)
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
* Optimal code generation with CUDA toolkit versions 13.0U1.
## [4.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v4.1.0) (2025-07-16)
### CuTe DSL
* Add aarch64 support; you can now pip install `nvidia-cutlass-dsl` on GB200 systems!
* More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- [Blackwell Mamba2 SSD](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mamba2_ssd/mamba2_ssd.py)
- [Blackwell SM100 persistent dense blockscaled GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py)
* API updates
- Please refer to [DSL API changelog](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
### CUTLASS C++
* Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- The code supports empty batch sequences.
* Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
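As a hedged illustration of the new idiom (the function name and shapes here are ours; `cute::recast_ptr` is the API named above), constructing a logical tensor over sub-byte data now looks like:

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// Minimal sketch: view raw bytes as a logical int4 tensor via recast_ptr
// instead of constructing a subbyte_iterator directly.
void view_as_int4(uint8_t* raw_bytes) {
  auto p = recast_ptr<int4_t>(raw_bytes);                 // logical int4_t pointer
  auto t = make_tensor(p, make_layout(make_shape(Int<16>{})));
  print_tensor(t);                                        // 16 int4 elements (8 bytes)
}
```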
* CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
- Factor out `print_latex` and friends and rewrite.
- Factor out `print_svg` and friends and rewrite.
* Support Blackwell SM100 SIMT packed fp32x2 kernels.
* Support residual add for implicit gemm kernels.
* Various fixes for the CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for sm90 to report invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
- Replace the NotImplementedError by packing all nodes into a single topological visitor node as a fallback.
* Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
* Fix some profiler issues.
- Complete the reference for Blackwell blockwise gemm kernels.
- Fix incorrect regex logic for L1 test.
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
* Optimal code generation with CUDA toolkit versions 12.9.
## [4.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0) (2025-06-03)
### CuTe DSL
* CuTe DSL, a Python DSL centered around CuTe's abstractions
- [Core DSL implementation files](https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL)
- [DSL quick start](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html)
- [DSL Overview](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/overview.html)
* [Overhauled documentation with a new dedicated website](https://docs.nvidia.com/cutlass/latest)
* Set of examples demonstrating how to use CuTe DSL to write peak-performance kernels
- [Blackwell SM100 persistent dense GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py)
- [Blackwell SM100 grouped GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/grouped_gemm.py)
- [Blackwell SM100 fused multi-head attention forward pass](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/fmha.py)
- [Hopper GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/hopper/dense_gemm.py)
- [Ampere GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/tensorop_gemm.py)
- [FlashAttention-2 implementation targeting Ampere and Ada class GPUs (SM80, SM86, SM89)](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/flash_attention_v2.py)
- [SmemAllocator to facilitate shared memory allocation and management](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/smem_allocator.py)
- [C-structure based customized interface between JIT function and user codes](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/ffi/jit_argument.py)
* [Educational notebooks for getting started with CuTe DSL](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks)
* API updates
- Please refer to [DSL API changelog](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
### CUTLASS C++
* Support [Family Specific Architecture Features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/) which was introduced in CUDA 12.9
- 100f, 101f, 120f were added to support Family Specific Architecture Features which allows running the same binary on different chips belonging to the same Family (e.g. sm100) without recompiling. Note 101a is supported since CUTLASS 3.9
* Instruction shapes and redundant accumulation type have been removed from CUTLASS 3.x-style library kernel names to disambiguate kernels and shorten names.
- For example:
+ `(old) cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma`
+ `(new) cutlass3x_sm90_tensorop_gemm_bf16_bf16_f32_bf16_bf16_128x256x64_1x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma`
- If you are using the CUTLASS library kernel names directly (e.g. to compile a subset of the CUTLASS library with `-DCUTLASS_LIBRARY_KERNELS`, or to filter kernels in the CUTLASS profiler with `--kernels`), please update your usage accordingly; this is a breaking change.
* Further improved [Blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMMs on Hopper and Blackwell.
- Added non-power-of-two tile sizes.
- Improved performance for K-major scale factors.
- The argument `mma_promotion_interval` has been removed from non-grouped GEMM to align with the grouped and Blackwell SM100 versions.
* Enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
- Support LSE output in FMHA Forward kernel.
- Enhance performance measurement: support for different warmup iterations; buffer rotation to keep L2 cold; separate testing of persistent and non-persistent modes.
- Enhance testing of variable sequence length.
- Disable B2B mode in MLA to simplify the sample.
- Clarify that `fmha_gen` sample only supports head dim 128.
- Fixes for split-kv output in MLA.
* Improve Blackwell and Hopper grouped GEMM performance, functionality, and profiler support.
- Enable runtime datatype for Blackwell SM100 grouped GEMM. Profiler support is also added.
- Enable kernel parameter exploration for Blackwell SM100 grouped GEMM - raster_order, swizzle.
* Add [Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/).
* Add dynamic and preferred cluster support for convolution Blackwell SM100 kernels.
* Fix profiler issues that caused no output or a not-supported error for some kernels.
* Optimizations for Blackwell SM100 and SM120 block scaled kernels.
* Support for Blackwell SM120 blockwise dense gemm in CUTLASS library and profiler.
* New [Hopper SM90 FMHA example](https://github.com/NVIDIA/cutlass/tree/main/examples/88_hopper_fmha/), similar in design to the existing [Blackwell FMHA](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
* CuTe changes:
- Rework `cute::copy_if` so that the predicate tensor is a true CuTe Tensor rather than a lambda, and introduce transform-tensors to avoid any extra register or load/store overhead when using bool-tensors (see the sketch after this list).
- New [CuTe tutorial](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/tiled_copy_if.cu) to show the usage of copy_if in tile copy.
- Add [CuTe C++ reduce op](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/tensor_reduce.hpp).
- Add several [unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/cute/core/tensor_algs.cpp) for CuTe tensor algorithms.
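A minimal sketch of the reworked contract (the wrapper name and template plumbing are ours; `cute::copy_if` is the API named in the item above):

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// Predicated tile copy: the predicate is now a real CuTe tensor with the
// same profile as src/dst, not a lambda.
template <class PredEngine, class PredLayout,
          class SrcEngine,  class SrcLayout,
          class DstEngine,  class DstLayout>
CUTE_HOST_DEVICE
void guarded_copy(Tensor<PredEngine, PredLayout> const& pred,   // bool tensor
                  Tensor<SrcEngine,  SrcLayout>  const& src,
                  Tensor<DstEngine,  DstLayout>       & dst) {
  copy_if(pred, src, dst);   // element i is copied only where pred(i) is true
}
```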
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
* Optimal code generation with CUDA toolkit versions 12.9.
# CUTLASS 3.x
## [3.9.2](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.2) (2025-05-03)
* Fixed [Blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM hang issue when problem size K is 128.
* Optimal code generation with CUDA toolkit versions 12.9.
## [3.9.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.1) (2025-04-30)
* Fixed Group Gemm hang issue in CUTLASS 3.x
* Improved Hopper [Blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) and [Groupwise](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) GEMM performance.
## [3.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.9.0) (2025-04-24)
* Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
- Collective mainloops targeting:
* [Blockscaled datatypes with support for dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_mma_tma.hpp)
* [Blockscaled datatypes with support for sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm120_blockscaled_sparse_mma_tma.hpp)
- New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp) and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell SM120 epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_visitor_store_tma_warpspecialized.hpp) and [full set of EVT fusions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm120_callbacks_tma_warpspecialized.hpp).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- [Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu).
- [Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79b_blackwell_geforce_nvfp4_nvfp4_gemm.cu).
- [Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu).
- [Grouped GEMM with nvfp4 datatype](https://github.com/NVIDIA/cutlass/tree/main/examples/79_blackwell_geforce_gemm/79d_blackwell_geforce_nvfp4_grouped_gemm.cu).
- [Sparse Blockscaled GEMM with mxfp8 input datatype and BF16 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/80_blackwell_geforce_sparse_gemm/80a_blackwell_geforce_mxfp8_bf16_sparse_gemm.cu).
- [Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor](https://github.com/NVIDIA/cutlass/tree/main/examples/80_blackwell_geforce_sparse_gemm/80b_blackwell_geforce_nvfp4_nvfp4_sparse_gemm.cu).
* Set of unit tests that demonstrate the usage of both [sparse](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_sparse_tensorop_gemm/) and [dense](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm120_blockscaled_tensorop_gemm/) Blackwell SM120 blockscaled GEMM.
* Support for Blackwell SM100 Sparse kernels:
- Collective mainloop targeting:
* [SM100 Sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_sparse_mma_warpspecialized.hpp)
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 Sparse GEMM:
- [Sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/83_blackwell_sparse_gemm/83_blackwell_sparse_gemm.cu)
- [Blockscaled Sparse GEMM with NVFP4 input data type](https://github.com/NVIDIA/cutlass/tree/main/examples/84_blackwell_narrow_precision_sparse_gemm/84a_blackwell_nvfp4_bf16_sparse_gemm.cu)
- [Blockscaled Sparse GEMM with mixed input data type (MXFP8 and MXFP4)](https://github.com/NVIDIA/cutlass/tree/main/examples/84_blackwell_narrow_precision_sparse_gemm/84b_blackwell_mixed_mxfp8_bf16_sparse_gemm.cu)
* Set of unit tests that demonstrate the usage of [sparse](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_sparse_tensorop_gemm) and [blockscaled sparse](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm100_blockscaled_sparse_tensorop_gemm) Blackwell SM100 GEMM.
* A new Multi-head Latent Attention (MLA) for SM100 Blackwell architecture in CUTLASS [example](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/) covers the flashMLA-like weight-absorbed decoding use-case.
* A new FMHA Backward kernel for SM100 Blackwell architecture extends CUTLASS [example](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/) to show how the five backward pass MMAs can be fused into a single kernel to achieve high performance.
* A new [distributed GEMM example](https://github.com/NVIDIA/cutlass/tree/main/examples/82_blackwell_distributed_gemm/82_blackwell_distributed_gemm.cu) for SM100 Blackwell architecture.
* Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
- Enhancement of [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) for Hopper architecture.
- Enhancement of [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu) for Hopper architecture.
- Support for [grouped GEMM with blockwise and groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling/) for Hopper architecture.
- Support for [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/tools/profiler/src/blockwise_gemm_operation_profiler.cu) in the CUTLASS profiler.
- Support for [blockwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_blockwise.cu) for Blackwell architecture.
- Support for [groupwise GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_gemm_groupwise.cu) for Blackwell architecture.
- Support for [grouped GEMM with blockwise](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_blockwise.cu) and [groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/81_blackwell_gemm_blockwise/81_blackwell_grouped_gemm_groupwise.cu) for Blackwell architecture.
* Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
- More detailed introductions and examples to leverage this feature can be found in [profiler.md](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html#exhaustive-search-mode-and-top-k-output-ranking-according-to-performance-in-gflopss).
* Support `void` as the D element in sm100 kernel epilogues.
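As a hedged sketch of what this enables (tile/cluster shapes, element types, and alignments below are placeholders of ours, not a verified configuration from this release):

```cpp
#include <cutlass/epilogue/collective/collective_builder.hpp>
#include <cute/tensor.hpp>

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;  // placeholder
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;       // placeholder

// ElementD = void: the epilogue skips writing a D tensor entirely.
using CollectiveEpilogue = cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                   // accumulator / compute
    cutlass::half_t, cutlass::layout::RowMajor, 8,  // source C
    void,            cutlass::layout::RowMajor, 8,  // D = void: no store
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;
```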
* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
* Optimal code generation with CUDA toolkit versions 12.8U1.
## [3.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.8.0) (2025-01-25)
* Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- [5th generation Blackwell Tensor Core instructions (TCGen05)](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/mma_traits_sm100.hpp) via CuTe MMA atoms.
- Extensions to [Tensor Memory Accelerator](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/copy_traits_sm100_tma.hpp) via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as [`tmem`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/pointer.hpp) across CuTe as a first-class data locale.
- Exposure of the [`tmem->rmem`, `rmem->tmem`, and `smem->tmem` data movement instructions](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/copy_traits_sm100.hpp) as copy atoms in CuTe.
- [`make_tmem_copy()`](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/copy_traits_sm100.hpp) utility method to ease creation of tiled copies for tmem copy atoms.
- Support for [new variants of LDSM on Blackwell](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/copy_traits_sm100.hpp) via CuTe Copy atoms.
* Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision [FP4, FP6, and FP8](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/exmy_base.h) formats as well as their [block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/float_subbyte.h)
- [Pipelines that implement Blackwell specific synchronization](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/pipeline/sm100_pipeline.hpp).
- [Cluster launch control API supporting preferred and fallback cluster shapes](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/cluster_launch.hpp).
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/blackwell_cluster_launch_control.html) to implement dynamic persistence scheduling for [GEMMs](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
* Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- [Blackwell specific kernel layers](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
+ Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
+ Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
+ Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
* [Non-block scaled data types without support for pointer array and grouped GEMM with TMA](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_warpspecialized.hpp)
* [Non-block scaled data types with support for pointer array and grouped GEMM with TMA](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_mma_array_warpspecialized.hpp)
* [Block scaled data types without support for pointer array and grouped GEMM with TMA](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp)
* [Block scaled data types with support for pointer array and grouped GEMM with TMA](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp)
- Blackwell [collective mainloop for convolution kernels](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp) supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New [GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/dispatch_policy.hpp), [convolution](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/dispatch_policy.hpp), and [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell epilogue that supports loading accumulators from `tmem`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/sm100_epilogue_tma_warpspecialized.hpp) and full set of EVT fusions.
* CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command line argument parsing to set dynamic cluster shapes.
- Support for dynamic datatypes via profiler command line argument parsing to set the dynamic datatype in TCGen05 MMA instruction descriptors.
- Support for mixed input GEMM kernels on Hopper in the profiler.
* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
* A new 3.x version of grouped GEMM has been added to the CUTLASS library, generating kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details).
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
- [Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples](https://github.com/NVIDIA/cutlass/tree/main/examples/70_blackwell_gemm/), demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
- GEMM with [opt-in collective builder schedules showcasing available recipes](https://github.com/NVIDIA/cutlass/tree/main/examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu) for Blackwell.
- Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores:
+ [NVFP4 inputs with BF16 output](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu)
+ [NVFP4 inputs with NVFP4 output](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72b_blackwell_nvfp4_nvfp4_gemm.cu)
+ [Mixed MXFP8 and MXFP6 inputs with BF16 output](https://github.com/NVIDIA/cutlass/tree/main/examples/72_blackwell_narrow_precision_gemm/72c_blackwell_mixed_mxfp8_bf16_gemm.cu)
- GEMM example demonstrating [Blackwell's new preferred cluster support via dynamic cluster shapes](https://github.com/NVIDIA/cutlass/tree/main/examples/73_blackwell_gemm_preferred_cluster/blackwell_gemm_preferred_cluster.cu) for increased occupancy.
- [GEMM with CLC based StreamK scheduler for load balancing](https://github.com/NVIDIA/cutlass/tree/main/examples/74_blackwell_gemm_streamk/blackwell_gemm_streamk.cu).
- Grouped GEMM for [vanilla FP8 data inputs](https://github.com/NVIDIA/cutlass/tree/main/examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm.cu) and [NVFP4 block scaled inputs](https://github.com/NVIDIA/cutlass/tree/main/examples/75_blackwell_grouped_gemm/75_blackwell_grouped_gemm_block_scaled.cu).
- Convolution kernels for [fprop](https://github.com/NVIDIA/cutlass/tree/main/examples/76_blackwell_conv/76_blackwell_conv_fprop.cu), [dgrad](https://github.com/NVIDIA/cutlass/tree/main/examples/76_blackwell_conv/76_blackwell_conv_dgrad.cu), and [wgrad](https://github.com/NVIDIA/cutlass/tree/main/examples/76_blackwell_conv/76_blackwell_conv_wgrad.cu).
- [Fused multi-head attention fprop kernel](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
- A new BF16x9 GEMM [kernel](https://github.com/NVIDIA/cutlass/tree/main/examples/78_blackwell_emulated_bf16x9_gemm/78_blackwell_emulated_bf16x9_gemm.cu) that emulates FP32 GEMM (SGEMM) using BF16 operations.
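Loosely, the emulation idea in our notation (the example defines the actual splitting and kernel): each FP32 operand is decomposed into three BF16 components capturing successive mantissa bits, and the product expands into nine BF16 multiplies accumulated in FP32 by the Tensor Cores:

$$
a \approx a_{\mathrm{hi}} + a_{\mathrm{mid}} + a_{\mathrm{lo}}, \qquad
b \approx b_{\mathrm{hi}} + b_{\mathrm{mid}} + b_{\mathrm{lo}},
$$
$$
ab \approx \sum_{i \in \{\mathrm{hi},\mathrm{mid},\mathrm{lo}\}} \;\sum_{j \in \{\mathrm{hi},\mathrm{mid},\mathrm{lo}\}} a_i\, b_j ,
$$

which recovers near-SGEMM accuracy from BF16 hardware at the cost of nine MMAs per FP32 product.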
* Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new [Hopper grouped GEMM kernels](https://github.com/NVIDIA/cutlass/tree/main/examples/69_hopper_mixed_dtype_grouped_gemm/) that support mixed A and B datatypes.
- A new [Hopper FP8 GEMM with groupwise scaling](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu).
* Documentation updates:
- [Quickstart - instantiating a Blackwell block-scaled GEMM](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html#instantiating-a-blackwell-sm100-gemm-kernel).
- Detailed [Blackwell block-scaled GEMM functionality documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/blackwell_functionality.html)
- A new [functionality documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/functionality.html) specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc. for 3.x supported architectures.
- Updates to [compatibility](https://docs.nvidia.com/cutlass/latest/overview.html#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](https://docs.nvidia.com/cutlass/latest/overview.html#target-architecture).
- Updates to [profiler documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html) for testing mixed input GEMM kernels on Hopper.
## [3.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.7.0) (2025-01-11)
- [Hopper blockwise scaling FP8 GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) uses a 2D scaling tensor, assigning one value per threadblock. This allows finer-grained scaling to be applied to each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are [here](https://github.com/NVIDIA/cutlass/pull/1932#issue-2645398439).
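In our notation (a sketch of the scheme described above, not code from the example), the mainloop computes:

$$
C_{mn} \mathrel{+}= \sum_{k_b} s^{A}_{m_b,k_b}\, s^{B}_{n_b,k_b} \sum_{k \in k_b} A_{mk}\, B_{kn},
$$

where $(m_b, n_b)$ index the threadblock's output tile, $k_b$ indexes the gemm-k iteration, and $s^A$, $s^B$ are the 2D scaling tensors loaded via cp_async.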
- [Distributed GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/65_distributed_gemm/65_distributed_gemm.cu) is a new (experimental) API which can turn existing CUTLASS GEMM kernels into pipelined Tensor Parallel GEMMs that run efficiently on NVLink-based network of GPUs. Its pipelining schedules can hide most of the communication behind computation, and relies on point-to-point communication, which can simply use CUDA runtime's peer device access feature. It also utilizes remote TMA loads and memcopies with CUDA graphs to handle communication primarily through the Copy Engine, leaving all SMs free for Hopper's persistent kernels. For more details you can refer to the [DistGEMM blog post](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b).
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in [example 48](https://github.com/NVIDIA/cutlass/tree/main/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
- Potential API breaking changes:
+ Fix `cute::UniversalCopy` for type safety.
+ No longer implicitly select `cute::SM80_CP_ASYNC_*` based on input tensors. This avoids implicit downstream synchronization requirements. To use `SM80_CP_ASYNC`, users must explicitly select the appropriate CopyAtom (see the sketch after this list).
+ Fix `cute::SM80_CP_ASYNC_CACHEALWAYS`, `cute::SM80_CP_ASYNC_CACHEGLOBAL`, `cute::SM80_CP_ASYNC_CACHEALWAYS_ZFILL`, `cute::SM80_CP_ASYNC_CACHEGLOBAL_ZFILL` to avoid implicitly selecting `ZFILL` behavior on predication.
+ Remove `cute::copy_vec<T>` in favor of `cute::copy_aligned` and `cute::copy(AutoVectorizingCopyWithAssumedAlignment<NumBits>,...)`.
+ A refactor of default epilogue struct `DefaultEpilogue` [API](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/default_epilogue.hpp) to avoid reading non-void `ElementC` value for `ElementC = void` kernel.
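A minimal sketch of the now-explicit atom selection (the helper name and tensor arguments are ours; the atom and synchronization calls are CuTe's):

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// Explicitly choose the SM80 cp.async atom (16B, cache-global, no implicit
// ZFILL) instead of relying on the removed implicit selection.
using GmemCopyAtom = Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<uint128_t>, half_t>;

template <class SrcTensor, class DstTensor>
CUTE_DEVICE void gmem_to_smem(SrcTensor const& gmem_tile, DstTensor& smem_tile) {
  copy(GmemCopyAtom{}, gmem_tile, smem_tile);  // async global->shared copy
  cp_async_fence();                            // commit the cp.async group
  cp_async_wait<0>();                          // wait for all committed groups
  __syncthreads();
}
```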
- New CUTLASS profiler flags: `profiling-duration`, `min-iterations`, and `kernels-file` documented in [profiler.md](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html#cutlass-profiler).
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit versions 12.6.
## [3.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.6.0) (2024-10-03)
- [Hopper structured sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu).
+ [FP16](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu)
+ [FP8](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu)
+ [INT8](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)
+ [TF32](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)
- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
- [An improved mixed input GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
- [EVT nodes for Top-K selection and softmax](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and [GEMM example using those](https://github.com/NVIDIA/cutlass/tree/main/examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu).
- [Programmatic Dependent Launch](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/dependent_kernel_launch.html).
- [A new debugging tool, synclog](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/utilities.html#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.
- A new TMA-enabled [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings significant performance improvements, along with its EVT support.
- A SIMT-enabled pointer-array [epilogue](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp).
- A new [Ping-Pong kernel schedule for Grouped GEMM](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations.
- [A new instantiation strategy for CUTLASS profiler kernels](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html#instantiating-more-kernels-with-hopper).
- New hardware support for comparisons and computations of [`cutlass::bfloat16_t`](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/bfloat16.h).
- Fixed use of isnan on Windows for [`half_t`](https://github.com/NVIDIA/cutlass/tree/main/test/unit/core/functional.cu).
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit version 12.6.
## [3.5.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.1) (2024-07-25)
- [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial/wgmma_sm90.cu)
- [Exposure of L2 `cache_hint`s in TMA copy atoms](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/copy_sm90_tma.hpp#L48)
- Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/cpp/profiler.md#gemm), and
[example 48](https://github.com/NVIDIA/cutlass/tree/main/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- [TMA store based and EVT supported epilogues](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu).
- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_sparse_universal.h) to enable serial and parallel split-K for sparse tensor cores, and new tiny tile sizes to better support LLM inference:
+ [FP16 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16t_f16n_f32t_tensor_op_f32_sparse_sm80.cu#L269-L393) and [NT](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu#L269-L411).
+ [int8 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452).
+ [int4 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452).
+ [FP32 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f32t_f32n_f32t_tensor_op_f32_sparse_sm80.cu#L427-L642) and [NT](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sparse_sm80.cu#L427-L456).
- [CUDA host adapter](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/cuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs.
- Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/generator.py).
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution [epilogue](https://github.com/NVIDIA/cutlass/tree/main/examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of [include files throughout CUTLASS core directories](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](https://github.com/NVIDIA/cutlass/tree/main/test/self_contained_includes/CMakeLists.txt).
- [A guide for setting up VSCode to work well with CUTLASS](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/ide_setup.html) and [expanded code style guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/programming_guidelines.html).
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
## [3.5.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.0) (2024-04-09)
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + [TMA im2col](https://github.com/NVIDIA/cutlass/tree/main/include/cute/atom/copy_traits_sm90_im2col.hpp)
+ Native implementation in CUTLASS 3.x using CuTe, mirroring the [same design hierarchy as that of GEMMs](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/gemm_api_3x.html).
+ Support for 1D, 2D, and 3D convolutions in a [rank-agnostic fashion](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/convnd_problem_shape.hpp).
+ Support for [Fprop](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/fprop/sm90_conv3d_fprop_implicit_gemm_s8_s8_s32_tensorop_s32.cu), [Dgrad](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/dgrad/sm90_conv2d_dgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu), and [Wgrad](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/wgrad/sm90_conv1d_wgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu) algorithms
+ [CUTLASS profiler support](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/conv3x_emitter.py) for 2D and 3D convolutions implemented via the 3.x API.
+ NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback is welcome on the design!
- Support for [Ada (SM89) FP8 tensor cores via the 2.x API](https://github.com/NVIDIA/cutlass/tree/main/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu). Requires CUDA 12.4 or newer.
- [Ampere gather/scatter convolution example](https://github.com/NVIDIA/cutlass/tree/main/examples/59_ampere_gather_scatter_conv/README.md) in CuTe and CUTLASS 3.x
+ Showcases how custom kernels can be written and optimized using CUTLASS 3.x and CuTe, and the general strategy for implementing convolutions as specializations of GETTs.
+ Implementation of a coarse-grained sparse gather/scatter kernel achieving peak performance on Ampere-class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
+ [Ampere FP16 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16t_f16n_f16t_tensor_op_f32_sm80.cu) and [NT](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16t_f16t_tensor_op_f32_sm80.cu#L227-L301), [Ampere INT8 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s8t_tensor_op_s32_sm80.cu#L392-L1342), [Ampere INT4 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s4t_tensor_op_s32_sm80.cu#L372-L934).
+ [Turing FP16 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16t_f16n_f16t_tensor_op_f32_sm75.cu#L55-L394), [Turing INT8 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s8t_s8n_s8t_tensor_op_s32_sm75.cu#L166-L537), [Turing INT4 TN](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_s4t_s4n_s4t_tensor_op_s32_sm75.cu#L310-L564).
- Updates to CuTe documentation for [`cute::Tensor<>`](./media/docs/cpp/cute/03_tensor.md), [MMA atoms](./media/docs/cpp/cute/0t_mma_atom.md), and an overhauled [CuTe GEMM tutorial series](https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial).
- Extensions to CuTe to support [L2 prefetching](https://github.com/NVIDIA/cutlass/tree/main/include/cute/algorithm/prefetch.hpp) and [TMA store+reductions](https://github.com/NVIDIA/cutlass/tree/main/include/cute/arch/copy_sm90_tma.hpp#L1337).
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
- Fixes to greatly reduce build warnings.
- Updates and bugfixes from the community (thanks!)
## [3.4.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.4.1) (2024-02-14)
- Statically available [CUTLASS version macros](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/version.h) that allow for handling API changes between CUTLASS releases on the user's side (a usage sketch follows this list).
- Improvements for Hopper [Group-GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/57_hopper_grouped_gemm) and [Pointer-Array Batched GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/56_hopper_ptr_array_batched_gemm).
- Updates and bugfixes from the community (thanks!).
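A minimal sketch of guarding client code against API changes with these macros; the 3.4 threshold below is only an example:

```
#include "cutlass/version.h"

#if (CUTLASS_MAJOR > 3) || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 4)
  // Compile against the post-3.4 spelling of an API here.
#else
  // Fall back to the older spelling here.
#endif
```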
## [3.4.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.4.0) (2024-01-12)
* Expanded [Mixed-input Hopper GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm) support covering {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors.
* Performance improvements to [Mixed-input Hopper GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm)
* Beta release of [Pointer-Array Batched GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/56_hopper_ptr_array_batched_gemm) now available on Hopper GPUs utilizing TMA and WGMMA (requires CUDA 12.3 or above).
* Beta release of [Group-GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/57_hopper_grouped_gemm) utilizing TMA and WGMMA (requires CUDA 12.3 or above).
* [Ampere Sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/15_ampere_sparse_tensorop_gemm/ampere_sparse_tensorop_gemm_with_visitor.cu) now supports the Epilogue Visitor Tree (EVT).
* NamedBarriers usability improvements; the list of [ReservedNamedBarriers](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/barrier.h) has been officially released.
* Improved CuTe documentation including improved clarity and depth of [Quickstart](./media/docs/cpp/cute/00_quickstart.md), [CuTe Layout](./media/docs/cpp/cute/01_layout.md), and [CuTe Layout Algebra](./media/docs/cpp/cute/02_layout_algebra.md). Associated code comments, post-conditions, and details in [CuTe Core Unit Tests](./test/unit/cute/core/) also improved.
## [3.3](https://github.com/NVIDIA/cutlass/releases/tag/v3.3.0) (2023-10-31)
* [Mixed-input Hopper GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/55_hopper_mixed_dtype_gemm) support covering 16-bit x 8-bit input operand types.
* [Mixed-input Ampere GEMMs](https://github.com/NVIDIA/cutlass/pull/1084) with support for canonical layouts (TN). The implementation supports upcast on operand B ({fp16, bf16} x {s8, u8}) and upcast on operand A ({s8, u8} x {fp16, bf16}).
* [Copy-Async-based Hopper GEMMs](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_alignx_tensor_op_f32_warpspecialized_cooperative.cu), which support input tensors with less than 16B alignment.
* Kernel schedules and Builder support for mixed precision and Copy Async GEMMs with < 16B aligned input tensors.
* Profiler support for lower-aligned Hopper GEMMs.
* Performance Improvements to [Scatter-Gather Hopper Example](https://github.com/NVIDIA/cutlass/tree/main/examples/52_hopper_gather_scatter_fusion).
* Sub-Byte type fixes and improvements.
* EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See [SM90 EVT fusions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp) for details.
* Fusion support for backprop fusions including drelu, dgelu, and dbias.
* Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface
## [3.2.2](https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2) (2023-10-25)
* Minor patch for [issue 1138](https://github.com/NVIDIA/cutlass/issues/1138).
## [3.2.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1) (2023-09-22)
* Python support for the SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
* SM80 EVT support in C++ and Python.
* Other SM90 epilogue improvements.
* Splitting the CUTLASS library into smaller units based on operation, arch, and data types. See [discussion 1105](https://github.com/NVIDIA/cutlass/discussions/1105) for details.
* Making `tools/library/scripts` packageable: `tools/library/scripts` is now moving to `python/cutlass_library`. See the Python [README](https://github.com/NVIDIA/cutlass/tree/main/python/README.md) for details.
* SM90 TF32 kernel improvements for all layouts.
* SM90 rasterization direction support in the CUTLASS profiler.
* Improvement for CUTLASS profiler build times.
* Remove Python-C++ bindings.
## [3.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.2.0) (2023-08-03)
* New warp-specialized persistent FP8 GEMM kernel [kernel schedules](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp) and [mainloops](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp) targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters, plus an example showcasing [Hopper warp-specialized FP8 GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/54_hopper_fp8_warp_specialized_gemm). FP8 GEMMs come with a fast accumulation mode; when enabled, execution might be faster at the cost of lower accuracy, because intermediate results are not periodically promoted to a higher precision (a schedule-selection sketch appears at the end of this section).
* New [Epilogue Visitor Tree (EVT)](https://github.com/NVIDIA/cutlass/tree/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu) support for Hopper TMA epilogues. EVTs allow for user-defined customized epilogue fusion patterns without having to write a new epilogue.
* [Stream-K](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp) feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
* Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the [Tile Scheduler](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp).
* Improved performance for [warp-specialized TensorFloat-32 (TF32) GEMM kernels](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_tf32_tf32_f32_tensor_op_f32_gmma_rs_cluster_warpspecialized.cu) targeting Hopper TMA.
* [Hopper GEMM+Permute](https://github.com/NVIDIA/cutlass/tree/main/examples/53_hopper_gemm_permute/53_hopper_gemm_permute.cu), an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
* New CUTLASS 2D Convolution Python interface. New [example](https://github.com/NVIDIA/cutlass/tree/main/examples/python/03_basic_conv2d.ipynb) here.
* Support for Windows (MSVC) builds. Tested with Visual Studio 2019 v16.11.27 on Windows 10.0.
* Optimal performance using [**CUDA 12.2u1**](https://developer.nvidia.com/cuda-downloads)
* Updates and bugfixes from the community (thanks!)
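A minimal sketch (not a complete kernel) of opting in to the FP8 fast accumulation mode through the collective builder; the element types, layouts, alignments, tile/cluster shapes, and stage count below are illustrative assumptions:

```
#include "cutlass/arch/arch.h"
#include "cutlass/float8.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::float_e4m3_t, cutlass::layout::RowMajor,    16,  // A: FP8, K-major
    cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, 16,  // B: FP8, K-major
    float,                                                    // accumulator type
    cute::Shape<cute::_128, cute::_128, cute::_128>,          // CTA tile (M,N,K)
    cute::Shape<cute::_1, cute::_1, cute::_1>,                // cluster shape
    cutlass::gemm::collective::StageCountAuto,
    // The fast-accum schedule skips periodic promotion of intermediate results:
    cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum
  >::CollectiveOp;
```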
## [3.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.1.0) (2023-04-14)
* New CUTLASS Python interface that aims to provide an easy-to-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details [here](https://github.com/NVIDIA/cutlass/tree/main/python/README.md) and new [examples](https://github.com/NVIDIA/cutlass/tree/main/examples/python).
* New [efficient epilogues](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_cluster_warpspecialized_cooperative.cu#L783) using TMA for Hopper.
* Support for [fused epilogues](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_cluster_warpspecialized_cooperative_bias_elementwise.cu), such as Bias, ReLU, and GELU, using the new efficient epilogues.
* New [warp-specialized TensorFloat-32 (TF32) GEMM kernels](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/sm90_gemm_tf32_tf32_f32_tensor_op_f32_gmma_rs_cluster_warpspecialized.cu) targeting Hopper TMA.
* New [*warp-specialized persistent cooperative*](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp) kernel design that allows for larger tile sizes and improves performance on Hopper.
* An [example](https://github.com/NVIDIA/cutlass/tree/main/examples/51_hopper_gett) showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
* Epilogue builders. Similar to mainloop builders (see [example 49](https://github.com/NVIDIA/cutlass/tree/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu)), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
* Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
* Performance optimizations for the [*warp-specialized persistent ping-pong*](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp) kernel.
* Changes to the [GEMM API 3.x](./media/docs/cpp/gemm_api_3x.md), involving the host-facing arguments and the underlying `Params` structs.
* [FMHA Backward Pass](https://github.com/NVIDIA/cutlass/tree/main/examples/41_fused_multi_head_attention/fused_multi_head_attention_backward.cu) from Meta xFormers.
* [Streamk GEMM with Broadcast](https://github.com/NVIDIA/cutlass/tree/main/examples/47_ampere_gemm_universal_streamk/ampere_gemm_universal_streamk_broadcast.cu) enables epilogue broadcast with StreamK GEMM.
* [Batched B2B GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion) can now run multiple back-to-back GEMMs with the same problem size in parallel.
* [Batched Strided GEMV](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemv.cu) supports both row-major and column-major input matrices.
* [Permute + GEMM fusion](https://github.com/NVIDIA/cutlass/tree/main/examples/39_gemm_permute) can now fuse a permutation with the following GEMM. Previously, only fusing GEMM with a permutation in the epilogue was supported.
* [Row Broadcast](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator_row_broadcast.h) can be fused in the epilogue.
* The GitHub branch is renamed from `master` to `main` in this release.
* Optimal performance using [**CUDA 12.1**](https://developer.nvidia.com/cuda-downloads)
* Updates and bugfixes from the community (thanks!)
## [3.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.0.0) (2023-01-23)
* [CuTe](./media/docs/cpp/cute/00_quickstart.md), a [new core library and backend](./include/cute) for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts, providing a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors (a minimal Layout sketch follows this section).
* [A new conceptual operation hierarchy](./media/docs/cpp/cutlass_3x_design.md) that replaces the architecture-centric hierarchy of CUTLASS 2.x and [documentation for CUTLASS 3.0's GEMM API changes](./media/docs/cpp/gemm_api_3x.md).
* Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same [`device::GemmUniversalAdapter`](./include/cutlass/gemm/device/gemm_universal_adapter.h) and [`kernel::GemmUniversal`](./include/cutlass/gemm/kernel/gemm_universal.hpp) types, allowing users to include both APIs in the same translation units. More information can be found in the [3.x backwards compatibility section](./media/docs/cpp/cutlass_3x_backwards_compatibility.md).
* Updates to [Functionality](./media/docs/cpp/functionality.md), which documents which kernels are supported via CUTLASS-2 and CUTLASS-3.
* Updates to [Compatibility](./README.md#compatibility) Section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures and [Target Architecture](./README.md#target-architecture).
* New warp-specialized GEMM [kernel schedules](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp) and [mainloops](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp) targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
* Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
* [CUTLASS library integration](https://github.com/NVIDIA/cutlass/tree/main/tools/library/src/gemm_operation_3x.hpp) for 3.x API kernels built through the new `CollectiveBuilder` API, enabling CUTLASS profiler.
* Support for [Hopper GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/48_hopper_warp_specialized_gemm) through the new 3.0 API with CuTe-based exposure of the Hopper [Tensor Memory Accelerator](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor) and [WGMMA Tensor Core](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) features.
* Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples [48](https://github.com/NVIDIA/cutlass/tree/main/examples/48_hopper_warp_specialized_gemm), [49](https://github.com/NVIDIA/cutlass/tree/main/examples/49_hopper_gemm_schedules_with_collective_builder), and [50](https://github.com/NVIDIA/cutlass/tree/main/examples/50_hopper_gemm_with_epilogue_swizzle).
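To give a quick flavor of the Layout vocabulary type, here is a minimal host-side sketch; the shape and strides are arbitrary examples:

```
#include <cassert>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;
  // A 4 x 8 row-major layout: shape (4,8) with strides (8,1).
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<8>{}, Int<1>{}));
  // A Layout is a function from logical coordinates to linear offsets:
  // (1,2) -> 1*8 + 2*1 = 10.
  assert(layout(make_coord(1, 2)) == 10);
  assert(size(layout) == 32);  // total number of elements
  return 0;
}
```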
# CUTLASS 2.x
## [2.11.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.11.0) (2022-11-19)
* [Stream-K](https://github.com/NVIDIA/cutlass/tree/main/examples/47_ampere_gemm_universal_streamk), which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
* [Fused multi-head attention Kernel](https://github.com/NVIDIA/cutlass/tree/main/examples/41_fused_multi_head_attention). It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel.
* [Dual GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/45_dual_gemm), which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
* Hopper improves [double precision matrix multiplication](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f64n_f64t_f64t_tensor_op_f64_sm90.cu) by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
* [BLAS3](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/hemm_cf64_cf64_cf64_tensor_op_f64_sm90.cu) functions with Hopper's new double-precision matrix multiplication instructions.
* [ELL Block Sparse GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/43_ell_block_sparse_gemm), which uses an [ELL matrix](https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/) to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.
* Optimized [Group Conv](https://github.com/NVIDIA/cutlass/tree/main/examples/42_ampere_tensorop_group_conv) for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.
* [Optimized DepthWise Conv](https://github.com/NVIDIA/cutlass/tree/main/examples/46_depthwise_simt_conv2dfprop/depthwise_simt_conv2dfprop.cu). Two new modes are added:
* [kOptimized](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - uses direct convolution to compute instead of implicit GEMM.
* The restrictions are: 1) input channel, output channel, and group count must be multiples of (128 / sizeof(input element)), e.g. 64 for fp16; 2) the input filter size must match the template parameter configuration.
* [kFixedStrideDilation](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/depthwise_conv2d_fprop_direct_conv_fixed_stride_dilation_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
* The restrictions are: 1) input channel, output channel, and group count must be multiples of (128 / sizeof(input element)); 2) input filter size, stride, and dilation must match the template parameter configuration.
* [Scripts](https://github.com/NVIDIA/cutlass/tree/main/examples/44_multi_gemm_ir_and_codegen) to fuse multiple back-to-back GEMMs. The implementation was discussed in a GTC'22 Spring [talk](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41606/).
* [FP8 data type definition](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/float8.h) and [conversion routines](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/numeric_conversion.h#L1274-2115) (a usage sketch appears at the end of this section).
* Updates and bugfixes from the community (thanks!). Big shout out to Meta's [xFormers](https://github.com/facebookresearch/xformers).
* **Deprecation announcement:** CUTLASS plans to deprecate the following:
* Maxwell and Pascal GPU architectures
* Ubuntu 16.04
* CUDA 10.2
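A minimal host-side sketch of the new FP8 types and converters; the default rounding behavior of the converter is an assumption here:

```
#include "cutlass/float8.h"
#include "cutlass/numeric_conversion.h"

int main() {
  float x = 0.1234f;
  cutlass::float_e4m3_t a(x);   // 4 exponent bits, 3 mantissa bits
  cutlass::float_e5m2_t b(x);   // 5 exponent bits, 2 mantissa bits

  // Generic converter from the conversion-routines header.
  cutlass::NumericConverter<cutlass::float_e4m3_t, float> to_e4m3;
  cutlass::float_e4m3_t c = to_e4m3(x);

  float widened = float(a);     // convert back to fp32
  (void)b; (void)c; (void)widened;
  return 0;
}
```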
## [2.10.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.10.0) (2022-08-23)
* [CUTLASS Python](https://github.com/NVIDIA/cutlass/tree/main/examples/40_cutlass_py) now supports GEMM, CONV, Group GEMM for different data types as well as different epilogue flavours.
* Optimizations for CUTLASS's [Grouped GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/24_gemm_grouped/gemm_grouped.cu) kernel. The threadblock scheduling is improved, and some computation can be moved to the host side where applicable. [Grouped Syr2k](https://github.com/NVIDIA/cutlass/tree/main/examples/38_syr2k_grouped/syr2k_grouped.cu) kernels are also added.
* Optimizations for [GEMM+Softmax](https://github.com/NVIDIA/cutlass/tree/main/examples/35_gemm_softmax). All of the reduction computation is fused into the previous GEMM, and more template arguments are provided to fine-tune performance.
* [Grouped GEMM for Multihead Attention](https://github.com/NVIDIA/cutlass/tree/main/examples/41_multi_head_attention). This general group-GEMM-based MHA does not require the sequence lengths of all GEMMs to be the same, which makes it especially useful for natural language processing.
* [GEMM + Layer norm fusion for Ampere](https://github.com/NVIDIA/cutlass/tree/main/examples/37_gemm_layernorm_gemm_fusion/) splits the layernorm into two parts, which can be fused into the preceding and following GEMMs separately. In addition to using the square sum to compute the layernorm variance, [Shift-K](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Computing_shifted_data) is provided in case the square sum raises numerical issues.
* [GEMM Epilogue Permutation Fusion](https://github.com/NVIDIA/cutlass/tree/main/examples/39_gemm_permute) can apply user provided permutation layout mapping in the GEMM epilogue.
* [Grouped convolution targeting implicit GEMM](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/group_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) introduces the first group convolution implementation to CUTLASS. It is an analytical implementation, not an optimized one. The restrictions are: 1) input and output channel counts must be multiples of the group count; 2) split-K is not supported. The implementation has two modes:
* kSingleGroup: output channels per group are a multiple of threadblock tile N.
* kMultipleGroup: threadblock tile N is a multiple of output channels per group.
* [Depthwise separable convolution](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) introduces the first depthwise convolution, which is also analytical for now. The restrictions are: 1) SIMT only; 2) no split-K; 3) input channels, output channels, and group count are all equal.
* Standalone [Layernorm](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/device_layernorm.h) and [Pooling](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util/device_nhwc_pooling.h) kernels.
* [Back-to-back GEMM/CONV](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion) relaxes the requirement that the first GEMM's K dimension be a multiple of the threadblock tile K dimension.
* Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
* Updates and bugfixes from the community (thanks!)
## [2.9.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.9.0) (2022-04-21)
* [First layer Convolution kernels](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_fixed_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) specialized for small channel counts and reduced alignment
* [Few channels](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h) specialization for reduced alignment capabilities
* [Fixed channels](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h) further specialized when channel count perfectly matches the access vector size
* [Unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu)
* [Python-based instance emitter](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/generator.py) in the CUTLASS Library and support in the Profiler
* [BLAS3](https://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference) operators accelerated by Tensor Cores
* Supported types: f32, cf32, f64, cf64, tf32x3, complex tf32x3
* [HERK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu) with [emitter](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/rank_k_operation.py)
* [SYRK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu) with [emitter](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/rank_k_operation.py)
* [SYMM](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu) with [emitter](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/symm_operation.py)
* [TRMM](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/trmm_f32n_f32t_f32t_tensor_op_fast_f32_ls_sm80.cu) with [emitter](https://github.com/NVIDIA/cutlass/tree/main/python/cutlass_library/trmm_operation.py)
* [Unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/testbed_rank_k_universal.h)
* [CUTLASS Python](https://github.com/NVIDIA/cutlass/tree/main/examples/40_cutlass_py) demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using [CUDA Python](https://developer.nvidia.com/cuda-python)
* [Python-based runtime](https://github.com/NVIDIA/cutlass/tree/main/tools/library/scripts/rt.py) interoperable with existing emitters
* [GEMM + Softmax example](https://github.com/NVIDIA/cutlass/tree/main/examples/35_gemm_softmax)
* [Gather and Scatter Fusion with GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/36_gather_scatter_fusion) can gather inputs and scatter outputs based on index vectors within the same GEMM kernel.
* It can select random rows in a row major matrix.
* It can select random columns in a column major matrix.
* [Back-to-back GEMM/CONV](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion) fully supports buffering the first GEMM/CONV's results in shared memory for the second one to use. This can eliminate register spills when the tile size is large. Additionally, bias vector add is supported in the first GEMM/CONV.
* Supported kernels: GEMM and CONV.
* Supported types: fp16 and int8.
* Supported architectures: Turing and Ampere.
* [Transposed Convolution](https://github.com/NVIDIA/cutlass/tree/main/examples/34_transposed_conv2d) (a.k.a. deconvolution) support, which reuses the Dgrad implementation.
* [Utility functions](https://github.com/NVIDIA/cutlass/tree/main/tools/util/include/cutlass/util) that can pad NHWC and convert between NCHW and NHWC.
* [Small-alignment implicit GEMM](https://github.com/NVIDIA/cutlass/issues/242) support for Fprop/Dgrad/Wgrad, so padding is no longer required to use tensor cores in these kernels.
* Epilogue enhancement:
* Eliminate bank conflicts in int8 tensor core kernels.
* Half2 usage if epilogue compute type is fp16.
* More activation functions: Silu, Hardswish, Leaky Relu.
* New elementwise fusion pattern for [residual block](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/linear_combination_residual_block.h).
* [Group GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/24_gemm_grouped) threadblock count calculation fix, which helps launch the intended number of threadblocks to fully occupy the GPU.
* [Parallel GEMM splitk](https://github.com/NVIDIA/cutlass/pull/277) support in the CUTLASS profiler.
* Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
* Updates and bugfixes from the community (thanks!)
## [2.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.8.0) (2021-11-19)
* **TF32x3:** emulated single-precision using Tensor Cores (a sketch of the operand decomposition follows the items below)
* 45+ TFLOPs on NVIDIA A100
* [GEMM SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu) (real)
* [COMPLEX GEMM SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/29_3xtf32_complex_gemm.cu) (complex)
* [Implicit GEMM Convolution SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/28_ampere_3xtf32_fast_accurate_tensorop_fprop/ampere_3xtf32_fast_accurate_tensorop_fprop.cu)
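The idea behind TF32x3, as a minimal host-side sketch with arbitrary operand values: each fp32 operand is split into a TF32 "big" part and a residual "small" part, and three TF32-precision products approximate the fp32 product:

```
#include "cutlass/tfloat32.h"

int main() {
  float a = 1.000123f, b = 0.999877f;
  cutlass::tfloat32_t a_big(a), b_big(b);    // rounded to TF32
  float a_small = a - float(a_big);          // residual beyond TF32 precision
  float b_small = b - float(b_big);
  // Three products; the tiny a_small * b_small term is dropped.
  float prod = float(a_big) * float(b_big)
             + float(a_big) * b_small
             + a_small * float(b_big);
  (void)prod;
  return 0;
}
```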
* **Mainloop fusion for Convolution:** convolution with fused per-channel scale-bias-relu
* [Conv Fprop SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/25_ampere_fprop_mainloop_fusion/ampere_fprop_mainloop_fusion.cu)
* [Conv WGrad SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu)
* [cutlass::conv::device::ImplicitGemmConvolutionFusion](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h)
* **Grouped GEMM:** similar to batched GEMM with distinct problem size per group
* [SDK example](https://github.com/NVIDIA/cutlass/tree/main/examples/24_gemm_grouped) with performance comparison with Batched Strided GEMM
* [cutlass::gemm::device::GemmGrouped](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/gemm/device/gemm_grouped.h)
* [Implicit GEMM Convolution fusion](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion/) supports staging the first convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure.
* Optimal performance using [**CUDA 11.5**](https://developer.nvidia.com/cuda-downloads)
* Updates from the community (thanks!)
* **Deprecation announcement:** CUTLASS plans to deprecate the following:
* Maxwell and Pascal GPU architectures
* Ubuntu 16.04
* CUDA 10.2
## [2.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.7.0) (2021-09-24)
* Mainloop fusion for GEMM: [summation over A or B](https://github.com/NVIDIA/cutlass/tree/main/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
* [Strided DGRAD (optimized iterators)](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
* [Half-precision GELU_taylor activation functions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/activation.h#L196)
* Use these when accumulation and epilogue compute types are all `cutlass::half_t`
* Tuning and bug fixes to [fused GEMM + GEMM example](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion/)
* Support for smaller than 128b aligned Convolutions: [see examples](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
* Caching of results to accelerate Convolution [unit tests](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/cache_testbed_output.h)
* Can be enabled or disabled by running `cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF`
* Corrections and bug fixes reported by the CUTLASS community
* Thank you for filing these issues!
## [2.6.1](https://github.com/NVIDIA/cutlass/releases/tag/v2.6.1) (2021-09-03)
* Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
* Tuning for GEMMs fused with partial reductions
* Corrections and bug fixes reported by the CUTLASS community
* Thank you for filing these issues!
## [2.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.6.0) (2021-07-22)
* Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
* Adopt the new L2 prefetch feature in [cp.async](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/memory.h) and [global load](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/arch/memory_sm80.h)
* Fused operators with GEMM and Convolution
* [Fused broadcast in epilogue](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu)
* [Fused partial reduction in epilogue](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu)
* 64b tensor strides and leading dimensions support for GEMMs
* Affine rank=2 matrix layouts
* Row stride and column stride for matrices using [cutlass::layout::AffineRank2](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/layout/matrix.h)
* Support [FP64 tensor core](https://github.com/NVIDIA/cutlass/tree/main/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) and SIMT GEMM.
* [Batched GEMV](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemv.cu) preview implementation
* [New strided Dgrad](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation
* Accelerates over the previous implementation by cutting down redundant math by 4x
* Supports the new `Dy` and `w` analytic iterators through the existing `cutlass::conv::device::ImplicitGemmConvolution` interface
* Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
* Updates to [quaternion.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/quaternion.h) and [functional.h](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/functional.h)
* SDK Example for [GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](https://github.com/NVIDIA/cutlass/tree/main/examples/22_quaternion_conv/quaternion_conv.cu)
* [Unit tests for GEMM](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/simt_qgemm_nn_sm50.cu) and [Convolution](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv2d_fprop_implicit_gemm_qf32nhwc_qf32nhwc_qf32nhwc_simt_f32_sm50.cu)
* Many improvements to the epilogue.
* Provide an [option](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/threadblock/epilogue.h) to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
* Performance improvement for FP16 tensor core kernels
* Bug fixes
* Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels targeting Pascal and Ampere.
* Updated minimum CUDA Toolkit requirement to 10.2
* [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit) recommended
* Corrections and bug fixes reported by the CUTLASS community
* Thank you for filing these issues!
## [2.5.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.5.0) (2021-02-26)
* Tensor reductions
* _m_-to-_n_ reductions of tensors with affine layout
* [Specializations](https://github.com/NVIDIA/cutlass/tree/main/test/unit/reduction/device/tensor_reduce_contiguous.cu) for reductions including contiguous dimension
* [Specializations](https://github.com/NVIDIA/cutlass/tree/main/test/unit/reduction/device/tensor_reduce_strided.cu) for reductions excluding contiguous dimension
* Custom reduction functors such as `cutlass::logical_and`
* Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
* Optimizations for 3-D convolution
* [Optimized tile iterators](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h) using precomputed delta table for 3-D convolution
* Full coverage of [forward](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv3d_fprop_implicit_gemm_f16ndhwc_f16ndhwc_f32ndhwc_tensor_op_f32_sm80.cu) and [backwards](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device/conv3d_dgrad_implicit_gemm_f16ndhwc_f16ndhwc_f32ndhwc_tensor_op_f32_sm80.cu) passes for 3D convolution
* [Fused Convolution+Convolution example](https://github.com/NVIDIA/cutlass/tree/main/examples/13_two_tensor_op_fusion/README.md)
* Corrections and bug fixes reported by the CUTLASS community
* Thank you for filing these issues!
## [2.4.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.4.0) (2020-11-19)
* Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
* Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
* Data type: FP32, complex<FP32>, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
* Spatial dimensions: 1-D, 2-D, and 3-D
* Layout: NHWC, NCxHWx
* Implicit GEMM convolution components:
* Global memory iterators supporting Fprop, Dgrad, and Wgrad
* `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture
* `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures
* [Documentation](./media/docs/cpp/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
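For orientation: implicit GEMM performs the convolution as a GEMM without materializing the im2col matrix. For Fprop over NHWC activations of extent (N, H, W, C) with K filters of extent (R, S, C) and output of extent (N, P, Q, K), the equivalent GEMM has M = N * P * Q, N = K, and K = C * R * S; see the documentation linked above for the full derivation.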
## [2.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.3.0) (2020-09-23)
* [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
* [Sparse Tensor Core GEMM kernels](https://github.com/NVIDIA/cutlass/tree/main/test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu):
* Direct access to Sparse Tensor Cores and maximum performance via [`mma.sp.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends)
* Fast SGEMM targeting GeForce RTX 30-series CUDA Cores
* Minor Features:
* [Activation functions](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/activation.h) such as [GeLU](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/linear_combination_gelu.h) and [Sigmoid](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/epilogue/thread/linear_combination_sigmoid.h)
* Small [matrix](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/matrix.h) and [quaternion](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/quaternion.h) template classes in device code
* [Floating-point constants](https://github.com/NVIDIA/cutlass/tree/main/include/cutlass/constants.h)
* NVIDIA Ampere GPU Architecture examples and documentation:
* [Tensor Float 32](https://github.com/NVIDIA/cutlass/tree/main/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu) and
* [Sparse Tensor Cores](https://github.com/NVIDIA/cutlass/tree/main/examples/15_ampere_sparse_tensorop_gemm/ampere_sparse_tensorop_gemm.cu)
* Documentation added on CUTLASS [efficient row-major epilogue](./media/docs/cpp/gemm_api.md#efficient-epilogue)
## [2.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.2.0) (2020-06-08)
* [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
* Fast Tensor Core operations:
* Maximum performance via [`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma-and-friends)
* Tensor Float 32, BFloat16, and double-precision data types
* Mixed integer data types (int8, int4, bin1)
* Asynchronous copy for deep software pipelines via [`cp.async`](https://docs.nvidia.com/cuda/parallel-thread-execution)
* Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745) (free registration required)
* Features:
* SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
* Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
* Gaussian complex GEMMs using the 3m complex multiply algorithm (three real multiplies per complex product: from ac, bd, and (a+b)(c+d), the real part is ac - bd and the imaginary part is (a+b)(c+d) - ac - bd)
* Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
* Policy updates:
* [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) needed to enable NVIDIA Ampere Architecture features
* Disabled F16C by default for compatibility - enable on cmake command line with `-DCUTLASS_ENABLE_F16C=ON`
## [2.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.1.0) (2020-04-06)
* BLAS-style host-side API added to [CUTLASS Library](./media/docs/cpp/quickstart.md#cutlass-library)
* API to launch compiled kernel instances for GEMM and planar complex GEMM
* Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
* Computes complex matrix products on matrices stored as disjoint real and imaginary parts
* [SDK Examples of Planar Complex GEMMs](https://github.com/NVIDIA/cutlass/tree/main/examples/10_planar_complex/planar_complex.cu)
* Minor enhancements and bug fixes
## [2.0.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.0.0) (2019-11-19)
* Substantially refactored for
* Better performance, particularly for native Turing Tensor Cores
* Robust and durable templates spanning the design space
* Encapsulated functionality embodying modern C++11 programming techniques
* Optimized containers and data types for efficient, generic, portable device code
* Updates to:
* [Quick start guide](./media/docs/cpp/quickstart.md)
* [Documentation](./README.md#documentation)
* [Utilities](./media/docs/cpp/utilities.md)
* [CUTLASS Profiler](./media/docs/cpp/profiler.md)
* Native Turing Tensor Cores
* Efficient GEMM kernels targeting Turing Tensor Cores
* Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
* Coverage of existing CUTLASS functionality
* GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
* Volta Tensor Cores through native mma.sync and through WMMA API
* Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
* Batched GEMM operations
* Complex-valued GEMMs
* **Note: a host compiler supporting C++11 or greater is required.**
# CUTLASS 1.x
## [1.3.2](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.2) (2019-07-09)
* Performance improvement for Volta Tensor Cores TN and TT layouts.
## [1.3.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.1) (2019-04-09)
* Corrected NVRTC unit tests.
## [1.3.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.3.0) (2019-03-20)
* Efficient GEMM kernel targeting Volta Tensor Cores via `mma.sync` instruction added in CUDA 10.1.
## [1.2.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.2.0) (2018-10-26)
* Parallelized reductions across threadblocks ("Split-K")
* Improved IGEMM performance
* Batched strided WMMA GEMMs
## [1.1.0](https://github.com/NVIDIA/cutlass/releases/tag/v1.1.0) (2018-09-19)
* Turing Features
* WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit
* Batched Strided GEMM
* Threadblock rasterization strategies
* Improved performance for adverse problem sizes and data layouts
* Extended CUTLASS core components
* Tensor views support arbitrary matrix and tensor layouts
* Zip iterators for structuring multiple data streams
* Enhanced CUTLASS utilities
* Reference code for tensor operations in host and device code
* Added HostMatrix<> for simplified matrix creation
* Examples
* Basic GEMM, tensor views, CUTLASS utilities, batched GEMM, WMMA GEMM
## [1.0.1](https://github.com/NVIDIA/cutlass/releases/tag/v1.0.1) (2018-06-11)
* Intra-threadblock reduction added for small threadblock tile sizes
* sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
* igemm_32x32x128
* GEMM _K_ residue handled during prologue prior to mainloop
* Replaced Google Test copy with a submodule. Use `git submodule update --init --recursive`
## [1.0.0](https://github.com/NVIDIA/cutlass/commit/2028ebe120aab22bfd0b2baf8902d4c9627eb33f) (2018-05-16)
* Substantial rewrite to accommodate new architecture
* Kernels: SGEMM, DGEMM, IGEMM, HGEMM, WMMA GEMM
* Unit and performance tests
## [0.0.1](https://github.com/NVIDIA/cutlass/commit/d08ba8ac46e2fa3f745e070c390182edb56b2e91) (2017-12-04)
* Initial release
## Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

CITATION.cff Normal file
@@ -0,0 +1,116 @@
cff-version: 1.2.0
title: CUTLASS
message: >-
If you use this software, please cite using the
following metadata.
type: software
authors:
- given-names: Vijay
family-names: Thakkar
email: vithakkar@nvidia.com
affiliation: NVIDIA
- given-names: Pradeep
family-names: Ramani
email: prramani@nvidia.com
affiliation: NVIDIA
- given-names: Cris
family-names: Cecka
email: ccecka@nvidia.com
affiliation: NVIDIA
- given-names: Aniket
family-names: Shivam
email: ashivam@nvidia.com
affiliation: NVIDIA
- given-names: Honghao
family-names: Lu
email: honghaol@nvidia.com
affiliation: NVIDIA
- given-names: Ethan
family-names: Yan
email: etyan@nvidia.com
affiliation: NVIDIA
- given-names: Jack
family-names: Kosaian
email: jkosaian@nvidia.com
affiliation: NVIDIA
- given-names: Mark
family-names: Hoemmen
email: mhoemmen@nvidia.com
affiliation: NVIDIA
- given-names: Haicheng
family-names: Wu
email: haichengw@nvidia.com
affiliation: NVIDIA
- given-names: Andrew
family-names: Kerr
email: akerr@nvidia.com
affiliation: NVIDIA
- given-names: Matt
family-names: Nicely
email: mnicely@nvidia.com
affiliation: NVIDIA
- given-names: Duane
family-names: Merrill
email: dumerrill@nvidia.com
affiliation: NVIDIA
- given-names: Dustyn
family-names: Blasig
email: dblasig@nvidia.com
affiliation: NVIDIA
- given-names: Aditya
family-names: Atluri
email: aatluri@nvidia.com
affiliation: NVIDIA
- given-names: Fengqi
family-names: Qiao
email: fqiao@nvidia.com
affiliation: NVIDIA
- given-names: Piotr
family-names: Majcher
email: pmajcher@nvidia.com
affiliation: NVIDIA
- given-names: Paul
family-names: Springer
email: pspringer@nvidia.com
affiliation: NVIDIA
- given-names: Markus
family-names: Hohnerbach
affiliation: NVIDIA
email: mhohnerbach@nvidia.com
- given-names: Jin
family-names: Wang
email: jinw@nvidia.com
affiliation: NVIDIA
- given-names: Manish
family-names: Gupta
affiliation: Google
email: manigupta@google.com
repository-code: 'https://github.com/NVIDIA/cutlass'
abstract: >-
CUTLASS is a collection of CUDA C++ template
abstractions for implementing high-performance
matrix-multiplication (GEMM) and related
computations at all levels and scales within CUDA.
It incorporates strategies for hierarchical
decomposition and data movement similar to those
used to implement cuBLAS and cuDNN. CUTLASS
decomposes these "moving parts" into reusable,
modular software components abstracted by C++
template classes. These thread-wide, warp-wide,
block-wide, and device-wide primitives can be
specialized and tuned via custom tiling sizes, data
types, and other algorithmic policy. The resulting
flexibility simplifies their use as building blocks
within custom kernels and applications.
keywords:
- 'cutlass, tensor cores, cuda, cute, nvidia, gpu, linear algebra, matrix computations'
license: BSD-3-Clause
license-url: https://github.com/NVIDIA/cutlass/blob/v3.0.0/LICENSE.txt
version: '3.0.0'
date-released: '2023-01-23'
identifiers:
- type: url
value: "https://github.com/NVIDIA/cutlass/tree/v3.0.0"
description: The GitHub release URL of tag 3.0.0

@@ -1,26 +0,0 @@
# A small utility function which generates a C-header from an input file
function(FILE_TO_C_STRING FILENAME VARIABLE_NAME OUTPUT_STRING ZERO_TERMINATED)
FILE(READ "${FILENAME}" HEX_INPUT HEX)
if (${ZERO_TERMINATED})
string(APPEND HEX_INPUT "00")
endif()
string(REGEX REPLACE "(....)" "\\1\n" HEX_OUTPUT ${HEX_INPUT})
string(REGEX REPLACE "([0-9a-f][0-9a-f])" "0x\\1," HEX_OUTPUT ${HEX_OUTPUT})
set(HEX_OUTPUT "static char const ${VARIABLE_NAME}[] = {\n ${HEX_OUTPUT}\n};\n")
set(${OUTPUT_STRING} "${HEX_OUTPUT}" PARENT_SCOPE)
endfunction()
message("Create header file for ${FILE_IN}")
message("Create header file for ${FILE_OUT}")
file_to_c_string(${FILE_IN} ${VARIABLE_NAME} OUTPUT_STRING ZERO_TERMINATED)
set(RESULT "#pragma once\n")
string(APPEND RESULT "namespace cutlass {\n")
string(APPEND RESULT "namespace nvrtc {\n")
string(APPEND RESULT "${OUTPUT_STRING}")
string(APPEND RESULT "} // namespace nvrtc\n")
string(APPEND RESULT "} // namespace cutlass\n")
file(WRITE "${FILE_OUT}" "${RESULT}")
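# A hypothetical invocation sketch (variable values and script name are
# illustrative, not taken from the repository): since this file reads FILE_IN,
# FILE_OUT, VARIABLE_NAME, and ZERO_TERMINATED, it would typically run in
# CMake script mode, e.g.:
#
#   cmake -DFILE_IN=kernel.cu -DFILE_OUT=kernel.cu.h \
#         -DVARIABLE_NAME=kKernelSource -DZERO_TERMINATED=ON \
#         -P <this-script>.cmake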

1273
CMakeLists.txt Normal file → Executable file

File diff suppressed because it is too large

203
CONTRIBUTORS.md Normal file

@@ -0,0 +1,203 @@
![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS")
[README](./README.md#documentation) > **Contributors**
# CUTLASS C++ Developers **
Andrew Kerr<br />
Paul Springer<br />
Dustyn Blasig<br />
Albert Xu<br />
Junkai Wu<br />
Xiuxia Zhang<br />
Haicheng Wu<br />
Jack Yang<br />
Pradeep Ramani<br />
Aditya Atluri<br />
Han Li<br />
Nick Zhao<br />
Ivan Yin<br />
Yu-Jung Chen<br />
Markus Hoehnerbach<br />
Honghao Lu<br />
Mihir Awatramani<br />
Hao Sheng<br />
Zekun Fan<br />
Aniket Shivam<br />
Siyu Liu<br />
Richard Cai<br />
Vikas Gupta<br />
Ethan Yan<br />
Vijay Thakkar<br />
Cris Cecka<br />
Lawrence Ryan<br />
Qun Song<br />
Daniel Ricketts<br />
dePaul Miller<br />
Yuhan Li<br />
Saman Ashkiani<br />
Jack Chen<br />
Shang Zhang<br />
Petrick Liu<br />
Questa Wang<br />
Pramod Shenoy<br />
Jack Kosaian<br />
Yujia Zhai<br />
Zhaodong Chen<br />
Manas Sahni<br />
Shunfan Shao<br />
Fengqi Qiao<br />
Serif Yesil<br />
Aragorn Guan<br />
Heidi He<br />
Xiao Song<br />
Sergey Klevtsov<br />
Jiang Shao<br />
Ruqing Xu<br />
Mengyu Guo<br />
Tao Xie<br />
Linfeng Zheng<br />
Harrison Barclay<br />
Wenfei Tang<br />
Diksha Gohlyan<br />
Alexander Zhurkevich<br />
Siyuan Fu<br />
Hua Huang<br />
Xiufan Liang<br />
Ian Tramble<br />
Ali Hassani<br />
Shreya Gaur<br />
** _The list is sorted in order of the author's first contribution to the CUTLASS project._
# CUTLASS DSL Developers ***
Albert Di<br />
Albert Xu<br />
Anakin Zheng<br />
Arvin Jou<br />
Brandon Sun<br />
Chenyang Xu<br />
Chunyu Wang<br />
Cris Cecka<br />
dePaul Miller<br />
Edward Cao<br />
Fung Xie<br />
Guray Ozen<br />
Hao Hu<br />
Hong Wang<br />
Jeremy Furtek<br />
Jie Fang <br />
JingZe Cui<br />
Kihiro Bando<br />
Linfeng Zheng<br />
Longsheng Du<br />
Mina Sun<br />
Mindy Li<br />
Pradeep Ramani<br />
Questa Wang<br />
Serif Yesil<br />
Tao Xie<br />
Tina Li<br />
Vicki Wang<br />
Vincent Zhang<br />
Vijay Thakkar<br />
Xiao Dong<br />
Xiaolei Shi<br />
Xinyu Wang<br />
Yihan Chen<br />
Yuhan Li<br />
Zekun Fan<br />
*** _Sorted in alphabetical order._
# CuTe Developers
Cris Cecka<br />
Vijay Thakkar<br />
# CUTLASS Product Manager
Matthew Nicely<br />
# Former CUTLASS Developers
Manish Gupta<br />
Duane Merrill<br />
Piotr Majcher<br />
Naila Farooqui<br />
Mark Hoemmen<br />
Rawn Henry<br />
Jin Wang<br />
Timmy Liu<br />
Manikandan Ananth<br />
David Tanner<br />
# Acknowledgements
Tri Dao<br />
Jay Shah<br />
Mehdi Amini<br />
Larry Wu<br />
Justin Holewinski<br />
Timothy Costa<br />
Julien Demouth<br />
Brian Fahs<br />
Michael Garland<br />
Michael Goldfarb<br />
Mostafa Hagog<br />
Fei Hu<br />
Alan Kaatz<br />
Wei Liu<br />
Tim Martin<br />
Kevin Siu<br />
Markus Tavenrath<br />
John Tran<br />
Yang Xu<br />
Scott Yokim<br />
Girish Bharambe<br />
Luke Durant<br />
Carter Edwards<br />
Olivier Giroux<br />
Stephen Jones<br />
Rishkul Kulkarni<br />
Bryce Lelbach<br />
Joel McCormack<br />
Kyrylo Perelygin<br />
Sean Treichler<br />
# Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

369
CUDA.cmake Normal file

@@ -0,0 +1,369 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUDA_COMPILER MATCHES "[Cc]lang")
  message(WARNING "CUDA_COMPILER flag is deprecated, set CMAKE_CUDA_COMPILER to desired compiler executable.")
  set(__CLANG_DEVICE_COMPILATION_REQUESTED ON)
elseif(CUDA_COMPILER)
  message(WARNING "Deprecated flag CUDA_COMPILER used with unknown argument ${CUDA_COMPILER}, ignoring.")
endif()

if (__CLANG_DEVICE_COMPILATION_REQUESTED AND NOT DEFINED CMAKE_CUDA_COMPILER)
  set(CMAKE_CUDA_COMPILER clang++) # We will let the system find Clang or error out
endif()
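# For reference, a sketch of the preferred configure-time usage that the
# deprecation warning above points to (paths illustrative):
#
#   cmake -DCMAKE_CUDA_COMPILER=clang++ ..
#   cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..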
enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)

if(NOT CUDA_VERSION)
  # For backward compatibility with older CMake code.
  set(CUDA_VERSION ${CUDAToolkit_VERSION})
  set(CUDA_VERSION_MAJOR ${CUDAToolkit_VERSION_MAJOR})
  set(CUDA_VERSION_MINOR ${CUDAToolkit_VERSION_MINOR})
endif()

if(NOT CUDA_TOOLKIT_ROOT_DIR)
  # In some scenarios, such as clang device compilation, the toolkit root may not be set, so we
  # force it here to the nvcc we found via the CUDAToolkit package.
  get_filename_component(CUDA_TOOLKIT_ROOT_DIR "${CUDAToolkit_NVCC_EXECUTABLE}/../.." ABSOLUTE)
endif()

if (CMAKE_CUDA_COMPILER_ID MATCHES "(nvcc|[Nn][Vv][Ii][Dd][Ii][Aa])")
  set(CUTLASS_NVCC_DEVICE_COMPILE ON CACHE BOOL "Using nvcc tools for device compilation")
elseif (CMAKE_CUDA_COMPILER_ID MATCHES "[Cc]lang")
  set(CUTLASS_CLANG_DEVICE_COMPILE ON CACHE BOOL "Using Clang tools for device compilation")
else()
  message(FATAL_ERROR "Unknown device-side compiler ${CMAKE_CUDA_COMPILER_ID} found. Set CMAKE_CUDA_COMPILER to either nvcc or clang++.")
endif()

if (CUTLASS_CLANG_DEVICE_COMPILE AND CMAKE_VERSION VERSION_LESS_EQUAL "3.30")
  message(FATAL_ERROR "Clang device compilation for CUTLASS requires CMake 3.30 or higher.")
endif()

if (CUDA_VERSION VERSION_LESS 9.2)
  message(FATAL_ERROR "CUDA 9.2+ required, found ${CUDA_VERSION}.")
endif()
find_library(
  CUDART_LIBRARY cudart
  PATHS
    ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES
    lib/x86_64-linux-gnu
    lib/x64
    lib64
    lib
  NO_DEFAULT_PATH
  # We aren't going to search any system paths. We want to find the runtime
  # in the CUDA toolkit we're building against.
)

if(NOT TARGET cudart AND CUDART_LIBRARY)
  message(STATUS "CUDART: ${CUDART_LIBRARY}")
  if(WIN32)
    add_library(cudart STATIC IMPORTED GLOBAL)
    # Even though we're linking against a .dll, in Windows you statically link against
    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
    # from the PATH search.
  else()
    add_library(cudart SHARED IMPORTED GLOBAL)
  endif()
  add_library(nvidia::cudart ALIAS cudart)
  set_property(
    TARGET cudart
    PROPERTY IMPORTED_LOCATION
    ${CUDART_LIBRARY}
  )
elseif(TARGET cudart)
  message(STATUS "CUDART: Already Found")
else()
  message(STATUS "CUDART: Not Found")
endif()
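# A minimal consumer sketch (target and file names illustrative): downstream
# code links the runtime through the nvidia::cudart alias defined above.
#
#   add_executable(my_tool main.cu)
#   target_link_libraries(my_tool PRIVATE nvidia::cudart)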
find_library(
  CUDA_DRIVER_LIBRARY cuda
  PATHS
    ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES
    lib/x86_64-linux-gnu
    lib/x64
    lib64
    lib
    lib64/stubs
    lib/stubs
  NO_DEFAULT_PATH
  # We aren't going to search any system paths. We want to find the runtime
  # in the CUDA toolkit we're building against.
)

if(NOT TARGET cuda_driver AND CUDA_DRIVER_LIBRARY)
  message(STATUS "CUDA Driver: ${CUDA_DRIVER_LIBRARY}")
  if(WIN32)
    add_library(cuda_driver STATIC IMPORTED GLOBAL)
    # Even though we're linking against a .dll, in Windows you statically link against
    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
    # from the PATH search.
  else()
    add_library(cuda_driver SHARED IMPORTED GLOBAL)
  endif()
  add_library(nvidia::cuda_driver ALIAS cuda_driver)
  set_property(
    TARGET cuda_driver
    PROPERTY IMPORTED_LOCATION
    ${CUDA_DRIVER_LIBRARY}
  )
elseif(TARGET cuda_driver)
  message(STATUS "CUDA Driver: Already Found")
else()
  message(STATUS "CUDA Driver: Not Found")
endif()
find_library(
  NVRTC_LIBRARY nvrtc
  PATHS
    ${CUDA_TOOLKIT_ROOT_DIR}
  PATH_SUFFIXES
    lib/x64
    lib64
    lib
  NO_DEFAULT_PATH
  # We aren't going to search any system paths. We want to find the runtime
  # in the CUDA toolkit we're building against.
)

if(NOT TARGET nvrtc AND NVRTC_LIBRARY)
  message(STATUS "NVRTC: ${NVRTC_LIBRARY}")
  if(WIN32)
    add_library(nvrtc STATIC IMPORTED GLOBAL)
    # Even though we're linking against a .dll, in Windows you statically link against
    # the .lib file found under lib/x64. The .dll will be loaded at runtime automatically
    # from the PATH search.
  else()
    add_library(nvrtc SHARED IMPORTED GLOBAL)
  endif()
  add_library(nvidia::nvrtc ALIAS nvrtc)
  set_property(
    TARGET nvrtc
    PROPERTY IMPORTED_LOCATION
    ${NVRTC_LIBRARY}
  )
elseif(TARGET nvrtc)
  message(STATUS "NVRTC: Already Found")
else()
  message(STATUS "NVRTC: Not Found")
endif()
include_directories(SYSTEM ${CUDA_INCLUDE_DIRS})
# Some platforms (e.g. Visual Studio) don't add the CUDA include directories to the system include
# paths by default, so we add it explicitly here.

if (MSVC OR CUTLASS_LIBRARY_KERNELS MATCHES "all")
  set(CUTLASS_UNITY_BUILD_ENABLED_INIT ON)
else()
  set(CUTLASS_UNITY_BUILD_ENABLED_INIT OFF)
endif()

set(CUTLASS_UNITY_BUILD_ENABLED ${CUTLASS_UNITY_BUILD_ENABLED_INIT} CACHE BOOL "Enable combined source compilation")

if (MSVC)
  set(CUTLASS_UNITY_BUILD_BATCH_SIZE_INIT 8)
else()
  set(CUTLASS_UNITY_BUILD_BATCH_SIZE_INIT 16)
endif()

set(CUTLASS_UNITY_BUILD_BATCH_SIZE ${CUTLASS_UNITY_BUILD_BATCH_SIZE_INIT} CACHE STRING "Batch size for unified source files")
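# Both cache variables can be overridden at configure time; a sketch with
# illustrative values:
#
#   cmake -DCUTLASS_UNITY_BUILD_ENABLED=ON -DCUTLASS_UNITY_BUILD_BATCH_SIZE=4 ..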
function(cutlass_unify_source_files TARGET_ARGS_VAR)

  set(options)
  set(oneValueArgs BATCH_SOURCES BATCH_SIZE)
  set(multiValueArgs)
  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

  if (NOT DEFINED TARGET_ARGS_VAR)
    message(FATAL_ERROR "TARGET_ARGS_VAR parameter is required")
  endif()

  if (NOT DEFINED __BATCH_SOURCES)
    set(__BATCH_SOURCES ON)
  endif()

  if (__BATCH_SOURCES AND NOT DEFINED __BATCH_SIZE)
    set(__BATCH_SIZE ${CUTLASS_UNITY_BUILD_BATCH_SIZE})
  endif()

  if (CUTLASS_UNITY_BUILD_ENABLED AND __BATCH_SOURCES AND __BATCH_SIZE GREATER 1)

    set(CUDA_FILE_ARGS)
    set(TARGET_SOURCE_ARGS)

    foreach(ARG ${__UNPARSED_ARGUMENTS})
      if(${ARG} MATCHES ".*\\.cu$")
        list(APPEND CUDA_FILE_ARGS ${ARG})
      else()
        list(APPEND TARGET_SOURCE_ARGS ${ARG})
      endif()
    endforeach()

    list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
    while(NUM_CUDA_FILE_ARGS GREATER 0)
      list(SUBLIST CUDA_FILE_ARGS 0 ${__BATCH_SIZE} CUDA_FILE_BATCH)
      string(SHA256 CUDA_FILE_BATCH_HASH "${CUDA_FILE_BATCH}")
      string(SUBSTRING ${CUDA_FILE_BATCH_HASH} 0 12 CUDA_FILE_BATCH_HASH)
      set(BATCH_FILE ${CMAKE_CURRENT_BINARY_DIR}/${NAME}.unity.${CUDA_FILE_BATCH_HASH}.cu)
      message(STATUS "Generating ${BATCH_FILE}")
      file(WRITE ${BATCH_FILE} "// Unity File - Auto Generated!\n")
      foreach(CUDA_FILE ${CUDA_FILE_BATCH})
        get_filename_component(CUDA_FILE_ABS_PATH ${CUDA_FILE} ABSOLUTE)
        file(APPEND ${BATCH_FILE} "#include \"${CUDA_FILE_ABS_PATH}\"\n")
      endforeach()
      list(APPEND TARGET_SOURCE_ARGS ${BATCH_FILE})
      if (NUM_CUDA_FILE_ARGS LESS_EQUAL __BATCH_SIZE)
        break()
      endif()
      list(SUBLIST CUDA_FILE_ARGS ${__BATCH_SIZE} -1 CUDA_FILE_ARGS)
      list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
    endwhile()

  else()
    set(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
  endif()

  set(${TARGET_ARGS_VAR} ${TARGET_SOURCE_ARGS} PARENT_SCOPE)

endfunction()
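# For a batch size of 2 and sources a.cu and b.cu (names illustrative), the
# unity file generated by the file(WRITE)/file(APPEND) calls above would
# contain:
#
#   // Unity File - Auto Generated!
#   #include "/abs/path/to/a.cu"
#   #include "/abs/path/to/b.cu"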
function(cutlass_add_library NAME)

  set(options SKIP_GENCODE_FLAGS)
  set(oneValueArgs EXPORT_NAME)
  set(multiValueArgs)
  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
  add_library(${NAME} ${TARGET_SOURCE_ARGS} "")

  cutlass_apply_standard_compile_options(${NAME})
  if (NOT __SKIP_GENCODE_FLAGS)
    cutlass_apply_cuda_gencode_flags(${NAME})
  endif()

  target_compile_features(
    ${NAME}
    INTERFACE
    cxx_std_11
  )

  get_target_property(TARGET_TYPE ${NAME} TYPE)

  if (TARGET_TYPE MATCHES "SHARED")
    set_target_properties(${NAME} PROPERTIES CUDA_RUNTIME_LIBRARY Shared)
  elseif(TARGET_TYPE MATCHES "STATIC")
    set_target_properties(${NAME} PROPERTIES CUDA_RUNTIME_LIBRARY Static)
  endif()

  if(__EXPORT_NAME)
    add_library(nvidia::cutlass::${__EXPORT_NAME} ALIAS ${NAME})
    set_target_properties(${NAME} PROPERTIES EXPORT_NAME ${__EXPORT_NAME})
  endif()

endfunction()
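# Hypothetical usage sketch (target, export, and file names illustrative):
# non-option arguments pass through cutlass_unify_source_files, so the .cu
# files may be batched into unity files before reaching add_library.
#
#   cutlass_add_library(my_kernels EXPORT_NAME my_kernels
#     gemm_f16.cu gemm_f32.cu helpers.cpp)
#   # the target is also exported as nvidia::cutlass::my_kernels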
function(cutlass_add_executable NAME)

  set(options)
  set(oneValueArgs CUDA_RUNTIME_LIBRARY)
  set(multiValueArgs)
  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

  if (NOT DEFINED __CUDA_RUNTIME_LIBRARY)
    set(__CUDA_RUNTIME_LIBRARY Shared)
  endif()

  set(__CUDA_RUNTIME_LIBRARY_ALLOWED None Shared Static)
  if (NOT __CUDA_RUNTIME_LIBRARY IN_LIST __CUDA_RUNTIME_LIBRARY_ALLOWED)
    message(FATAL_ERROR "CUDA_RUNTIME_LIBRARY value '${__CUDA_RUNTIME_LIBRARY}' is not in allowed list of '${__CUDA_RUNTIME_LIBRARY_ALLOWED}'")
  endif()

  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
  add_executable(${NAME} ${TARGET_SOURCE_ARGS})

  cutlass_apply_standard_compile_options(${NAME})
  cutlass_apply_cuda_gencode_flags(${NAME})

  target_compile_features(
    ${NAME}
    INTERFACE
    cxx_std_11
  )

  set_target_properties(${NAME} PROPERTIES CUDA_RUNTIME_LIBRARY ${__CUDA_RUNTIME_LIBRARY})

endfunction()
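# Hypothetical usage sketch (names illustrative): an example binary linked
# statically against the CUDA runtime, using the allowed-value check above.
#
#   cutlass_add_executable(my_example CUDA_RUNTIME_LIBRARY Static example.cu)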
function(cutlass_target_sources NAME)

  set(options)
  set(oneValueArgs)
  set(multiValueArgs)
  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

  cutlass_unify_source_files(TARGET_SOURCE_ARGS ${__UNPARSED_ARGUMENTS})
  target_sources(${NAME} ${TARGET_SOURCE_ARGS})

endfunction()

Doxyfile

@@ -32,7 +32,7 @@ DOXYFILE_ENCODING = UTF-8
# title of most generated pages and in a few other places.
# The default value is: My Project.
PROJECT_NAME = "Cutlass"
PROJECT_NAME = "CUTLASS"
# The PROJECT_NUMBER tag can be used to enter a project or revision number. This
# could be handy for archiving the generated documentation or if some version
@@ -51,14 +51,14 @@ PROJECT_BRIEF = "CUDA Templates for Linear Algebra Subroutines and Solv
# and the maximum width should not exceed 200 pixels. Doxygen will copy the logo
# to the output directory.
PROJECT_LOGO =
PROJECT_LOGO = media/images/cutlass-logo-small.png
# The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path
# into which the generated documentation will be written. If a relative path is
# entered, it will be relative to the location where doxygen was started. If
# left blank the current directory will be used.
OUTPUT_DIRECTORY = docs
OUTPUT_DIRECTORY = doxygen
# If the CREATE_SUBDIRS tag is set to YES, then doxygen will create 4096 sub-
# directories (in 2 levels) under the output directory of each output format and
@@ -206,7 +206,7 @@ SEPARATE_MEMBER_PAGES = NO
# uses this value to replace tabs by spaces in code fragments.
# Minimum value: 1, maximum value: 16, default value: 4.
TAB_SIZE = 4
TAB_SIZE = 2
# This tag can be used to specify a number of aliases that act as commands in
# the documentation. An alias has the form:
@@ -297,7 +297,7 @@ AUTOLINK_SUPPORT = YES
# diagrams that involve STL classes more complete and accurate.
# The default value is: NO.
BUILTIN_STL_SUPPORT = NO
BUILTIN_STL_SUPPORT = YES
# If you use Microsoft's C++/CLI language, you should set this option to YES to
# enable parsing support.
@@ -734,7 +734,9 @@ WARN_LOGFILE =
# spaces.
# Note: If this tag is empty the current directory is searched.
INPUT = cutlass
INPUT = include/cutlass tools/util/include/cutlass/ tools/library/include/cutlass/
INPUT += media/docs/doxygen_mainpage.md
# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
@@ -870,7 +872,7 @@ FILTER_SOURCE_PATTERNS =
# (index.html). This can be useful if you have a project on for instance GitHub
# and want to reuse the introduction page also for the doxygen output.
USE_MDFILE_AS_MAINPAGE =
USE_MDFILE_AS_MAINPAGE = media/docs/doxygen_mainpage.md
#---------------------------------------------------------------------------
# Configuration options related to source browsing
@@ -999,7 +1001,7 @@ GENERATE_HTML = YES
# The default directory is: html.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_OUTPUT = generated-html
HTML_OUTPUT =
# The HTML_FILE_EXTENSION tag can be used to specify the file extension for each
# generated HTML page (for example: .htm, .php, .asp).
@@ -1080,7 +1082,7 @@ HTML_EXTRA_FILES =
# Minimum value: 0, maximum value: 359, default value: 220.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_COLORSTYLE_HUE = 82
HTML_COLORSTYLE_HUE = 100
# The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors
# in the HTML output. For a value of 0 the output will use grayscales only. A
@@ -1088,7 +1090,7 @@ HTML_COLORSTYLE_HUE = 82
# Minimum value: 0, maximum value: 255, default value: 100.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_COLORSTYLE_SAT = 100
HTML_COLORSTYLE_SAT = 50
# The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the
# luminance component of the colors in the HTML output. Values below 100
@@ -1107,7 +1109,7 @@ HTML_COLORSTYLE_GAMMA = 80
# The default value is: YES.
# This tag requires that the tag GENERATE_HTML is set to YES.
HTML_TIMESTAMP = YES
HTML_TIMESTAMP = NO
# If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML
# documentation will contain sections that can be hidden and shown after the

188
EULA.txt Normal file

@@ -0,0 +1,188 @@
NVIDIA Software License Agreement
IMPORTANT NOTICE - PLEASE READ AND AGREE BEFORE USING THE SOFTWARE
This software license agreement (“Agreement”) is a legal agreement between you, whether an individual or entity, (“you”) and NVIDIA Corporation (“NVIDIA”) and governs the use of the NVIDIA CUTLASS DSLs software and materials that NVIDIA delivers to you under this Agreement (“Software”).
NVIDIA and you are each a “party” and collectively the “parties.”
This Agreement can be accepted only by an adult of legal age of majority in the country in which the Software is used.
If you don’t have the required age or authority to accept this Agreement, or if you don’t accept all the terms and conditions of this Agreement, do not use the Software.
1. License Grants
1.1. License Grant to You. The Software made available by NVIDIA to you is licensed, not sold.
Subject to the terms of this Agreement, NVIDIA grants you a limited, non-exclusive, revocable, non-transferable, and non-sublicensable (except as expressly granted in this Agreement), license to:
a. install and use copies of the Software,
b. configure the Software using configuration files provided (if applicable),
c. modify and create derivative works of any sample or example source code NVIDIA delivers to you as part of the Software (“Derivatives”) (if applicable), and
d. distribute python files in the Software package in source format as incorporated into a software application subject to the following distribution requirements:
i. Your application must have material additional functionality, beyond the included portions of the Software.
ii. The distributable portions of the Software shall only be accessed by your application.
iii. The following notice shall be included in modifications and derivative works of sample source code distributed: “This software contains source code provided by NVIDIA Corporation.”
iv. Unless a developer tool is identified in this Agreement as distributable, it is delivered for your internal use only.
v. The terms under which you distribute your application must be consistent with the terms of this Agreement, including (without limitation) terms relating to the license grant and license restrictions and protection of NVIDIA’s intellectual property rights.
vi. Additionally, you agree that you will protect the privacy, security and legal rights of your application users.
The foregoing (a) through (d) are, collectively, the “Purpose”, and the developed applications are only for use in systems with NVIDIA GPUs.
1.2. License Grant to NVIDIA. Subject to the terms of this Agreement, you grant NVIDIA and its affiliates a non-exclusive, perpetual, irrevocable, sublicensable, worldwide, royalty-free, fully paid-up and transferable license, under your intellectual property rights, to publicly perform, publicly display, reproduce, use, make, have made, sell, offer for sale, distribute (through multiple tiers of distribution), import, create derivative works of and otherwise commercialize and exploit at NVIDIA’s discretion any Derivatives created by or for you.
You may, but are not required to, deliver any Derivatives to NVIDIA.
2. License Restrictions
Your license to use the Software and Derivatives is restricted as stated in this Section 2 (“License Restrictions”).
You will cooperate with NVIDIA and, upon NVIDIA’s written request, you will confirm in writing and provide reasonably requested information to verify your compliance with the terms of this Agreement.
You may not:
2.1. Use the Software or Derivatives for any purpose other than the Purpose;
2.2. Sell, rent, sublicense, transfer, distribute or otherwise make available to others (except authorized users as stated in Section 3 (“Authorized Users”)) any portion of the Software or Derivatives, except as expressly granted in Section 1.1 (“License Grant to You”);
2.3. Reverse engineer, decompile, or disassemble the Software components provided in binary form, nor attempt in any other manner to obtain source code of such Software;
2.4. Modify or create derivative works of the Software, except as expressly granted in Section 1.1 (“License Grant to You”);
2.5. Change or remove copyright or other proprietary notices in the Software;
2.6. Bypass, disable, or circumvent any technical limitation, encryption, security, digital rights management or authentication mechanism in the Software;
2.7. Use the Software or Derivatives in any manner that would cause them to become subject to an open source software license, subject to the terms in Section 6 (“Components Under Other Licenses”);
2.8. Use the Software or Derivatives in violation of any applicable law or regulation in relevant jurisdictions;
2.9. Indicate that a product or service developed with the Software or Derivatives is sponsored or endorsed by NVIDIA;
2.10. Replace any NVIDIA software components in the Software that are governed by this Agreement with other software that implements NVIDIA APIs;
2.11. Reverse engineer, decompile or disassemble any portion of the output generated using Software elements for the purpose of translating such output artifacts to target a non-NVIDIA platform; or
3. Authorized Users
You may allow employees and contractors of your entity or of your subsidiary(ies), and for educational institutions also enrolled students, to internally access and use the Software as authorized by this Agreement from your secure network to perform the work authorized by this Agreement on your behalf.
You are responsible for the compliance with the terms of this Agreement by your authorized users.
Any act or omission that if committed by you would constitute a breach of this Agreement will be deemed to constitute a breach of this Agreement if committed by your authorized users.
4. Pre-Release
Software versions identified as alpha, beta, preview, early access or otherwise as pre-release (“Pre-Release”) may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability and reliability standards relative to NVIDIA commercial offerings.
You use Pre-Release Software at your own risk. NVIDIA did not design or test the Software for use in production or business-critical systems.
NVIDIA may choose not to make available a commercial version of Pre-Release Software.
NVIDIA may also choose to abandon development and terminate the availability of Pre-Release Software at any time without liability.
5. Updates
NVIDIA may at any time and at its option, change, discontinue, or deprecate any part, or all, of the Software, or change or remove features or functionality, or make available patches, workarounds or other updates to the Software.
Unless the updates are provided with their separate governing terms, they are deemed part of the Software licensed to you under this Agreement, and your continued use of the Software is deemed acceptance of such changes.
6. Components Under Other Licenses
The Software may include or be distributed with components provided with separate legal notices or terms that accompany the components, such as open source software licenses and other license terms (“Other Licenses”).
The components are subject to the applicable Other Licenses, including any proprietary notices, disclaimers, requirements and extended use rights;
except that this Agreement will prevail regarding the use of third-party open source software, unless a third-party open source software license requires its license terms to prevail.
Open source software license means any software, data or documentation subject to any license identified as an open source license by the Open Source Initiative (http://opensource.org), Free Software Foundation (http://www.fsf.org) or other similar open source organization or listed by the Software Package Data Exchange (SPDX) Workgroup under the Linux Foundation (http://www.spdx.org).
7. Ownership
7.1. NVIDIA Ownership. The Software, including all intellectual property rights, is and will remain the sole and exclusive property of NVIDIA or its licensors.
Except as expressly granted in this Agreement, (a) NVIDIA reserves all rights, interests and remedies in connection with the Software, and (b) no other license or right is granted to you by implication, estoppel or otherwise.
7.2. Your Ownership. Subject to the rights of NVIDIA and its suppliers in the Software, which continue to be licensed as stated in this Agreement, even when incorporated in your products or services, and the extent permitted by applicable law, as between you and NVIDIA, you hold all rights, title and interest in and to your products, services and Derivatives you develop as permitted in this Agreement including their respective intellectual property rights.
8. Feedback
You may, but you are not obligated to, provide suggestions, requests, fixes, modifications, enhancements, or other feedback regarding the Software (collectively, “Feedback”).
Feedback, even if designated as confidential by you, will not create any confidentiality obligation for NVIDIA or its affiliates.
If you provide Feedback, you grant NVIDIA, its affiliates and its designees a non-exclusive, perpetual, irrevocable, sublicensable, worldwide, royalty-free, fully paid-up and transferable license, under your intellectual property rights, to publicly perform, publicly display, reproduce, use, make, have made, sell, offer for sale, distribute (through multiple tiers of distribution), import, create derivative works of and otherwise commercialize and exploit the Feedback at NVIDIA’s discretion.
9. Termination
9.1. Termination. This Agreement will automatically terminate without notice from NVIDIA if you fail to comply with any of the terms in this Agreement or if you commence or participate in any legal proceeding against NVIDIA with respect to the Software.
Additionally, either party may terminate this Agreement at any time with thirty (30) days advance written notice to the other party.
9.2. Effect of Termination. Upon any expiration or termination of this Agreement, you will promptly (a) stop using and return, delete or destroy NVIDIA confidential information and all Software received under this Agreement, and (b) delete or destroy Derivatives created under this Agreement, unless an authorized NVIDIA representative provides prior written approval that you may keep a copy of the Derivatives solely for archival purposes.
Upon written request, you will certify in writing that you have complied with your obligations under this Section 9.2 (“Effect of Termination”).
9.3. Survival. Section 1.2 (“License Grant to NVIDIA”), Section 5 (“Updates”), Section 6 (“Components Under Other Licenses”), Section 7 (“Ownership”), Section 8 (“Feedback”), Section 9.2 (“Effect of Termination”), Section 9.3 (“Survival”), Section 10 (“Disclaimer of Warranties”), Section 11 (“Limitation of Liability”), Section 12 (“Use in Mission Critical Applications”), Section 13 (“Governing Law and Jurisdiction”), Section 14 (“Indemnity”) and Section 15 (“General”) will survive any expiration or termination of this Agreement.
10. Disclaimer of Warranties
THE SOFTWARE IS PROVIDED BY NVIDIA AS-IS AND WITH ALL FAULTS. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, NVIDIA DISCLAIMS ALL WARRANTIES AND REPRESENTATIONS OF ANY KIND, WHETHER EXPRESS, IMPLIED OR STATUTORY, RELATING TO OR ARISING UNDER THIS AGREEMENT, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF TITLE, NONINFRINGEMENT, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, USAGE OF TRADE AND COURSE OF DEALING. NVIDIA DOES NOT WARRANT OR ASSUME RESPONSIBILITY FOR THE ACCURACY OR COMPLETENESS OF ANY THIRD-PARTY INFORMATION, TEXT, GRAPHICS, LINKS CONTAINED IN THE SOFTWARE.
WITHOUT LIMITING THE FOREGOING, NVIDIA DOES NOT WARRANT THAT THE SOFTWARE WILL MEET YOUR REQUIREMENTS, ANY DEFECTS OR ERRORS WILL BE CORRECTED, ANY CERTAIN CONTENT WILL BE AVAILABLE; OR THAT THE SOFTWARE IS FREE OF VIRUSES OR OTHER HARMFUL COMPONENTS. NO INFORMATION OR ADVICE GIVEN BY NVIDIA WILL IN ANY WAY INCREASE THE SCOPE OF ANY WARRANTY EXPRESSLY PROVIDED IN THIS AGREEMENT.
NVIDIA does not warrant or assume responsibility for the accuracy or completeness of any third-party information, text, graphics or links contained in the Software.
11. Limitation of Liability
11.1. EXCLUSIONS. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY (i) INDIRECT, PUNITIVE, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, OR (ii) DAMAGES FOR (a) THE COST OF PROCURING SUBSTITUTE GOODS, OR (b) LOSS OF PROFITS, REVENUES, USE, DATA OR GOODWILL ARISING OUT OF OR RELATED TO THIS AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY, OR OTHERWISE, AND EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND EVEN IF A PARTY’S REMEDIES FAIL THEIR ESSENTIAL PURPOSE.
11.2. DAMAGES CAP. ADDITIONALLY, TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, NVIDIA’S TOTAL CUMULATIVE AGGREGATE LIABILITY FOR ANY AND ALL LIABILITIES, OBLIGATIONS OR CLAIMS ARISING OUT OF OR RELATED TO THIS AGREEMENT WILL NOT EXCEED FIVE U.S. DOLLARS (US$5).
12. Use in Mission Critical Applications
You acknowledge that the Software provided under this Agreement is not designed or tested by NVIDIA for use in any system or application where the use or failure of such system or application developed with NVIDIA’s Software could result in injury, death or catastrophic damage (each, a “Mission Critical Application”).
Examples of Mission Critical Applications include use in avionics, navigation, autonomous vehicle applications, AI solutions for automotive products, military, medical, life support or other mission-critical or life-critical applications.
NVIDIA will not be liable to you or any third party, in whole or in part, for any claims or damages arising from these uses.
You are solely responsible for ensuring that systems and applications developed with the Software include sufficient safety and redundancy features and comply with all applicable legal and regulatory standards and requirements.
13. Governing Law and Jurisdiction
This Agreement will be governed in all respects by the laws of the United States and the laws of the State of Delaware, without regard to conflict of laws principles or the United Nations Convention on Contracts for the International Sale of Goods.
The state and federal courts residing in Santa Clara County, California will have exclusive jurisdiction over any dispute or claim arising out of or related to this Agreement, and the parties irrevocably consent to personal jurisdiction and venue in those courts;
except that either party may apply for injunctive remedies or an equivalent type of urgent legal relief in any jurisdiction.
14. Indemnity
By using the Software you agree to defend, indemnify and hold harmless NVIDIA and its affiliates and their respective officers, directors, employees and agents from and against any claims, disputes, demands, liabilities, damages, losses, costs and expenses arising out of or in any way connected with (i) products or services that have been developed or deployed with or use the Software, or claims that they violate laws, or infringe, violate, or misappropriate any third party right;
or (ii) use of the Software in breach of the terms of this Agreement.
15. General
15.1. Independent Contractors.
The parties are independent contractors, and this Agreement does not create a joint venture, partnership, agency, or other form of business association between the parties.
Neither party will have the power to bind the other party or incur any obligation on its behalf without the other party’s prior written consent.
Nothing in this Agreement prevents either party from participating in similar arrangements with third parties.
15.2. No Assignment.
NVIDIA may assign, delegate or transfer its rights or obligations under this Agreement by any means or operation of law.
You may not, without NVIDIA’s prior written consent, assign, delegate or transfer any of your rights or obligations under this Agreement by any means or operation of law, and any attempt to do so is null and void.
15.3. No Waiver.
No failure or delay by a party to enforce any term or obligation of this Agreement will operate as a waiver by that party, or prevent the enforcement of such term or obligation later.
15.4. Trade Compliance.
You agree to comply with all applicable export, import, trade and economic sanctions laws and regulations, as amended, including without limitation U.S. Export Administration Regulations and Office of Foreign Assets Control regulations.
You confirm (a) your understanding that export or reexport of certain NVIDIA products or technologies may require a license or other approval from appropriate authorities and (b) that you will not export or reexport any products or technology, directly or indirectly, without first obtaining any required license or other approval from appropriate authorities, (i) to any countries that are subject to any U.S. or local export restrictions (currently including, but not necessarily limited to, Belarus, Cuba, Iran, North Korea, Russia, Syria, the Region of Crimea, Donetsk People’s Republic Region and Luhansk People’s Republic Region);
(ii) to any end-user who you know or have reason to know will utilize them in the design, development or production of nuclear, chemical or biological weapons, missiles, rocket systems, unmanned air vehicles capable of a maximum range of at least 300 kilometers, regardless of payload, or intended for military end-use, or any weapons of mass destruction;
(iii) to any end-user who has been prohibited from participating in the U.S. or local export transactions by any governing authority;
or (iv) to any known military or military-intelligence end-user or for any known military or military-intelligence end-use in accordance with U.S. trade compliance laws and regulations.
15.5. Government Rights.
The Software, documentation and technology (“Protected Items”) are “Commercial products” as this term is defined at 48 C.F.R. 2.101, consisting of “commercial computer software” and “commercial computer software documentation” as such terms are used in, respectively, 48 C.F.R. 12.212 and 48 C.F.R. 227.7202 & 252.227-7014(a)(1).
Before any Protected Items are supplied to the U.S. Government, you will (i) inform the U.S. Government in writing that the Protected Items are and must be treated as commercial computer software and commercial computer software documentation developed at private expense;
(ii) inform the U.S. Government that the Protected Items are provided subject to the terms of the Agreement;
and (iii) mark the Protected Items as commercial computer software and commercial computer software documentation developed at private expense.
In no event will you permit the U.S. Government to acquire rights in Protected Items beyond those specified in 48 C.F.R. 52.227-19(b)(1)-(2) or 252.227-7013(c) except as expressly approved by NVIDIA in writing.
15.6. Notices.
Please direct your legal notices or other correspondence to legalnotices@nvidia.com with a copy mailed to NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, California 95051, United States of America, Attention: Legal Department.
If NVIDIA needs to contact you, you consent to receive the notices by email and agree that such notices will satisfy any legal communication requirements.
15.7. Severability.
If a court of competent jurisdiction rules that a provision of this Agreement is unenforceable, that provision will be deemed modified to the extent necessary to make it enforceable and the remainder of this Agreement will continue in full force and effect.
15.8. Amendment.
Any amendment to this Agreement must be in writing and signed by authorized representatives of both parties.
15.9. Construction.
The headings in the Agreement are included solely for convenience and are not intended to affect the meaning or interpretation of the Agreement.
As required by the context of the Agreement, the singular of a term includes the plural and vice versa.
15.10. Force Majeure.
Neither party will be liable during any period where an event or circumstance prevents or delays that party from performing its obligations under this Agreement and that event or circumstance: (i) is not within the reasonable control of that party and is not the result of that party’s negligence, and (ii) cannot be overcome or avoided by that party using reasonably diligent efforts.
15.11. Entire Agreement.
Regarding the subject matter of this Agreement, the parties agree that (a) this Agreement constitutes the entire and exclusive agreement between the parties and supersedes all prior and contemporaneous communications and (b) any additional or different terms or conditions, whether contained in purchase orders, order acknowledgments, invoices or otherwise, will not be binding and are null and void.
(v. May 8, 2025)


@@ -1,23 +0,0 @@
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

34
LICENSE.txt Normal file

@@ -0,0 +1,34 @@
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Certain files within this repository are subject to separate licensing terms:
- The files located in the `python/CuTeDSL` directory are licensed under the
NVIDIA End User License Agreement (EULA). Please refer to
https://docs.nvidia.com/cutlass/media/docs/pythonDSL/license.html
for the full terms.

104
PUBLICATIONS.md Normal file

@@ -0,0 +1,104 @@
# Publications Using Cutlass
## 2025
- ["Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts"](https://arxiv.org/abs/2502.19811). Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu. _arXiv_, February 2025.
- ["ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"](https://arxiv.org/abs/2502.02631). Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra. _arXiv_, February 2025.
- ["Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light"](https://arxiv.org/abs/2504.16922). Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi. _arXiv_, April 2025.
## 2024
- ["DeepSeek-V3 Technical Report"](https://arxiv.org/abs/2412.19437). DeepSeek-AI. _arXiv_, December 2024.
- ["ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"](https://arxiv.org/abs/2410.21465). Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. _arXiv_, October 2024.
- ["FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion"](https://arxiv.org/abs/2406.06858). Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu. _arXiv_, June 2024.
- ["EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree"](https://dl.acm.org/doi/10.1145/3620666.3651369). Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. _Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, April 2024.
- ["Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level"](https://arxiv.org/abs/2403.04690). Ali Hassani, Wen-Mei Hwu, Humphrey Shi. _arXiv_, March 2024.
## 2023
- ["A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library"](https://arxiv.org/abs/2312.11918). Ganesh Bikshandi, Jay Shah. _arXiv_, December 2023.
- ["Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS"](https://www.mdpi.com/2076-3417/13/24/13022). Xuanteng Huang, Xianwei Zhang, Panfei Yang, Nong Xiao. _Journal of Applied Sciences_, December 2023.
- ["A Speed Odyssey for Deployable Quantization of LLMs"](https://arxiv.org/abs/2311.09550). Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie. _arXiv_, November 2023.
- ["FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning"](https://arxiv.org/abs/2307.08691). Tri Dao. _Technical Report_, July 2023.
- ["MegaBlocks: Efficient Sparse Training with Mixture-of-Experts"](https://arxiv.org/abs/2211.15841). Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia. _Proceedings of the Sixth Machine Learning and Systems_, May 2023.
- ["ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs"](https://arxiv.org/abs/2210.03052). Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu. _Proceedings of the 37th IEEE International Parallel & Distributed Processing Symposium (Best Paper)_, May 2023.
- ["A Framework for Fine-Grained Synchronization of Dependent GPU Kernels"](https://arxiv.org/abs/2305.13450). Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi. _Computing Research Repository_, May 2023.
- ["Graphene: An IR for Optimized Tensor Computations on GPUs"](https://dl.acm.org/doi/pdf/10.1145/3582016.3582018). Hagedorn, Bastian, Bin Fan, Hanfeng Chen, Cris Cecka, Michael Garland, Vinod Grover. _Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, March 2023.
- ["Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search"](https://arxiv.org/abs/2302.01382). Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, Yu Emma Wang. _arXiv_, February 2023.
- ["Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism"](https://dl.acm.org/doi/abs/10.1145/3572848.3577500). Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. _Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming_, February 2023.
- ["Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU"](https://arxiv.org/abs/2301.03598). Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. _arXiv_, January 2023.
## 2022
- ["GPU Load Balancing"](https://arxiv.org/abs/2212.08964). Muhammad Osama. _Doctoral dissertation, University of California, Davis_, December 2022.
- ["Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production"](https://arxiv.org/abs/2211.10017). Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla. _Proceedings of the Third Workshop on Simple and Efficient Natural Language Processing_, December 2022.
- ["Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance"](https://arxiv.org/abs/2110.15238). Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu. _Proceedings of the 5th MLSys Conference_, August 2022.
- ["Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance"](https://arxiv.org/abs/2203.03341). Hiroyuki Ootomo, Rio Yokota. _International Journal of High Performance Computing_, March 2022.
- ["Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads"](https://arxiv.org/abs/2105.05720). Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Sarikivi. _Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, February 2022.
## 2021
- ["Arithmetic-intensity-guided fault tolerance for neural network inference on GPUs"](https://dl.acm.org/doi/abs/10.1145/3458817.3476184). Jack Kosaian, K. V. Rashmi. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2021.
- ["Real-time Neural Radiance Caching for Path Tracing"](https://dl.acm.org/doi/abs/10.1145/3450626.3459812). Thomas Muller, Fabrice Rousselle, Jan Novak, Alex Keller. _ACM Trans. Graph._, August 2021.
## 2020
- ["Scalable Knowledge Graph Analytics at 136 Petaflop/s"](https://www.computer.org/csdl/proceedings-article/sc/2020/999800a061/1oeORDgCM0g). Ramakrishnan Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar, Robert Patton, Richard Vuduc, Thomas Potok. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2020.
- ["Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity
"](https://arxiv.org/abs/2008.13006). Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu. _Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis_, November 2020.
- ["Strassen's Algorithm Reloaded on GPUs"](https://dl.acm.org/doi/10.1145/3372419). Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn. _ACM Transactions on Mathematical Software_, March 2020.
## Copyright
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

726
README.md

@@ -1,219 +1,663 @@
![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
# Overview
# CUTLASS 1.0
# CUTLASS 4.3.0
CUTLASS 1.0 is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
It incorporates strategies for hierarchical decomposition and data movement similar
to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
reusable, modular software components abstracted by C++ template classes. These
thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
and tuned via custom tiling sizes, data types, and other algorithmic policy. The
resulting flexibility simplifies their use as building blocks within custom kernels
and applications.
_CUTLASS 4.3.0 - Oct 2025_
To support a wide variety of applications, CUTLASS provides extensive support for
mixed-precision computations, providing specialized data-movement and
multiply-accumulate abstractions for 8-bit integer, half-precision floating
point (FP16), single-precision floating point (FP32), and double-precision floating
point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting
the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture
and beyond.
CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
and related computations at all levels and scales within CUDA. It incorporates strategies for
hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular
software components and abstractions.
CUTLASS 1.0 has changed substantially from our preview release described in
the [CUTLASS Parallel For All](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda)
post. We have decomposed the structure of the GEMM computation into deeper, structured
primitives for loading data, computing predicate masks, streaming data at each level of
the GEMM hierarchy, and updating the output matrix.
Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned
via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies
their use as building blocks within custom kernels and applications.
CUTLASS 1.0 is described in the [Doxygen documentation](https://github.com/NVIDIA/cutlass/docs)
and our talk at the [GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
CUTLASS has been providing CUDA C++ template abstractions for high-performance linear algebra since 2017 and
these abstractions provide extensive support for a wide range of computations including
mixed-precision computations, specialized data-movement (async copy) and
multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16,
[FP32 emulation via tensor core instruction](https://github.com/NVIDIA/cutlass/tree/main/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm),
8b floating point types (e5m2 and e4m3),
block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8),
narrow integer types (4 and 8b signed and unsigned integers),
and binary 1b data types (where architectures allow for the
native support of such data types) across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.
Overall, we envision CUTLASS DSLs as a family of domain-specific languages. With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low-level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations
targeting the programmable, high-throughput _Tensor Cores_ implemented by
NVIDIA's Ampere, Hopper, and Blackwell architectures.
We believe it will become an indispensable tool for students, researchers, and performance
engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel
designs, and bringing optimized solutions into production.
CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.
To get started quickly, please refer to:
- [CUTLASS C++ Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html).
- [CuTe DSL Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html).
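For a taste of the DSL, here is a minimal sketch in the spirit of the quick start; the exact calls (`cute.jit`, `cute.make_layout`, `cute.printf`) follow the CuTe DSL examples, but treat the precise signatures as illustrative rather than canonical:

```python
# A minimal CuTe DSL sketch: JIT-compile a small function that builds and
# prints a CuTe layout. API names follow the CuTe DSL quick start; treat
# exact signatures as illustrative.
import cutlass.cute as cute

@cute.jit
def layout_demo():
    # Shape (4, 8) with strides (8, 1): a row-major 4x8 layout,
    # exactly as in CuTe C++.
    layout = cute.make_layout((4, 8), stride=(8, 1))
    cute.printf("{}", layout)

layout_demo()
```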
# What's New in CUTLASS 4.3
## CuTe DSL
* Debuggability improvements:
- Added support for source-location tracking in DSL APIs
- Added support for dumping PTX and CUBIN code
* More examples and notebooks to get started with CuTe DSL:
- [Kernel launch with Programmatic Dependent Launch](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/programmatic_dependent_launch.py)
- Improved performance of the [elementwise kernel](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/elementwise_apply.py):
+ Generalized the code to handle a list of input tensors
+ Generalized the TV layout computation to handle different data types
- Demonstrate the new Pipeline APIs in [Blackwell SM100 persistent dense GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py):
+ New Pipeline APIs `PipelineProducer` and `PipelineConsumer` simplify code (no more explicit pipeline state management)
+ Separated epilogue code for the non-TMA and TMA implementations
+ Note that these updates simplify the code, but existing APIs still work and remain supported
- [Basic Blackwell SM100 GEMM with decent performance](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/tutorial_gemm/fp16_gemm_0.py)
+ This simple tutorial achieves 84% of speed-of-light (SOL) performance at M = N = K = 8K
- Reworked [elementwise add notebook](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/elementwise_add.ipynb) with a more detailed explanation of the TV layout
+ Updated the implementation to handle general data types and multiple inputs
+ Rewrote the TV layout explanation in simpler language
+ Added visualization of the TV layout with third-party utils
- [Benchmark and autotune demonstration](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/notebooks/benchmark_autotune.ipynb)
* More examples of authoring peak-performance kernels:
- [Blackwell SM100 mixed-input GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mixed_input_gemm.py)
- [Blackwell SM100 persistent blockwise dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/blockwise_gemm.py)
- [Blackwell SM100 persistent blockwise contiguous grouped dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/contiguous_grouped_gemm.py)
- [Blackwell SM100 persistent blockwise masked grouped dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/blockwise_gemm/masked_grouped_gemm.py)
- [Blackwell SM100 fmha bwd](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/fmha_bwd.py)
- [Blackwell SM100 mla](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/mla.py)
- [Hopper SM90 persistent dense GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/hopper/dense_gemm_persistent.py)
- [Blackwell GeForce batched dense GEMM](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell_geforce/dense_gemm.py)
- [Ampere HSTU Attention](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/ampere/hstu_attention.py)
* API updates:
- Please refer to [DSL API changelog](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_api/changelog.html) for details
* Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in [Blackwell SM100 persistent dense blockscaled GEMM with static scheduling](https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/blackwell/dense_blockscaled_gemm_persistent.py).
- Fixed ``TensorSSA.reduce`` to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand:
- ``make_layout_tv``
- ``is_static``
- ``PipelineAsync``
- ``SmemAllocator``
- Fixed documentation for ``pipeline``, ``utils`` and ``cute.math``
## CUTLASS C++
* Further enhance Blackwell SM100 Attention kernels in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
- Add softmax skip correction.
- Fix a shared memory allocation bug: a kernel must explicitly opt in to maximum dynamic shared memory once its allocation exceeds 48KB.
- Fix a dead-hang issue caused by a warp returning early.
* Add a Ragged Contiguous Grouped GEMM kernel in [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm/).
- This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
* Optimize grouped GEMM kernels by enabling asynchronous TMA descriptor updates.
* Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: [fprop_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/fprop/sm100_conv3d_fprop_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu), [dgrad_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/dgrad/sm100_conv3d_dgrad_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu), [wgrad_streamK](https://github.com/NVIDIA/cutlass/tree/main/test/unit/conv/device_3x/wgrad/sm100_conv2d_wgrad_implicit_gemm_f16_f16_f16_tensorop_f16_streamk.cu).
* Add profiler support for Blackwell SM100 and SM120 blockscaled sparse kernels.
* Fix some kernel issues:
- Fix a race-check issue in Blackwell SM103 kernels by adding the missing elect-one for prefetch barrier initialization.
- Allow users to directly specify the number of stages for the Hopper SM90 mixed-input GEMM.
- Remove warnings caused by the CUDA vector type alignment settings in CUDA 13.
- Remove problematic `cutlass::int8_t` and replace it with `int8_t`.
* Fix some profiler issues:
- Add some missing reference kernels.
- Add calculation of the scale factors for A and B in the `bytes_with_problem_shape` function of the block-scaled profiler.
Note: CUTLASS 4.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
**See the [CHANGELOG](https://docs.nvidia.com/cutlass/latest/CHANGELOG.html) for details of all past releases and updates.**
# Performance
<p align="center"><img src=/media/images/cutlass-performance-plot.png></p>
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
they exhibit performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows CUTLASS performance relative to cuBLAS
for large matrix dimensions (M=10240, N=K=4096) running on an NVIDIA Titan V GPU
when compiled with CUDA 9.2.

When used to construct device-wide GEMM kernels, CUTLASS primitives exhibit nearly optimal
utilization of peak theoretical throughput. The figure below
shows CUTLASS 3.8's performance as a % of theoretical peak utilization
on various input and output data types when run on an NVIDIA Blackwell SM100 architecture GPU.
![ALT](media/images/cutlass-3.8-blackwell-gemm-peak-performance.svg "")
The two figures below show the continual CUTLASS performance improvements
on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since
CUTLASS 3.1.
CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads).
Tensor Core operations are implemented using CUDA's
[mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and
[wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions.
![ALT](media/images/cutlass-3.5.1-gemm-peak-performance.png "")
![ALT](media/images/cutlass-3.5.1-gemm-peak-performance-fp8.png "")
# CuTe
CUTLASS 3.0 introduced a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts
which can be composed with data arrays to represent tensors.
The representation of layouts is powerful enough to represent nearly
everything we need to implement efficient dense linear algebra.
Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
CUTLASS 3.0 and beyond adopts CuTe throughout the GEMM hierarchy in its templates.
This greatly simplifies the design and improves code composability and readability.
More documentation specific to CuTe can be found in its
[dedicated documentation directory](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/cute/00_quickstart.html).
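For a concrete sense of these abstractions, the following host-side sketch (ours, not a snippet from the CuTe docs) builds a layout and a tensor over it when compiled with nvcc against `include/`; `make_layout`, `make_shape`, `make_stride`, and `make_tensor` are the public CuTe vocabulary, while the specific shapes and values are illustrative:

```c++
// Host-side sketch of CuTe's core vocabulary: a Layout is a function from
// logical coordinates to offsets, and a Tensor pairs a Layout with data.
#include <cute/tensor.hpp>
#include <cstdio>

int main() {
  using namespace cute;

  // A 4x8 row-major layout: shape (4,8), stride (8,1).
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<8>{}, Int<1>{}));

  // Layouts map coordinates to offsets: (2,3) -> 2*8 + 3*1 = 19.
  printf("offset of (2,3) = %d\n", int(layout(2, 3)));

  // A Tensor composes a data pointer with the layout and performs the
  // indexing arithmetic for the user.
  float data[32];
  auto t = make_tensor(&data[0], layout);
  t(2, 3) = 1.0f;

  return 0;
}
```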
# Compatibility
CUTLASS requires CUDA 9 and performs best with the [CUDA 9.2 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later.
Minimum requirements:
- Architecture: Volta (compute capability 7.0)
- Compiler: Must support at least C++17
- CUDA Toolkit version: 11.4
CUTLASS requires a C++17 host compiler and
performs best when built with the [**CUDA 12.8 Toolkit**](https://developer.nvidia.com/cuda-downloads).
It is also compatible with CUDA 11.4, CUDA 11.5, CUDA 11.6, CUDA 11.7, CUDA 11.8, and all other CUDA 12.x versions.
## Operating Systems
We have tested the following environments.
|**Operating System** | **Compiler** |
|-----------------|----------|
| Windows 10 | Microsoft Visual Studio 2015|
| | Microsoft Visual Studio 2017|
| Ubuntu 14.04 | GCC 4.8.2 |
| Ubuntu 16.04 | GCC 5.4.0 |
| Ubuntu 18.04 | GCC 7.5.0 |
| Ubuntu 20.04 | GCC 10.3.0 |
| Ubuntu 22.04 | GCC 11.2.0 |
Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on
any Maxwell-, Pascal-, or Volta-architecture NVIDIA GPU.
Note: CUTLASS 3.x builds are known to fail on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
|**GPU**|
|---|
|NVIDIA GeForce 1080|
|NVIDIA TitanXP|
|NVIDIA Tesla P100|
|NVIDIA Tesla V100|
|NVIDIA TitanV|
## Hardware
CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on Volta-, Turing-, Ampere-, Ada-, and Hopper-architecture NVIDIA GPUs.
|**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit Required by CUTLASS-3**|
|---|---|---|
|NVIDIA V100 Tensor Core GPU |7.0|11.4|
|NVIDIA TitanV |7.0|11.4|
|NVIDIA GeForce RTX 20x0 series |7.5|11.4|
|NVIDIA T4 |7.5|11.4|
|NVIDIA A100 Tensor Core GPU |8.0|11.4|
|NVIDIA A10 |8.6|11.4|
|NVIDIA GeForce RTX 30x0 series |8.6|11.4|
|NVIDIA GeForce RTX 40x0 series |8.9|11.8|
|NVIDIA L40 |8.9|11.8|
|NVIDIA H100 Tensor Core GPU |9.0|11.8|
|NVIDIA H200 Tensor Core GPU |9.0|11.8|
|NVIDIA B200 Tensor Core GPU |10.0|12.8|
|NVIDIA B300 Tensor Core GPU |10.3|13.0|
|NVIDIA DRIVE Thor |11.0|13.0|
|NVIDIA GeForce RTX 50x0 series |12.0|12.8|
|NVIDIA DGX Spark |12.1|13.0|
## Target Architecture
In general, PTX code generated for one target architecture can be run on future architectures
(i.e., it is forward compatible).
However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose
PTX does not have forward compatibility guarantees.
Several Hopper and Blackwell PTX instructions fall under this category of
architecture-accelerated features, and thus require an `sm_90a` or `sm_100a` target architecture
(note the "a" appended). For more details on this and other architecture-accelerated instructions,
please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability).
The target architecture information is passed on to CUTLASS via the cmake flag
`CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100,
users are required to build CUTLASS with `90a` as the target architecture.
If a user accidentally builds a kernel which uses SM90a features
(e.g. Hopper Tensor Core Instructions), using the SM90 target
(note the lack of "a"), with either CUDA Toolkit 12 or 11.8,
the kernel is expected to fail with a runtime error.
```
cmake .. -DCUTLASS_NVCC_ARCHS="90a"
```
Or
```
cmake .. -DCUTLASS_NVCC_ARCHS="100a"
```
Note: The NVIDIA Blackwell SM100 architecture used in the datacenter
products has a different compute capability than the one underpinning
NVIDIA Blackwell GeForce RTX 50 series GPUs (SM120). As a result, kernels
compiled for Blackwell SM100 architecture with arch conditional features
(using `sm100a`) are not compatible with RTX 50 series GPUs.
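For example, a build that must cover Hopper plus both datacenter Blackwell (SM100) and GeForce Blackwell (SM120) parts can list several architectures at once; the semicolon-delimited form below is a sketch, not a line from this README:

```bash
cmake .. -DCUTLASS_NVCC_ARCHS="90a;100a;120a"
```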
Please refer to the [functionality documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/functionality.html)
for details on which kernels require which target architectures.
# Documentation
CUTLASS is described in the following documents and the accompanying
[Doxygen documentation](https://nvidia.github.io/cutlass).
- [Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html) - basics of building and running CUTLASS
- [Functionality](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/functionality.html) - summarizes functionality available in CUTLASS
- [Efficient GEMM in CUDA](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/efficient_gemm.html) - describes how GEMM kernels may be implemented efficiently in CUDA
- [CUTLASS 3.x Design](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/cutlass_3x_design.html) - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
- [GEMM API 3.x](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/gemm_api_3x.html) - describes the CUTLASS 3.x GEMM model and C++ template concepts
- [GEMM API 2.x](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/gemm_api.html) - describes the CUTLASS 2.x GEMM model and C++ template concepts
- [Implicit GEMM Convolution](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/implicit_gemm_convolution.html) - describes 2-D and 3-D convolution in CUTLASS
- [Code Organization](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/code_organization.html) - describes the organization and contents of the CUTLASS project
- [Terminology](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/terminology.html) - describes terms used in the code
- [Programming Guidelines](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/programming_guidelines.html) - guidelines for writing efficient modern CUDA C++
- [Fundamental types](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/fundamental_types.html) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- [Layouts](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/layout.html) - describes layouts of matrices and tensors in memory
- [Tile Iterators](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/tile_iterator_concept.html) - describes C++ concepts for iterating over tiles of matrices in memory
- [CUTLASS Profiler](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html) - command-line driven profiling application
- [CUTLASS Utilities](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/utilities.html) - additional templates used to facilitate rapid development
- [Dependent kernel launch](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/dependent_kernel_launch.html) - describes a new feature in Hopper which allows overlapping dependent
kernels in the same stream, and how it is used in CUTLASS.
# Resources
We have also described the structure of an efficient GEMM in our talk at the
[GPU Technology Conference 2018](http://on-demand.gputechconf.com/gtc/2018/presentation/s8854-cutlass-software-primitives-for-dense-linear-algebra-at-all-levels-and-scales-within-cuda.pdf).
- [CUTLASS: Software Primitives for Dense Linear Algebra at All Levels and Scales within CUDA](https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-s8854/)
- [Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100](https://www.nvidia.com/en-us/on-demand/session/gtcsj20-s21745/)
- [Accelerating Convolution with Tensor Cores in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31883/)
- [Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41996/)
- [CUTLASS: Python API, Enhancements, and NVIDIA Hopper](https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41131/)
# Building CUTLASS
CUTLASS is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS's `include/` directory in their include
paths. However, we distribute extensive unit tests and utility programs to demonstrate
CUTLASS. These instructions are for building those test programs.
CUTLASS's unit tests depend on Google Test which exists as a git submodule. You can fetch
submodules as follows.
CUTLASS unit tests, examples, and utilities can be built with CMake.
The minimum version of CMake is given in the [Quickstart guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html).
Make sure the `CUDACXX` environment variable points to NVCC in the CUDA Toolkit installed
on your system.
```
$ git submodule update --init --recursive
```

```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
```
CUTLASS can be built with CMake starting with version 3.10. Create a build directory
within the CUTLASS project, then run CMake. By default CUTLASS will build kernels
for CUDA architecture versions 5.0, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, and 9.0.
To reduce compile time you can specify
the architectures to build CUTLASS for by changing the CMake configuration setting
`CUTLASS_NVCC_ARCHS`.
```bash
$ mkdir build && cd build
$ cmake ..                             # configures the default architecture set
$ cmake .. -DCUTLASS_NVCC_ARCHS=80     # compiles for NVIDIA's Ampere Architecture
```
Compile the CUTLASS project by running Make. Include the -j argument to compile sources in
parallel and speed up the build process.
From the `build/` directory, compile and run the CUTLASS unit tests by building the target `test_unit` with make.
```
$ make -j12
...
$
```
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS,
and they may be executed in parallel via make's `-j` command line argument.
Verify CUTLASS has been built correctly by running the unit tests from the build/ directory.
```
$ ./tools/test/unit/cutlass_unit_test
...
[----------] Global test environment tear-down
[==========] 481 tests from 24 test cases ran. (5954 ms total)
[ PASSED ] 481 tests.
```

```bash
$ make test_unit -j
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[ PASSED ] 946 tests.
```
All tests should pass on supported platforms, though the exact number of tests may vary over time.
# Project Structure
CUTLASS is arranged as a header-only library with several example test programs
that demonstrate instantiating a GEMM task within a CUDA kernel. The Doxygen documentation
provides a complete list of files, classes, and template concepts defined in the CUTLASS
project. A brief summary is described below.
CUTLASS is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
[Doxygen documentation](https://nvidia.github.io/cutlass) provides a complete list of files, classes,
and template concepts defined in the CUTLASS project.
The CUTLASS library is defined in the cutlass/ directory and consists of CUDA C++ template
classes and other definitions for implementing efficient GPU GEMM kernels. A set of core
classes and templates define basic primitives that are then applied to compute GEMM via
templates in the cutlass/gemm directory.
A detailed explanation of the source code organization may be found in the
[CUTLASS documentation](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/code_organization.html), but several main components are summarized below.
## CUTLASS Template Library
```
cutlass/
  gemm/
  util/
  <core API components>
```

```
include/                     # client applications should target this directory in their build's include paths
  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only
    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
    conv/                    # code specialized for convolution
    epilogue/                # code specialized for the epilogue of gemm/convolution
    gemm/                    # code specialized for general matrix product computations
    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
    platform/                # CUDA-capable Standard Library components
    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" model
    thread/                  # simt code that can be performed within a CUDA thread
    transform/               # code specialized for layout, type, and domain transformations
    *                        # core vocabulary types, containers, and basic numeric operations

  cute/                      # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy
    algorithm/               # Definitions of core operations such as copy, gemm, and operations on cute::tuples
    arch/                    # Bare bones PTX wrapper structs for copy and math instructions
    atom/                    # Meta-information either link to or built from arch/ operators
      mma_atom.hpp           # cute::Mma_Atom and cute::TiledMma
      copy_atom.hpp          # cute::Copy_Atom and cute::TiledCopy
      *sm*.hpp               # Arch specific meta-information for copy and math operations
    *                        # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
```
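As a sketch of what a client translation unit targeting `include/` looks like, the following is modeled on `examples/00_basic_gemm`; the problem size, leading dimensions, and variable names here are illustrative, and device memory is left uninitialized since this only shows the shape of the API:

```c++
// Minimal client sketch: instantiate and launch a device-wide SGEMM using the
// header-only library (mirrors examples/00_basic_gemm; values illustrative).
#include <cuda_runtime.h>
#include "cutlass/gemm/device/gemm.h"

int main() {
  // Single-precision GEMM, all operands column-major (the 2.x device API).
  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::ColumnMajor,   // A
      float, cutlass::layout::ColumnMajor,   // B
      float, cutlass::layout::ColumnMajor>;  // C

  int M = 128, N = 128, K = 128;
  float *A, *B, *C;
  cudaMalloc(&A, sizeof(float) * M * K);
  cudaMalloc(&B, sizeof(float) * K * N);
  cudaMalloc(&C, sizeof(float) * M * N);

  Gemm gemm_op;
  // Arguments: problem size, TensorRefs {pointer, leading dimension} for
  // A/B/C/D, and epilogue scalars {alpha, beta}.
  cutlass::Status status = gemm_op({{M, N, K},
                                    {A, M}, {B, K}, {C, M}, {C, M},
                                    {1.0f, 0.0f}});
  return status == cutlass::Status::kSuccess ? 0 : 1;
}
```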
Several tools and test programs are also distributed with the CUTLASS library. They are
contained in the following directories.
### CUTLASS SDK Examples
[CUTLASS SDK examples](https://github.com/NVIDIA/cutlass/tree/main/examples) apply CUTLASS templates to implement basic computations.
### Tools
```
tools/
  test/
    unit/
      core/
      gemm/
    perf/
  util/
    <utilities>
```

```
tools/
  library/                   # CUTLASS Instance Library - contains instantiations of all supported CUTLASS templates
    include/
      cutlass/
        library/

  profiler/                  # CUTLASS Profiler - command-line utility for executing operations in the
                             # CUTLASS Library

  util/                      # CUTLASS Utilities - contains numerous helper classes for
    include/                 # managing tensors in device memory, reference
      cutlass/               # implementations for GEMM, random initialization
        util/                # of tensors, and I/O.
```
### Test
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
Instructions for building and running the Unit tests are described in the [Quickstart guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html).
# Performance Profiling
The `test/perf/` directory contains a command-line utility for launching each of the GEMM kernels.
Its usage is shown below.

The `tools/profiler/` directory contains a command-line utility for launching each of the GEMM kernels.
It can be built as follows:
```bash
$ make cutlass_profiler -j16
```
## Building all GEMM and Convolution kernels (_long_ build times)
By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all, set the following environment variable when running CMake from an empty `build/` directory.
Beware: this results in *tens of thousands* of kernels and long build times.
It also produces a large binary and, on some platforms, causes the linker to fail when building the library.
Therefore, it's highly recommended to generate only a subset of kernels, as demonstrated in the sub-section below.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=all
...
$ make cutlass_profiler -j16
```
## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with
wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
or a subset of kernels for the NVIDIA Ampere and Turing architectures:
### Building a subset of Tensor Core GEMM kernels
To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
use the cmake command line below:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
```bash
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 118489088 bytes
FLOPs: 115992428544 flops
Runtime: 1.55948 ms
Memory: 70.7616 GiB/s
Math: 74378.8 GFLOP/s
=============================
...
```
### Building one CUDA Core GEMM kernel
To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the cmake command line below:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16
```
Example command line for profiling a single SGEMM CUDA kernel is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
Status: Success
Verification: ON
Disposition: Passed
cuBLAS: Passed
Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 180355072 bytes
FLOPs: 115992428544 flops
Runtime: 6.73655 ms
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
=============================
```
### Building a subset of Tensor Core Convolution kernels
To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the cmake command line below:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
Bytes: 1130659840 bytes
FLOPs: 118482796544 flops
Runtime: 0.711496 ms
Memory: 1479.99 GiB/s
Math: 166526 GFLOP/s
=============================
...
```
### Building one Convolution CUDA kernel
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the cmake command line below:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16
```
Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
Bytes: 2055798784 bytes
FLOPs: 118482796544 flops
Runtime: 7.34266 ms
Memory: 260.752 GiB/s
Math: 16136.2 GFLOP/s
=============================
```
cutlass_perf_test [options]
--help
--append=<true|false*> If true, appends output to existing CSV file. If false, overwrites.
--alpha=<alpha> Value for alpha to be used in GEMM experiments
--beta=<beta> Value for beta to be used in GEMM experiments
--dist=<distribution> Describes the random distribution of each of the input matrix operands.
--execution_mode=<mode> Specifies execution mode: profile, verify, single
--output=<filename.csv> Writes summary of profiling to specified .csv file
--iterations=<timing iterations> maximum number of iterations to execute when profiling
--m=<height>[:max height[:step]] Height of GEMM problem (number of rows of C). May specify a range with optional step size.
--n=<width>[:max width[:step]] Width of GEMM problem (number of columns of C). May specify a range with optional step size.
--k=<depth>[:max depth[:step]] Size of inner dimension of A and B. May specify a range with optional step size.
--kernels=<{s|d|h|i|wmma}gemm_{nn,nt,tn,tt}> Select GEMM datatype and layout to use for tests
--peak=<bool> If true, only reports peak performance per kernel after profiling specified problem space.
--save_workspace={*never,incorrect,always} Specifies when to save the GEMM inputs and results to the filesystem.
--seed=<seed> Random seed used by the random number generator in initializing input matrices.
--tags=<column:tag,...> Inserts leading columns in output table and uniform values for each column.
## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
- [GEMM CMake Examples](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html#gemm-cmake-examples)
- [Implicit GEMM convolution CMake Examples](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/profiler.html)
Example usage:
# Runs one problem size for all kernels
$ ./tools/test/perf/cutlass_perf_test --m=10240 --n=1024 --k=1024
# Varies GEMM K dimension for SGEMM and IGEMM with column-major multiplicands
$ ./tools/test/perf/cutlass_perf_test --m=10240 --n=4096 --k=1024:8192:128 --kernels=sgemm_nn,igemm_nn
```
# About
CUTLASS is released by NVIDIA Corporation as Open Source software under the
[3-clause "New" BSD license](LICENSE.txt).
# Contributors
The official list of CUTLASS developers and contributors is available here: [CONTRIBUTORS](CONTRIBUTORS.md).
# Copyright
Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
to endorse or promote products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```

bin2hex.cmake Normal file

@ -0,0 +1,54 @@
# Copyright (c) 2019 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# A small utility function which generates a C-header from an input file
function(FILE_TO_C_STRING FILENAME VARIABLE_NAME OUTPUT_STRING ZERO_TERMINATED)
  # Read the whole input file as a single hex string (two hex digits per byte).
  file(READ "${FILENAME}" HEX_INPUT HEX)
  if (${ZERO_TERMINATED})
    string(APPEND HEX_INPUT "00")  # append a NUL byte so the string is C-terminated
  endif()
  # Break the hex string into lines of two bytes each, then rewrite every
  # byte as a "char(0xXX)," array initializer.
  string(REGEX REPLACE "(....)" "\\1\n" HEX_OUTPUT ${HEX_INPUT})
  string(REGEX REPLACE "([0-9a-f][0-9a-f])" "char(0x\\1)," HEX_OUTPUT ${HEX_OUTPUT})
  set(HEX_OUTPUT "static char const ${VARIABLE_NAME}[] = {\n ${HEX_OUTPUT}\n};\n")
  set(${OUTPUT_STRING} "${HEX_OUTPUT}" PARENT_SCOPE)
endfunction()
# message("Create header file for ${FILE_IN}")
# message("Create header file for ${FILE_OUT}")
file_to_c_string(${FILE_IN} ${VARIABLE_NAME} OUTPUT_STRING ZERO_TERMINATED)
set(RESULT "#pragma once\n")
string(APPEND RESULT "namespace cutlass {\n")
string(APPEND RESULT "namespace nvrtc {\n")
string(APPEND RESULT "${OUTPUT_STRING}")
string(APPEND RESULT "} // namespace nvrtc\n")
string(APPEND RESULT "} // namespace cutlass\n")
file(WRITE "${FILE_OUT}" "${RESULT}")
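The script above reads `FILE_IN`, `FILE_OUT`, `VARIABLE_NAME`, and (indirectly) `ZERO_TERMINATED` from the caller, so it is presumably driven in CMake script mode; a hypothetical invocation (all names and paths illustrative) might look like:

```bash
# Hypothetical script-mode invocation of bin2hex.cmake; the -D names are the
# variables the script dereferences, the file names are made up.
cmake -DFILE_IN=kernel.cu -DFILE_OUT=kernel.hpp \
      -DVARIABLE_NAME=kKernelSource -DZERO_TERMINATED=ON -P bin2hex.cmake
```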


@ -1,17 +0,0 @@
#!/bin/bash
set -e
function formatFiles {
    for f in `find "$1" -type f -name "*.$2"` ; do
        COMMAND="clang-format -i $f"
        echo $COMMAND
        $COMMAND
    done
}
formatFiles "cutlass" "h"
formatFiles "tools/test" "h"
formatFiles "tools/test" "cpp"
formatFiles "tools/util" "h"


@ -0,0 +1,52 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Generated file
set(TEST_SETS_SUPPORTED @TEST_SETS_SUPPORTED@)
if (NOT DEFINED ENV{CUTLASS_TEST_SETS})
set(ENV{CUTLASS_TEST_SETS} @CUTLASS_DEFAULT_ACTIVE_TEST_SETS@)
endif()
foreach(TEST_SET_REQUESTED IN ITEMS $ENV{CUTLASS_TEST_SETS})
if (NOT TEST_SET_REQUESTED IN_LIST TEST_SETS_SUPPORTED)
message(STATUS "Skipping tests for @TEST_EXE_PATH@ as ${TEST_SET_REQUESTED} is not in the set of [${TEST_SETS_SUPPORTED}].")
return()
endif()
endforeach()
set(TEST_EXE_PATH @TEST_EXE_PATH@)
set(TEST_EXE_WORKING_DIRECTORY @TEST_EXE_WORKING_DIRECTORY@)
set(CUTLASS_USE_EXTENDED_ADD_TEST_FORMAT @TEST_USE_EXTENDED_FORMAT@)
if (DEFINED ENV{CUTLASS_TEST_EXECUTION_ENVIRONMENT})
set(_CUTLASS_TEST_EXECUTION_ENVIRONMENT $ENV{CUTLASS_TEST_EXECUTION_ENVIRONMENT})
else()
set(_CUTLASS_TEST_EXECUTION_ENVIRONMENT @CUTLASS_TEST_EXECUTION_ENVIRONMENT@)
endif()


@ -0,0 +1,43 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if (CUTLASS_USE_EXTENDED_ADD_TEST_FORMAT)
# The longform/extended format allows generator expressions to be
# expanded properly and is useful in contexts where the files need
# to be immediately included into the CMake code being processed.
add_test(NAME @TESTCASE_NAME@ COMMAND ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@)
else()
add_test(@TESTCASE_NAME@ ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@)
endif()
if (TEST_EXE_WORKING_DIRECTORY)
set_tests_properties(@TESTCASE_NAME@ PROPERTIES WORKING_DIRECTORY "${TEST_EXE_WORKING_DIRECTORY}")
endif()
set_tests_properties(@TESTCASE_NAME@ PROPERTIES DISABLED @__DISABLE_TESTS@)


@ -0,0 +1,9 @@
get_filename_component(NvidiaCutlass_CMAKE_DIR "${CMAKE_CURRENT_LIST_FILE}" PATH)
include(CMakeFindDependencyMacro)
if(TARGET nvidia::cutlass::CUTLASS)
return()
endif()
include("${NvidiaCutlass_CMAKE_DIR}/NvidiaCutlassTargets.cmake")


@ -0,0 +1,42 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
set(CPACK_PACKAGE_NAME NvidiaCutlass)
set(CPACK_PACKAGE_VENDOR NVIDIA)
set(CPACK_PACKAGE_CONTACT info@nvidia.com)
set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "CUTLASS CUDA C++ Template Linear Algebra Library")
set(CPACK_PACKAGE_INSTALL_DIRECTORY ${CPACK_PACKAGE_NAME})
set(CPACK_PACKAGE_VERSION_MAJOR ${PROJECT_VERSION_MAJOR})
set(CPACK_PACKAGE_VERSION_MINOR ${PROJECT_VERSION_MINOR})
set(CPACK_PACKAGE_VERSION_PATCH ${PROJECT_VERSION_PATCH})
set(CPACK_VERBATIM_VARIABLES YES)
# set(CPACK_PACKAGE_DESCRIPTION_FILE ${CMAKE_CURRENT_LIST_DIR}/Description.txt)
# set(CPACK_RESOURCE_FILE_WELCOME ${CMAKE_CURRENT_LIST_DIR}/Welcome.txt)
# set(CPACK_RESOURCE_FILE_LICENSE ${CMAKE_CURRENT_LIST_DIR}/License.txt)
# set(CPACK_RESOURCE_FILE_README ${CMAKE_CURRENT_LIST_DIR}/Readme.txt)
include(CPack)

cmake/googletest.cmake Normal file

@ -0,0 +1,52 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
include(FetchContent)
set(GOOGLETEST_DIR "" CACHE STRING "Location of local GoogleTest repo to build against")
if(GOOGLETEST_DIR)
set(FETCHCONTENT_SOURCE_DIR_GOOGLETEST ${GOOGLETEST_DIR} CACHE STRING "GoogleTest source directory override")
endif()
set(GTEST_REPOSITORY "https://github.com/google/googletest.git" CACHE STRING "GoogleTest repo to fetch")
FetchContent_Declare(
googletest
GIT_REPOSITORY ${GTEST_REPOSITORY}
GIT_TAG v1.14.0
)
FetchContent_GetProperties(googletest)
if(NOT googletest_POPULATED)
FetchContent_Populate(googletest)
if (MSVC)
set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
endif()
add_subdirectory(${googletest_SOURCE_DIR} ${googletest_BINARY_DIR} EXCLUDE_FROM_ALL)
endif()
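To build against a local GoogleTest checkout instead of fetching over the network, the `GOOGLETEST_DIR` cache variable defined above can be set at configure time (path below is illustrative):

```bash
# Use an existing local GoogleTest clone rather than fetching v1.14.0.
cmake .. -DGOOGLETEST_DIR=/path/to/googletest
```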

cmake/nop.cu Normal file

@ -0,0 +1,49 @@
/***************************************************************************************************
* Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Basic CUDA file for testing compiler flags.
*/
__device__ int inner()
{
  return -1;
}

__global__ void test()
{
  inner();
}

int main()
{
  test<<<1,1>>>();
  return 0;
}


@ -0,0 +1,34 @@
/***************************************************************************************************
* Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* 1. Redistributions of source code must retain the above copyright notice, this
* list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* 3. Neither the name of the copyright holder nor the names of its
* contributors may be used to endorse or promote products derived from
* this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
#define CUTLASS_BUILD @CUTLASS_VERSION_BUILD@
#define CUTLASS_REVISION "@CUTLASS_REVISION@"

cuBLAS.cmake Normal file

@ -0,0 +1,152 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
message(STATUS "Configuring cublas ...")
if((DEFINED CUTLASS_ENABLE_CUBLAS AND NOT CUTLASS_ENABLE_CUBLAS) OR
(DEFINED CUBLAS_ENABLED AND NOT CUBLAS_ENABLED))
# Don't add cuBLAS if it's defined and false, assume it's not found.
set(CUBLAS_FOUND OFF)
message(STATUS "cuBLAS Disabled.")
elseif(NOT TARGET cublas)
find_path(
_CUBLAS_INCLUDE_DIR
NAMES cublas_v2.h
HINTS
${CUBLAS_INCLUDE_PATH}
ENV CUBLAS_INCLUDE_PATH
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
include
)
find_library(
_CUBLAS_LIBRARY
NAMES cublas
HINTS
${CUBLAS_LIBRARY_PATH}
ENV CUBLAS_LIBRARY_PATH
${_CUBLAS_INCLUDE_DIR}/..
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib64
lib/x64
lib
)
if(_CUBLAS_INCLUDE_DIR AND _CUBLAS_LIBRARY)
message(STATUS "cuBLAS: ${_CUBLAS_LIBRARY}")
message(STATUS "cuBLAS: ${_CUBLAS_INCLUDE_DIR}")
set(CUBLAS_FOUND ON CACHE INTERNAL "cublas Library Found")
set(CUBLAS_LIBRARY ${_CUBLAS_LIBRARY})
set(CUBLAS_INCLUDE_DIR ${_CUBLAS_INCLUDE_DIR})
else()
message(STATUS "cublas not found.")
set(CUBLAS_FOUND OFF CACHE INTERNAL "cublas Library Found")
endif()
endif()
set(CUTLASS_ENABLE_CUBLAS ${CUBLAS_FOUND} CACHE BOOL "Enable CUTLASS to build with cuBLAS library.")
if(CUTLASS_ENABLE_CUBLAS AND NOT CUBLAS_FOUND)
message(FATAL_ERROR "CUTLASS_ENABLE_CUBLAS enabled but cuBLAS library could not be found.")
endif()
if(CUTLASS_ENABLE_CUBLAS AND NOT TARGET cublas)
if(WIN32)
add_library(cublas STATIC IMPORTED GLOBAL)
else()
add_library(cublas SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::cublas ALIAS cublas)
set_property(
TARGET cublas
PROPERTY IMPORTED_LOCATION
${CUBLAS_LIBRARY})
target_include_directories(
cublas
INTERFACE
$<INSTALL_INTERFACE:include>
$<BUILD_INTERFACE:${CUBLAS_INCLUDE_DIR}>)
find_library(
_CUBLASLT_LIBRARY
NAMES cublasLt
HINTS
${CUBLAS_LIBRARY_PATH}
ENV CUBLAS_LIBRARY_PATH
${_CUBLAS_INCLUDE_DIR}/..
${CUBLAS_PATH}
ENV CUBLAS_PATH
${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES
lib64
lib/x64
lib
)
if(_CUBLASLT_LIBRARY AND NOT TARGET cublasLt)
if(WIN32)
add_library(cublasLt STATIC IMPORTED GLOBAL)
else()
add_library(cublasLt SHARED IMPORTED GLOBAL)
endif()
set_property(
TARGET cublasLt
PROPERTY IMPORTED_LOCATION
${_CUBLASLT_LIBRARY})
add_library(nvidia::cublasLt ALIAS cublasLt)
target_link_libraries(cublas INTERFACE cublasLt)
endif()
endif()
message(STATUS "Configuring cuBLAS ... done.")

cuDNN.cmake Normal file (112 lines)

@@ -0,0 +1,112 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
if(DEFINED CUDNN_ENABLED)
set(CUTLASS_ENABLE_CUDNN ${CUDNN_ENABLED} CACHE BOOL "Enable CUTLASS to build with cuDNN library.")
endif()
if(DEFINED CUTLASS_ENABLE_CUDNN AND NOT CUTLASS_ENABLE_CUDNN)
return()
endif()
message(STATUS "Configuring cuDNN ...")
find_path(
_CUDNN_INCLUDE_DIR cudnn.h
PATHS
${CUDA_TOOLKIT_ROOT_DIR}/include
$ENV{CUDNN_PATH}/include
$ENV{CUDA_PATH}/include
${CUDNN_PATH}/include
/usr/include)
find_library(
_CUDNN_LIBRARY cudnn
HINTS
${CUDA_TOOLKIT_ROOT_DIR}/lib64
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib
$ENV{CUDNN_PATH}/lib64
$ENV{CUDNN_PATH}/lib/x64
$ENV{CUDNN_PATH}/lib
$ENV{CUDA_PATH}/lib64
$ENV{CUDA_PATH}/lib/x64
$ENV{CUDA_PATH}/lib
${CUDNN_PATH}/lib64
${CUDNN_PATH}/lib/x64
${CUDNN_PATH}/lib
/usr/lib/x86_64-linux-gnu
/usr/lib)
if(_CUDNN_INCLUDE_DIR AND _CUDNN_LIBRARY)
message(STATUS "cuDNN: ${_CUDNN_LIBRARY}")
message(STATUS "cuDNN: ${_CUDNN_INCLUDE_DIR}")
set(CUDNN_FOUND ON CACHE INTERNAL "cuDNN Library Found")
else()
message(STATUS "cuDNN not found.")
set(CUDNN_FOUND OFF CACHE INTERNAL "cuDNN Library Found")
endif()
set(CUTLASS_ENABLE_CUDNN ${CUDNN_FOUND} CACHE BOOL "Enable CUTLASS to build with cuDNN library.")
if (CUTLASS_ENABLE_CUDNN AND NOT TARGET cudnn)
set(CUDNN_INCLUDE_DIR ${_CUDNN_INCLUDE_DIR})
set(CUDNN_LIBRARY ${_CUDNN_LIBRARY})
if(WIN32)
add_library(cudnn STATIC IMPORTED GLOBAL)
else()
add_library(cudnn SHARED IMPORTED GLOBAL)
endif()
add_library(nvidia::cudnn ALIAS cudnn)
set_property(
TARGET cudnn
PROPERTY IMPORTED_LOCATION
${CUDNN_LIBRARY})
target_include_directories(
cudnn
INTERFACE
$<INSTALL_INTERFACE:include>
$<BUILD_INTERFACE:${CUDNN_INCLUDE_DIR}>)
endif()
if(CUTLASS_ENABLE_CUDNN AND NOT CUDNN_FOUND)
message(FATAL_ERROR "CUTLASS_ENABLE_CUDNN enabled but cuDNN library could not be found.")
endif()
message(STATUS "Configuring cuDNN ... done.")

customConfigs.cmake Normal file (98 lines)

@@ -0,0 +1,98 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Profiler based functional testing
set(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS OFF CACHE BOOL "Utilize profiler-based functional regressions")
set(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL ${CUTLASS_TEST_LEVEL} CACHE STRING "Profiler functional regression test level")
find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)
function(cutlass_generate_kernel_filter_and_testlist_files)
set(options)
set(oneValueArgs TEST_SET_NAME)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
execute_process(
COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CUTLASS_LIBRARY_PACKAGE_DIR}
${Python3_EXECUTABLE} ${CUTLASS_SOURCE_DIR}/python/cutlass_library/generator.py
--generator-target=${__TEST_SET_NAME}
--cuda-version=${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}
--architectures=${CUTLASS_NVCC_ARCHS}
--kernels=\*
--disable-cutlass-package-imports
WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
RESULT_VARIABLE cutlass_FILTER_GENERATION_RESULT
OUTPUT_VARIABLE cutlass_FILTER_GENERATION_OUTPUT
OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
ERROR_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
)
if(NOT cutlass_FILTER_GENERATION_RESULT EQUAL 0)
message(FATAL_ERROR "Error generating kernel filters and testlist files. See ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log")
endif()
endfunction()
if(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS)
set(PROFILER_ARCH_LIST 100a 100f 103a 120a 120f 121a)
if (CUDA_VERSION VERSION_LESS 13.0)
list(APPEND PROFILER_ARCH_LIST 101a 101f)
else()
list(APPEND PROFILER_ARCH_LIST 110a 110f)
endif()
foreach(ARCH IN LISTS CUTLASS_NVCC_ARCHS)
if(NOT (ARCH IN_LIST PROFILER_ARCH_LIST))
message(FATAL_ERROR "Only SM${PROFILER_ARCH_LIST} compute capabilities are supported with profiler-based unit tests")
endif()
endforeach()
if(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 0)
message(STATUS "Building for L0 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlist_files(TEST_SET_NAME kernel_testlist_l0)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
elseif (CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 1)
message(STATUS "Building for L1 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlist_files(TEST_SET_NAME kernel_testlist_l1)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
endif()
endif()


@@ -1,102 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*!
\file
\brief Defines conversion operations among Fragments of different base type.
*/
#pragma once
#include <cutlass/fragment.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename InputFragment_, typename OutputFragment_>
struct Convert {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename InputScalar_, typename OutputScalar_, int kScalars_>
struct Convert<Fragment<InputScalar_, kScalars_>, Fragment<OutputScalar_, kScalars_> > {
/// The input fragment.
typedef Fragment<InputScalar_, kScalars_> InputFragment;
/// The output fragment.
typedef Fragment<OutputScalar_, kScalars_> OutputFragment;
/// Ctor.
CUTLASS_DEVICE Convert() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
transform(src, 0, dst);
}
/// Transform a fragment.
template <typename Fragment_>
CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
for (int i = 0; i < kScalars_; ++i) {
dst[i] = static_cast<OutputScalar_>(src[i + offset]);
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Fragment_>
struct Copy {
/// The input fragment.
typedef Fragment_ InputFragment;
/// The output fragment.
typedef Fragment_ OutputFragment;
/// Ctor.
CUTLASS_DEVICE Copy() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(Fragment_ const& src, Fragment_& dst) { transform(src, 0, dst); }
/// Transform a fragment.
template <typename InputFragment_>
CUTLASS_DEVICE void transform(InputFragment_ const& src, int offset, Fragment_& dst) {
if (sizeof(typename Fragment_::Element) == 8) {
uint64_t const* src_ptr = reinterpret_cast<uint64_t const*>(&src[offset]);
uint64_t* dst_ptr = reinterpret_cast<uint64_t*>(&dst[0]);
for (int i = 0; i < sizeof(Fragment_) / 8; ++i) {
dst_ptr[i] = src_ptr[i];
}
} else {
uint32_t const* src_ptr = reinterpret_cast<uint32_t const*>(&src[offset]);
uint32_t* dst_ptr = reinterpret_cast<uint32_t*>(&dst[0]);
for (int i = 0; i < sizeof(Fragment_) / 4; ++i) {
dst_ptr[i] = src_ptr[i];
}
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
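The Convert specialization above reduces to a per-element static_cast, while Copy bulk-moves the backing words. A standalone host-side analogue of the conversion path (hypothetical SimpleFragment type, no CUTLASS dependency):

#include <cstdio>

// Simplified stand-in for cutlass::Fragment: a fixed-size element array.
template <typename Scalar, int kScalars>
struct SimpleFragment {
  Scalar data[kScalars];
  Scalar& operator[](int i) { return data[i]; }
  Scalar const& operator[](int i) const { return data[i]; }
};

// Host analogue of Convert::transform: one static_cast per element.
template <typename In, typename Out, int kScalars>
void convert(SimpleFragment<In, kScalars> const& src,
             SimpleFragment<Out, kScalars>& dst) {
  for (int i = 0; i < kScalars; ++i) {
    dst[i] = static_cast<Out>(src[i]);
  }
}

int main() {
  SimpleFragment<float, 4> src = {{1.5f, 2.5f, 3.5f, 4.5f}};
  SimpleFragment<int, 4> dst;
  convert(src, dst);  // static_cast<int> truncates toward zero: 1 2 3 4
  for (int i = 0; i < 4; ++i) {
    std::printf("%d ", dst[i]);
  }
  std::printf("\n");
  return 0;
}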


@@ -1,287 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief A Coord is a coordinate of arbitrary rank into a tensor or matrix
*/
#pragma once
#include <cutlass/cutlass.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Describes identity elements
struct Identity {
/// Enumeration describing identity elements. Value assignments are significant.
/// Feel free to add or multiply by these, respectively.
enum Kind { Additive = 0, Multiplicative = 1 };
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Statically-sized array specifying Coords within a tensor
template <int N_>
struct Coord {
//
// Type and constant definitions
//
static int const N = N_;
//
// Data members
//
/// Indices
int idx[N];
//
// Methods
//
/// Default ctor initializes uniformly
CUTLASS_HOST_DEVICE
Coord(int value = 0) {
for (int i = 0; i < N; ++i) {
idx[i] = value;
}
}
/// Constructs from an array of integers
CUTLASS_HOST_DEVICE
Coord(int _idx[]) {
for (int i = 0; i < N; ++i) {
idx[i] = _idx[i];
}
}
/// Element-wise addition
CUTLASS_HOST_DEVICE
Coord operator+(Coord const& b) const {
Coord c;
for (int i = 0; i < N; ++i) {
c.idx[i] = idx[i] + b.idx[i];
}
return c;
}
/// Element-wise subtraction
CUTLASS_HOST_DEVICE
Coord operator-(Coord const& b) const {
Coord c;
for (int i = 0; i < N; ++i) {
c.idx[i] = idx[i] - b.idx[i];
}
return c;
}
/// Element-wise multiplication
CUTLASS_HOST_DEVICE
Coord operator*(Coord const& b) const {
Coord c;
for (int i = 0; i < N; ++i) {
c.idx[i] = idx[i] * b.idx[i];
}
return c;
}
/// Element-wise division
CUTLASS_HOST_DEVICE
Coord operator/(Coord const& b) const {
Coord c;
for (int i = 0; i < N; ++i) {
c.idx[i] = idx[i] / b.idx[i];
}
return c;
}
/// In-place addition
CUTLASS_HOST_DEVICE
Coord& operator+=(Coord const& b) {
for (int i = 0; i < N; ++i) {
idx[i] += b.idx[i];
}
return *this;
}
/// In-place subtraction
CUTLASS_HOST_DEVICE
Coord& operator-=(Coord const& b) {
for (int i = 0; i < N; ++i) {
idx[i] -= b.idx[i];
}
return *this;
}
/// In-place multiplication
CUTLASS_HOST_DEVICE
Coord& operator*=(Coord const& b) {
for (int i = 0; i < N; ++i) {
idx[i] *= b.idx[i];
}
return *this;
}
/// In-place division
CUTLASS_HOST_DEVICE
Coord& operator/=(Coord const& b) {
for (int i = 0; i < N; ++i) {
idx[i] /= b.idx[i];
}
return *this;
}
/// Member access operator
CUTLASS_HOST_DEVICE int& operator[](int dim) { return idx[dim]; }
/// Member access operator
CUTLASS_HOST_DEVICE int const& operator[](int dim) const { return idx[dim]; }
/// Computes the dot product of two Coord instances
template <typename T>
CUTLASS_HOST_DEVICE T dot(Coord const& b, T sum) const {
for (int i = 0; i < N; ++i) {
sum += idx[i] * b.idx[i];
}
return sum;
}
/// Computes the dot product of two Coord instances
template <typename T>
CUTLASS_HOST_DEVICE T dot(Coord const& b) const {
T sum = T(0);
for (int i = 0; i < N; ++i) {
sum += idx[i] * b.idx[i];
}
return sum;
}
/// Gets the index of a given Coord element
template <int Dim>
CUTLASS_HOST_DEVICE int& at() {
return idx[Dim];
}
/// Access via index; may limit unrolling potential
CUTLASS_HOST_DEVICE
int& at(int dim) { return idx[dim]; }
/// Gets the index of a given Coord element
template <int Dim>
CUTLASS_HOST_DEVICE int const& at() const {
return idx[Dim];
}
/// Access via index; may limit unrolling potential
CUTLASS_HOST_DEVICE
int const& at(int dim) const { return idx[dim]; }
/// Determines if two Coord<> objects are equal
CUTLASS_HOST_DEVICE
bool operator==(Coord<N> const& b) const {
bool equal = true;
for (int i = 0; equal && i < N; ++i) {
equal = (idx[i] == b.idx[i]);
}
return equal;
}
/// Not equal
CUTLASS_HOST_DEVICE
bool operator!=(Coord<N> const& b) const { return !(*this == b); }
/// Clamps a coordinate to a range specified by maximum and minimum values
CUTLASS_HOST_DEVICE
Coord& clamp(Coord<N> const& max, Coord<N> const& min = Coord<N>()) {
for (int i = 0; i < N; ++i) {
idx[i] = __NV_STD_MAX(__NV_STD_MIN(idx[i], max.idx[i]), min.idx[i]);
}
return *this;
}
/// Returns the product of all elements
CUTLASS_HOST_DEVICE
int count() const {
int product = idx[0];
for (int i = 1; i < N; ++i) {
product *= idx[i];
}
return product;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to make a 1-element coordinate
CUTLASS_HOST_DEVICE
Coord<1> make_Coord(int _0) {
int values[1] = {_0};
return Coord<1>(values);
}
/// Helper to make a 2-element coordinate
CUTLASS_HOST_DEVICE
Coord<2> make_Coord(int _0, int _1) {
int values[2] = {_0, _1};
return Coord<2>(values);
}
/// Helper to make a 3-element coordinate
CUTLASS_HOST_DEVICE
Coord<3> make_Coord(int _0, int _1, int _2) {
int values[3] = {_0, _1, _2};
return Coord<3>(values);
}
/// Helper to make a 4-element coordinate
CUTLASS_HOST_DEVICE
Coord<4> make_Coord(int _0, int _1, int _2, int _3) {
int values[4] = {_0, _1, _2, _3};
return Coord<4>(values);
}
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Getter
CUTLASS_HOST_DEVICE
Coord<2> get_Coord_hw(Coord<3> const& coord) { return make_Coord(coord[1], coord[2]); }
/// Getter
CUTLASS_HOST_DEVICE
Coord<2> get_Coord_hw(Coord<4> const& coord) { return make_Coord(coord[1], coord[2]); }
/// Getter
CUTLASS_HOST_DEVICE
Coord<3> get_Coord_hwc(Coord<4> const& coord) { return make_Coord(coord[1], coord[2], coord[3]); }
/// Getter
CUTLASS_HOST_DEVICE
Coord<3> get_Coord_dhw(Coord<4> const& coord) { return make_Coord(coord[0], coord[1], coord[2]); }
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
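Paired with a stride vector, Coord's dot() is what turns a logical coordinate into a linear memory offset; clamp() and the element-wise operators support tile boundary handling. A host-only sketch of the offset computation, with a hypothetical Coord3 standing in for Coord<3>:

#include <cstdio>

// Host-only analogue of cutlass::Coord<3>.
struct Coord3 {
  int idx[3];
};

// Element-wise addition, as Coord::operator+ does.
Coord3 add(Coord3 const& a, Coord3 const& b) {
  Coord3 c;
  for (int i = 0; i < 3; ++i) c.idx[i] = a.idx[i] + b.idx[i];
  return c;
}

// dot() as defined above: the sum of element-wise products.
int dot(Coord3 const& a, Coord3 const& b) {
  int sum = 0;
  for (int i = 0; i < 3; ++i) sum += a.idx[i] * b.idx[i];
  return sum;
}

int main() {
  Coord3 strides = {{12, 4, 1}};  // row-major strides for a 2x3x4 extent
  Coord3 point = {{1, 2, 3}};
  // Linear offset of (1,2,3): 1*12 + 2*4 + 3*1 = 23
  std::printf("offset = %d\n", dot(point, strides));
  Coord3 next = add(point, Coord3{{1, 0, 0}});
  std::printf("next offset = %d\n", dot(next, strides));  // 35
  return 0;
}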


@@ -1,44 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
/*! \file
\brief Helpers for printing cutlass/core objects
*/
#include <iosfwd>
#include <typeinfo>
#include <cutlass/coord.h>
template <int Rank>
std::ostream& operator<<(std::ostream& out, cutlass::Coord<Rank> const& coord) {
for (int i = 0; i < Rank; ++i) {
out << (i ? ", " : "") << coord.idx[i];
}
return out;
}


@@ -1,73 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Basic include for CUTLASS macros
*/
#pragma once
////////////////////////////////////////////////////////////////////////////////////////////////////
#define CUTLASS_MAJOR 1
#define CUTLASS_MINOR 0
#define CUTLASS_PATCH 0
#define CUTLASS_VERSION ((CUTLASS_MAJOR)*100 + (CUTLASS_MINOR)*10 + CUTLASS_PATCH)
#ifdef __NVCC__
#define CUTLASS_HOST_DEVICE __forceinline__ __device__ __host__
#define CUTLASS_DEVICE __forceinline__ __device__
#elif defined(__CUDACC_RTC__)
#define CUTLASS_HOST_DEVICE __forceinline__ __device__
#define CUTLASS_DEVICE __forceinline__ __device__
#else
#define CUTLASS_HOST_DEVICE
// CUTLASS_DEVICE is an error if not compiling device code
#endif
// CUTLASS_PRAGMA_UNROLL inserts a #pragma unroll if supported by the compiler
#if defined(__CUDA_ARCH__)
#if defined(_MSC_VER)
#define CUTLASS_PRAGMA_UNROLL __pragma("unroll")
#define CUTLASS_PRAGMA_NO_UNROLL __pragma("unroll 1")
#else
#define CUTLASS_PRAGMA_UNROLL _Pragma("unroll")
#define CUTLASS_PRAGMA_NO_UNROLL _Pragma("unroll 1")
#endif
#else
#define CUTLASS_PRAGMA_UNROLL
#define CUTLASS_PRAGMA_NO_UNROLL
#endif
#define CUTLASS_ASSERT(x) assert(x)
namespace cutlass {
/// NVIDIA GPU Warp size
static const int kWarpSize = 32;
} // namespace cutlass
////////////////////////////////////////////////////////////////////////////////////////////////////
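In use, CUTLASS_PRAGMA_UNROLL wraps compile-time-bounded loops so NVCC fully unrolls them, while host-only builds compile the same source unchanged. A short sketch, assuming the header above is reachable as <cutlass/cutlass.h>:

#include <cutlass/cutlass.h>

// Element-wise AXPY over a fixed-size register array. Under NVCC the
// macro below expands to `#pragma unroll`; on a plain host compiler it
// expands to nothing, per the definitions above.
template <int N>
CUTLASS_HOST_DEVICE void axpy(float a, float const (&x)[N], float (&y)[N]) {
  CUTLASS_PRAGMA_UNROLL
  for (int i = 0; i < N; ++i) {
    y[i] += a * x[i];
  }
}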


@@ -1,278 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines Fragment, a statically-sized array for storing parts of matrices within a
thread's registers.
*/
#pragma once
#include <assert.h>
#include <cutlass/shape.h>
#include <cutlass/util/cutlass_math.h>
#include <cutlass/vector.h>
namespace cutlass {
///////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup fragment_concept Fragment Concept
@{
\ref fragment_concept is a statically sized array for storing parts of tiles held by individual CUDA
threads.
@par \ref fragment_concept
Types satisfying \ref fragment_concept define the following members
- <b>Element</b> - type of each access held within the fragment
- <b>kElements</b> - number of elements stored by the fragment
- <b>clear()</b> - overwrites the fragment storage with zeros
- <b>Element & operator[](int i)</b> - by-reference access of the ith element
- <b>Element const & operator[](int i) const</b> - const by-reference access of the ith element
@}
*/
///////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup fragment_iterator_concept Fragment Iterator Concept
@{
\ref fragment_iterator_concept provides structured access to the elements within a fragment with an
optional bitcast to the desired access type
@par \ref fragment_iterator_concept
Types satisfying \ref fragment_iterator_concept define the following members
- <b>AccessType& operator[](int i)</b> - provides access to the ith element of the fragment
- <b>AccessType& at(int d, int h, int w, int c)</b> - applies \ref layout_concept to fragment and
provides access to element at (d, h, w, c)
@}
*/
////////////////////////////////////////////////////////////////////////////////////////////////////
template <int kAlignment_>
struct StorageType {
typedef uint64_t Type;
};
template <>
struct StorageType<4> {
typedef uint32_t Type;
};
template <>
struct StorageType<2> {
typedef uint16_t Type;
};
template <>
struct StorageType<1> {
typedef uint8_t Type;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief A template defining \ref fragment_concept
* @concept{fragment_concept}
*/
template <typename Element_, int kElements_, size_t kAlignment_ = 16>
struct Fragment : public AlignedStruct<kAlignment_> {
/// Make sure the alignment makes sense wrt the size of elements.
static_assert(kAlignment_ == 16 || kAlignment_ >= sizeof(Element_), "Alignment is too small");
/// Alignment must be a power of two
static_assert(is_pow2<kAlignment_>::value, "Alignment must be a power of two");
/// This class.
typedef Fragment<Element_, kElements_> This_;
/// The element.
typedef Element_ Element;
/// The number of elements.
static int const kElements = kElements_;
/// Clear a fragment.
CUTLASS_DEVICE void clear() {
// Avoid element-wise access for sub 32b element type
if (kAlignment_ >= 8 && (kElements * sizeof(Element)) % 8 == 0) {
uint64_t* ptr = reinterpret_cast<uint64_t*>(storage);
for (int i = 0; i < (kElements * sizeof(Element)) / 8; ++i) {
ptr[i] = uint64_t(0);
}
} else if (kAlignment_ >= 4 && (kElements * sizeof(Element)) % 4 == 0) {
uint32_t* ptr = reinterpret_cast<uint32_t*>(storage);
for (int i = 0; i < (kElements * sizeof(Element)) / 4; ++i) {
ptr[i] = uint32_t(0);
}
} else if (kAlignment_ >= 2 && (kElements * sizeof(Element)) % 2 == 0) {
uint16_t* ptr = reinterpret_cast<uint16_t*>(storage);
for (int i = 0; i < (kElements * sizeof(Element)) / 2; ++i) {
ptr[i] = uint16_t(0);
}
} else {
for (int i = 0; i < kElements; ++i) {
storage[i] = 0;
}
}
}
/// The accessor.
CUTLASS_DEVICE Element& operator[](int i) {
assert(i < kElements_);
return reinterpret_cast<Element*>(storage)[i];
}
/// The accessor.
CUTLASS_DEVICE Element const& operator[](int i) const {
assert(i < kElements_);
return reinterpret_cast<Element const*>(storage)[i];
}
private:
/// Storage type to use for Elements
typedef typename StorageType<kAlignment_>::Type StorageType;
/// Number of elements in the storage
static int const kStorageCount =
(sizeof(Element_) * kElements_ + sizeof(StorageType) - 1) / sizeof(StorageType);
/// The storage.
StorageType storage[kStorageCount];
/// Ensure that there's enough storage for all elements
static_assert(sizeof(StorageType) <= kAlignment_, "StorageType is too big for given alignment");
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief A template defining \ref fragment_iterator_concept
* @concept{fragment_iterator_concept}
*/
template <typename Fragment_, typename Iterations_, typename AccessType_>
struct FragmentIterator {
/// This class.
typedef FragmentIterator<Fragment_, Iterations_, AccessType_> This_;
/// The fragment.
typedef Fragment_ Fragment;
/// The number of iterations.
typedef Iterations_ Iterations;
/// The access type.
typedef AccessType_ AccessType;
/// The element.
typedef typename Fragment::Element Element;
/// The number of elements per access.
static int const kElementsPerAccess = (int)(sizeof(AccessType) / sizeof(Element));
/// The shape of the fragment.
typedef typename ShapeMul<Iterations, Shape<1, 1, 1, kElementsPerAccess> >::Shape FragmentShape;
/// The linear strides for iterations.
typedef typename ShapeStrides<FragmentShape>::Shape Strides;
/// Ctor.
template <typename OtherFragment_>
CUTLASS_DEVICE FragmentIterator(OtherFragment_& fragment, int offset = 0)
: pointer(reinterpret_cast<Element*>(&fragment[offset])) {
static_assert(OtherFragment_::kElements >= Fragment::kElements, "");
}
/// The accessor.
CUTLASS_DEVICE AccessType const& at(int d, int h, int w, int c = 0) const {
int const imm = ComputeOffsetFromStrides<Strides>::get(d, h, w, c);
return reinterpret_cast<AccessType const&>(pointer[imm]);
}
/// The accessor.
CUTLASS_DEVICE AccessType& at(int d, int h, int w, int c = 0) {
int const imm = ComputeOffsetFromStrides<Strides>::get(d, h, w, c);
return reinterpret_cast<AccessType&>(pointer[imm]);
}
/// The accessor.
CUTLASS_DEVICE AccessType const& operator[](int i) const {
return reinterpret_cast<AccessType const&>(pointer[i * kElementsPerAccess]);
}
/// The accessor.
CUTLASS_DEVICE AccessType& operator[](int i) {
return reinterpret_cast<AccessType&>(pointer[i * kElementsPerAccess]);
}
/// Is the iterator valid?
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
/// The pointer.
Element* pointer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Fragment_, typename Iterations_, typename AccessType_>
struct FragmentConstIterator {
/// This class.
typedef FragmentConstIterator<Fragment_, Iterations_, AccessType_> This_;
/// The fragment.
typedef Fragment_ Fragment;
/// The number of iterations.
typedef Iterations_ Iterations;
/// The access type.
typedef AccessType_ AccessType;
/// The element.
typedef typename Fragment::Element Element;
/// The number of elements per access.
static int const kElementsPerAccess = (int)(sizeof(AccessType) / sizeof(Element));
/// The shape of the fragment.
typedef typename ShapeMul<Iterations, Shape<1, 1, 1, kElementsPerAccess> >::Shape FragmentShape;
/// The linear strides for iterations.
typedef typename ShapeStrides<FragmentShape>::Shape IterationsStrides;
/// Ctor.
template <typename OtherFragment_>
CUTLASS_DEVICE FragmentConstIterator(OtherFragment_& fragment, int offset = 0)
: pointer(reinterpret_cast<Element const*>(&fragment[offset])) {
static_assert(OtherFragment_::kElements >= Fragment::kElements, "");
}
/// Create from non-constant FragmentIterator
CUTLASS_DEVICE FragmentConstIterator(
FragmentIterator<Fragment_, Iterations_, AccessType_> const& rhs_)
: pointer(reinterpret_cast<Element const*>(rhs_.pointer)) {}
/// The accessor.
CUTLASS_DEVICE AccessType const& at(int d, int h, int w, int c = 0) const {
int const imm = ComputeOffsetFromStrides<IterationsStrides>::get(d, h, w, c);
return reinterpret_cast<AccessType const&>(pointer[imm]);
}
/// The accessor.
CUTLASS_DEVICE AccessType const& operator[](int i) const {
return reinterpret_cast<AccessType const&>(pointer[i * kElementsPerAccess]);
}
/// Is the iterator valid?
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
/// The pointer.
Element const* pointer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
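Fragment keeps its elements in aligned backing words and hands out Element references through a reinterpret_cast, while FragmentIterator layers a wider AccessType view on the same memory. A standalone host analogue of those two views (hypothetical Frag8f and Vec4 types):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Standalone analogue of Fragment<float, 8> backed by uint64_t words.
struct Frag8f {
  uint64_t storage[4];  // 8 floats = 32 bytes = 4 x uint64_t

  float& operator[](int i) { return reinterpret_cast<float*>(storage)[i]; }

  // clear() zeroes the backing words, as Fragment::clear() does.
  void clear() { std::memset(storage, 0, sizeof(storage)); }
};

int main() {
  Frag8f frag;
  frag.clear();
  for (int i = 0; i < 8; ++i) frag[i] = float(i);

  // A 4-wide vector view, mirroring FragmentIterator's AccessType
  // reinterpretation (kElementsPerAccess = 4 here).
  struct Vec4 { float x, y, z, w; };
  Vec4* access = reinterpret_cast<Vec4*>(&frag[0]);
  std::printf("access[1]: %g %g %g %g\n",
              access[1].x, access[1].y, access[1].z, access[1].w);  // 4 5 6 7
  return 0;
}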


@@ -1,135 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines accessors for loading and storing fragments to memory efficiently.
*/
#pragma once
#include <cutlass/load_store.h>
#include <cutlass/vector.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <IteratorFragment::Kind kIteratorFragment,
int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentLoad {};
template <int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentLoad<IteratorFragment::kWmmaMatrix,
kAccessSize,
Scalar_,
Memory_,
FragmentElement_,
kStride> {
/// The output type.
typedef FragmentElement_ AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& value, Scalar_ const* pointer, int offset) {
value.load(&pointer[offset], kStride);
}
};
template <int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentLoad<IteratorFragment::kScalar,
kAccessSize,
Scalar_,
Memory_,
FragmentElement_,
kStride> {
/// The output type.
typedef typename Vectorize<Scalar_, kAccessSize>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& value, Scalar_ const* pointer, int offset) {
Load<Scalar_, kAccessSize, Memory_>::load(value, pointer, offset);
}
};
template <IteratorFragment::Kind kIteratorFragment,
int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentStore {};
template <int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentStore<IteratorFragment::kWmmaMatrix,
kAccessSize,
Scalar_,
Memory_,
FragmentElement_,
kStride> {
/// The input type.
typedef FragmentElement_ AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& value, Scalar_* pointer, int offset) {
value.store(&pointer[offset], kStride);
}
};
template <int kAccessSize,
typename Scalar_,
MemorySpace::Kind Memory_,
typename FragmentElement_,
int kStride>
struct FragmentStore<IteratorFragment::kScalar,
kAccessSize,
Scalar_,
Memory_,
FragmentElement_,
kStride> {
/// The input type.
typedef typename Vectorize<Scalar_, kAccessSize>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& value, Scalar_* pointer, int offset) {
Store<Scalar_, kAccessSize, Memory_>::store(value, pointer, offset);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
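The empty primary templates above exist purely for tag dispatch: the IteratorFragment kind selects whether a fragment loads itself (the WMMA path) or is filled through vectorized scalar accesses. A stripped-down standalone sketch of that dispatch (hypothetical FragKind and fragment types):

#include <cstdio>

// Tag standing in for IteratorFragment::Kind.
enum class FragKind { kScalar, kWmmaMatrix };

// Primary template left empty, as in the file above.
template <FragKind Kind>
struct LoadDispatch {};

// WMMA path: the fragment knows how to load itself.
template <>
struct LoadDispatch<FragKind::kWmmaMatrix> {
  template <typename Frag>
  static void load(Frag& value, float const* ptr, int offset) {
    value.load(ptr + offset);
  }
};

// Scalar path: fill the fragment element by element.
template <>
struct LoadDispatch<FragKind::kScalar> {
  template <typename Frag>
  static void load(Frag& value, float const* ptr, int offset) {
    for (int i = 0; i < Frag::kElements; ++i) {
      value.data[i] = ptr[offset + i];
    }
  }
};

struct ScalarFrag {
  static int const kElements = 4;
  float data[4];
};

int main() {
  float buffer[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  ScalarFrag frag;
  LoadDispatch<FragKind::kScalar>::load(frag, buffer, 4);
  std::printf("%g %g %g %g\n",
              frag.data[0], frag.data[1], frag.data[2], frag.data[3]);
  return 0;
}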


@@ -1,131 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines multiply-add operations on fragments within a thread.
*/
#pragma once
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_>
struct FragmentMultiplyAdd {
/// The shape of the instruction.
typedef Shape<1, 1, 1, 1> InstructionShape;
/// The type for A.
typedef Scalar_ ScalarA;
/// The type for B.
typedef Scalar_ ScalarB;
/// The type for C and D.
typedef Scalar_ ScalarC;
/// Ctor.
CUTLASS_DEVICE FragmentMultiplyAdd() {}
/// Multiply : d = a*b.
template <typename Fragment_>
CUTLASS_DEVICE void multiply(Scalar_ a, Fragment_ const& b, Fragment_& d) {
for (int j = 0; j < Fragment_::kElements; ++j) {
d[j] = a * b[j];
}
}
/// Multiply : d = a*b + c.
template <typename Fragment_>
CUTLASS_DEVICE void multiply_add(Scalar_ a,
Fragment_ const& b,
Fragment_ const& c,
Fragment_& d) {
for (int j = 0; j < Fragment_::kElements; ++j) {
d[j] = a * b[j] + c[j];
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
template <>
struct FragmentMultiplyAdd<half> {
/// The shape of the instruction.
typedef Shape<1, 1, 1, 1> InstructionShape;
/// The type for A.
typedef half ScalarA;
/// The type for B.
typedef half ScalarB;
/// The type for C and D.
typedef half ScalarC;
/// Ctor.
CUTLASS_DEVICE FragmentMultiplyAdd() {}
/// Multiply : d = a*b.
template <typename Fragment_>
CUTLASS_DEVICE void multiply(half a, Fragment_ const& b, Fragment_& d) {
#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
// The input.
__half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
// The output.
__half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
// Assemble a half2 from a.
__half2 const a_half2 = __half2half2(a);
for (int i = 0; i < Fragment_::kElements / 2; ++i) {
d_half2[i] = __hmul2(a_half2, b_half2[i]);
}
#endif
}
/// Multiply : d = a*b + c.
template <typename Fragment_>
CUTLASS_DEVICE void multiply_add(half a, Fragment_ const& b, Fragment_ const& c, Fragment_& d) {
#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
// The inputs.
__half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
__half2 const* c_half2 = reinterpret_cast<__half2 const*>(&c[0]);
// The output.
__half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
// Assemble a half2 from a.
__half2 const a_half2 = __half2half2(a);
for (int i = 0; i < Fragment_::kElements / 2; ++i) {
d_half2[i] = __hfma2(a_half2, b_half2[i], c_half2[i]);
}
#endif
}
};
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
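The half specialization gets its throughput from packing element pairs into __half2, so each __hfma2 covers two lanes per instruction. A host-side analogue of that pairing, with a hypothetical float2_t standing in for __half2:

#include <cstdio>

// Two-wide packed type standing in for __half2.
struct float2_t {
  float lo, hi;
};

// Lane-wise d = a * b + c, standing in for __hfma2.
float2_t fma2(float2_t a, float2_t b, float2_t c) {
  return {a.lo * b.lo + c.lo, a.hi * b.hi + c.hi};
}

int main() {
  int const kElements = 4;
  float a = 0.5f;
  float b[kElements] = {1, 2, 3, 4};
  float c[kElements] = {10, 20, 30, 40};
  float d[kElements];
  float2_t a2 = {a, a};  // broadcast, as __half2half2(a) does
  // Two elements per packed operation, as the __hfma2 loop above does.
  for (int i = 0; i < kElements / 2; ++i) {
    float2_t b2 = {b[2 * i], b[2 * i + 1]};
    float2_t c2 = {c[2 * i], c[2 * i + 1]};
    float2_t d2 = fma2(a2, b2, c2);
    d[2 * i] = d2.lo;
    d[2 * i + 1] = d2.hi;
  }
  for (int i = 0; i < kElements; ++i) {
    std::printf("%g ", d[i]);  // 10.5 21 31.5 42
  }
  std::printf("\n");
  return 0;
}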


@@ -1,55 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines abstractions for efficiently clearing accumulator tiles.
*/
#pragma once
#include <cutlass/vector.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int kLanes_ = 1>
struct ClearAccumulators {
/// The shared storage.
struct SharedStorage {};
/// Ctor.
CUTLASS_DEVICE ClearAccumulators(SharedStorage& shared_storage) {}
/// Clear the fragment.
template <typename Fragment_>
CUTLASS_DEVICE void clear(Fragment_& fragment) {
fragment.clear();
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass


@@ -1,127 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural traits of double-precision GEMM.
*/
#pragma once
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/gemm_epilogue.h>
#include <cutlass/gemm/gemm_epilogue_traits.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/gemm/gemm_shared_tile.h>
#include <cutlass/gemm/gemm_traits.h>
#include <cutlass/gemm/thread_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The tile size for the GEMM KxNxM.
typename OutputTile_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_ = 1,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_ = 1>
struct DgemmConfig
: public GemmConfig<
/// The scalar type for A.
double,
/// The scalar type for B.
double,
/// The scalar type for C.
double,
/// The scalar type for D.
double,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, double, double, double>,
/// The number of scalars per LDG for A.
kScalarsPerLdgA_,
/// The number of scalars per STS for A.
kScalarsPerLdgA_,
/// The number of scalars per LDS for A.
2,
/// The number of scalars per LDG for B.
kScalarsPerLdgB_,
/// The number of scalars per STS for B.
kScalarsPerLdgB_,
/// The number of scalars per LDS for B.
2,
/// The number of scalars per LDG for C and STG for D.
1,
/// The number of scalars per STS for D.
2,
/// The number of scalars per LDS for D.
1,
/// The number of stages in shared memory.
2> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_ = Shape<8, 64, 128>,
/// The functor to use in the epilogue.
typename EpilogueFunctor_ = LinearScaling<double>,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<8, 8, 8>,
/// The number of doubles loaded in one LDG for A.
int kScalarsPerLdgA_ = 1,
/// The number of doubles loaded in one LDG for B.
int kScalarsPerLdgB_ = 1,
/// The index.
typename Index_ = int,
/// The DGEMM config.
typename GemmConfig_ =
DgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>,
/// The traits class for the epilogue.
typename GemmEpilogueTraits_ =
SimplifiedGemmEpilogueTraits<GemmConfig_, EpilogueFunctor_, Index_> >
struct DgemmTraits : public SimplifiedGemmTraits<
// The layout for A.
kLayoutA_,
// The layout for B.
kLayoutB_,
// The config.
GemmConfig_,
// The epilogue.
GemmEpilogue<GemmEpilogueTraits_>,
// The index.
Index_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
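Typical CUTLASS 1.x usage composes these traits with gemm::Gemm (defined in the next file): instantiate DgemmTraits, fill Params via initialize(), and launch. A sketch under that assumption, with dA/dB/dC as already-allocated device pointers:

#include <cuda_runtime.h>
#include <cutlass/gemm/dgemm_traits.h>
#include <cutlass/gemm/gemm.h>

// Launches a column-major x column-major DGEMM with the default
// DgemmTraits configuration; D aliases C here (ldd = ldc).
cudaError_t run_dgemm(int m, int n, int k,
                      double alpha, double const* dA, int lda,
                      double const* dB, int ldb,
                      double beta, double* dC, int ldc) {
  typedef cutlass::gemm::DgemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                     cutlass::MatrixLayout::kColumnMajor>
      GemmTraits;
  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

  Gemm::Params params;
  int result = params.initialize(m, n, k, alpha, dA, lda, dB, ldb,
                                 beta, dC, ldc, dC, ldc);
  if (result) {
    return cudaErrorInvalidValue;
  }
  return Gemm::launch(params);
}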


@@ -1,319 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements a software-pipelined efficient GEMM.
*/
#pragma once
#if !defined(__CUDACC_RTC__)
#include <cuda.h>
#endif
#include <cutlass/coord.h>
#include <cutlass/util/platform.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Gemm_>
__global__ void gemm_kernel(typename Gemm_::Params params) {
// Declare shared memory.
__shared__ typename Gemm_::SharedStorage shared_storage;
// Construct the GEMM object.
Gemm_ gemm(params, shared_storage);
// Run GEMM.
gemm.multiply_add();
}
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Index_ = int>
struct GemmDesc {
/// The dimensions of the GEMM.
Index_ m, n, k;
/// The alpha/beta scaling values.
Scalar_ alpha, beta;
/// The source matrix A.
void const* d_a;
/// The stride for A.
Index_ lda;
/// The source matrix B.
void const* d_b;
/// The stride for B.
Index_ ldb;
/// The source matrix C.
void const* d_c;
/// The stride for C.
Index_ ldc;
/// The destination matrix D.
void* d_d;
/// The stride for D.
Index_ ldd;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmTraits_>
struct Gemm {
/// This class.
typedef Gemm<GemmTraits_> This_;
/// The traits.
typedef GemmTraits_ Traits;
/// The shared storage.
typedef typename Traits::SharedStorage SharedStorage;
/// The scalar for A.
typedef typename Traits::ScalarA ScalarA;
/// The scalar for B.
typedef typename Traits::ScalarB ScalarB;
/// The scalar in the epilogue.
typedef typename Traits::Epilogue::Scalar ScalarEpilogue;
/// The scalar for C.
typedef typename Traits::Epilogue::ScalarC ScalarC;
/// The scalar for D.
typedef typename Traits::Epilogue::ScalarD ScalarD;
/// The index.
typedef typename Traits::Index Index;
/// The number of threads.
static int const kThreads = Traits::GemmConfig::kThreads;
/// The params.
struct Params : public Traits::Params {
CUTLASS_HOST_DEVICE int initialize(Index m,
Index n,
Index k,
ScalarEpilogue alpha,
ScalarA const* d_a,
Index lda,
ScalarB const* d_b,
Index ldb,
ScalarEpilogue beta,
ScalarC const* d_c,
Index ldc,
ScalarD* d_d,
Index ldd) {
GemmDesc<ScalarEpilogue, Index> desc;
desc.m = m;
desc.n = n;
desc.k = k;
desc.alpha = alpha;
desc.beta = beta;
desc.d_a = reinterpret_cast<void const*>(d_a);
desc.lda = lda;
desc.d_b = reinterpret_cast<void const*>(d_b);
desc.ldb = ldb;
desc.d_c = reinterpret_cast<void const*>(d_c);
desc.ldc = ldc;
desc.d_d = reinterpret_cast<void*>(d_d);
desc.ldd = ldd;
return Traits::Params::initialize(desc);
}
};
#if !defined(__CUDACC_RTC__)
/// Launch the kernel.
static __host__ cudaError_t launch(Params const& params,
cudaStream_t stream = cudaStreamDefault) {
// Setup the grid.
dim3 grid;
grid.x = (params.m + Traits::OutputTile::kW - 1) / Traits::OutputTile::kW;
grid.y = (params.n + Traits::OutputTile::kH - 1) / Traits::OutputTile::kH;
// The number of threads.
dim3 block;
block.x = kThreads;
// Launch the kernel.
void const* params_ = reinterpret_cast<void const*>(&params);
return cudaLaunchKernel(reinterpret_cast<void*>(&gemm_kernel<This_>),
grid,
block,
const_cast<void**>(&params_),
0,
stream);
}
/// Launch the kernel.
static __host__ cudaError_t launch(CUfunction kernel,
Params const& params,
CUstream stream = CU_STREAM_LEGACY) {
// Setup the grid.
dim3 grid;
grid.x = (params.m + Traits::OutputTile::kW - 1) / Traits::OutputTile::kW;
grid.y = (params.n + Traits::OutputTile::kH - 1) / Traits::OutputTile::kH;
// The number of threads.
dim3 block;
block.x = kThreads;
// Launch the kernel.
void* params_[] = {const_cast<void*>(reinterpret_cast<void const*>(&params))};
// return cudaLaunchKernel(reinterpret_cast<void*>(&gemm_kernel<This_>), grid, block,
// const_cast<void**>(&params_), 0, stream);
CUresult result = cuLaunchKernel(
kernel, grid.x, grid.y, grid.z, block.x, block.y, block.z, 0, stream, params_, 0);
if (result != CUDA_SUCCESS) {
return cudaErrorLaunchFailure;
}
return cudaSuccess;
}
#endif
/// Ctor.
CUTLASS_DEVICE Gemm(Params const& params_, SharedStorage& shared_storage_)
: params(params_), shared_storage(shared_storage_) {}
/// Do the GEMM.
CUTLASS_DEVICE void multiply_add() {
// Swizzle the IDs of the block (to enable better cache behavior).
typename Traits::BlockSwizzle block_swizzle;
dim3 block = block_swizzle.swizzle();
// Scale the id.
block.x *= Traits::OutputTile::kW;
block.y *= Traits::OutputTile::kH;
// We may want to use shared memory to clear the registers.
typedef typename Traits::ClearAccumulators ClearAccumulators;
// The streams to read A/B from global memory to shared memory.
typename Traits::GlobalLoadStream global_stream(params, shared_storage, block);
// Create the accumulator clear.
ClearAccumulators clear(shared_storage.main_loop.clear);
// Define the main loop iteration size.
typedef typename Traits::MultiplyAdd MultiplyAdd;
// By how much we unroll the main loop.
Index const kUnroll = static_cast<Index>(MultiplyAdd::AccumulatorsPerWarp::kD);
// If we do not have enough steps in the main loop, trigger the residue code.
if (params.k < kUnroll) {
global_stream.residue(params.k, true);
}
// Fetch the fragments for A and B from global memory.
global_stream.copy();
// Copy the elements to shared memory (after transformation if needed).
global_stream.commit();
// Make sure the data is in shared memory.
Traits::shared_store_fence(false);
// The unrolling steps for the main loop.
int const kUnrollingSteps =
MultiplyAdd::AccumulatorsPerWarp::kD / MultiplyAdd::InstructionShape::kD;
// Make sure we have at least 2 unrolling steps or our pipelining is not going to work.
static_assert(kUnrollingSteps >= 2, "The pipelining assumes at least two steps");
// The stream of data from shared memory to fragments.
typename Traits::SharedLoadStream shared_load_stream(params, shared_storage);
// Trigger the copy from shared memory for the 1st stream.
shared_load_stream.copy(0);
// Allocate the accumulators.
typename MultiplyAdd::Accumulators accumulators;
// Clear the accumulators.
clear.clear(accumulators);
// Enter the main loop and iterate.
typedef typename Traits::Index Index;
for (Index outer_k = params.k - kUnroll; outer_k > -kUnroll; outer_k -= kUnroll) {
// If that's the last "load iteration", update the predicates.
int const is_residue = outer_k <= kUnroll;
if (is_residue) {
global_stream.residue(outer_k);
}
// Load data for the next iteration of the main loop.
global_stream.copy();
CUTLASS_PRAGMA_UNROLL
for (int step = 0; step < kUnrollingSteps - 1; ++step) {
// Trigger the copy from shared memory for the next A/B values.
shared_load_stream.copy(step + 1);
// Make sure the values are available for the current iteration to do the multiply-add.
shared_load_stream.commit(step);
// Do the math on the fragments of the current iteration.
MultiplyAdd multiply_add;
multiply_add.multiply_add(shared_load_stream.fragment_a(step),
shared_load_stream.fragment_b(step),
accumulators,
accumulators);
}
// Make sure the data from shared memory has been entirely consumed.
Traits::shared_load_fence(true);
// Commit the data in shared memory for A/B.
global_stream.commit();
// Make sure the data is in shared memory.
Traits::shared_store_fence(true);
// Move to the next stage for the load (if it makes sense).
shared_load_stream.inc_stage();
// Trigger the copy from shared memory for the next loop iteration.
shared_load_stream.copy(0);
// Make sure the values are available for the current iteration to do the multiply-add.
shared_load_stream.commit(kUnrollingSteps - 1);
// Do the math on the fragments of the current iteration.
MultiplyAdd multiply_add;
multiply_add.multiply_add(shared_load_stream.fragment_a(kUnrollingSteps - 1),
shared_load_stream.fragment_b(kUnrollingSteps - 1),
accumulators,
accumulators);
}
// Epilogue.
typedef typename Traits::Epilogue Epilogue;
Epilogue epilogue(params.epilogue, shared_storage.epilogue, params.m, params.n);
epilogue.epilogue(cutlass::make_Coord(0, block.y, block.x), accumulators);
}
/// The params.
Params const& params;
/// The shared storage.
SharedStorage& shared_storage;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
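For context, the Params::initialize and launch entry points above compose as follows on the host. This is a minimal sketch assuming the SgemmTraits configuration helper that CUTLASS 1.x shipped in cutlass/gemm/sgemm_traits.h; the traits name, header, and tile shape are assumptions, not defined in the file above.

// Minimal host-side sketch (assumes SgemmTraits from cutlass/gemm/sgemm_traits.h).
#include <cutlass/gemm/sgemm_traits.h>

cudaError_t sgemm_nn(int m, int n, int k,
                     float alpha, float const* d_a, int lda,
                     float const* d_b, int ldb,
                     float beta, float const* d_c, int ldc,
                     float* d_d, int ldd) {
  // Column-major x column-major SGEMM with a 128x128 output tile and K-slice of 8 (assumed).
  typedef cutlass::gemm::SgemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                     cutlass::MatrixLayout::kColumnMajor,
                                     cutlass::Shape<8, 128, 128> >
      SgemmTraits;
  typedef cutlass::gemm::Gemm<SgemmTraits> Gemm;
  // Mirrors the Params::initialize signature defined above.
  Gemm::Params params;
  if (params.initialize(m, n, k, alpha, d_a, lda, d_b, ldb, beta, d_c, ldc, d_d, ldd)) {
    return cudaErrorInvalidValue;
  }
  // Computes the grid from OutputTile and launches gemm_kernel<Gemm>.
  return Gemm::launch(params);
}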

View File

@@ -1,225 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements the epilogue phase of the GEMM kernel that efficiently updates global memory
with the computed matrix product.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/coord.h>
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename T>
CUTLASS_DEVICE bool is_zero(T x) {
return x == T(0);
}
#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
CUTLASS_DEVICE bool is_zero(half x) { return reinterpret_cast<int16_t&>(x) == int16_t(0); }
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmEpilogueTraits_>
struct GemmEpilogue {
/// The traits class.
typedef GemmEpilogueTraits_ Traits;
/// The params.
typedef typename Traits::Params Params;
/// The shared storage.
typedef typename Traits::SharedStorage SharedStorage;
/// The output tile.
typedef typename Traits::OutputTile OutputTile;
/// The number of iterations.
typedef typename Traits::Iterations Iterations;
/// The accumulators.
typedef typename Traits::Accumulators Accumulators;
/// The scalar.
typedef typename Traits::Scalar Scalar;
/// The functor in charge of the math.
typedef typename Traits::Functor Functor;
/// We do not support 3D or 4D shapes.
static_assert(Iterations::kD == 1 && Iterations::kC == 1, "Unsupported 3D/4D shapes");
/// The iterator for C in global memory.
typedef typename Traits::GlobalLoadIteratorC GlobalLoadIteratorC;
/// The transformer for C.
typedef typename Traits::GlobalTransformerC GlobalTransformerC;
/// The transformer for D.
typedef typename Traits::GlobalTransformerD GlobalTransformerD;
/// The iterator for D in global memory.
typedef typename Traits::GlobalStoreIteratorD GlobalStoreIteratorD;
/// The iterator to store D in shared memory.
typedef typename Traits::SharedStoreIteratorD SharedStoreIteratorD;
/// The shared store transformer for D.
typedef typename Traits::SharedStoreTransformerD SharedStoreTransformerD;
/// The iterator to load D in shared memory.
typedef typename Traits::SharedLoadIteratorD SharedLoadIteratorD;
/// The shared load transformer for D.
typedef Copy<typename SharedLoadIteratorD::Fragment> SharedLoadTransformerD;
/// The index.
typedef typename Traits::Index Index;
/// The scalar for C.
typedef typename GlobalLoadIteratorC::Scalar ScalarC;
/// The scalar for D.
typedef typename GlobalStoreIteratorD::Scalar ScalarD;
/// Ctor.
CUTLASS_DEVICE GemmEpilogue(Params const& params_,
SharedStorage& shared_storage_,
Index m_,
Index n_)
: params(params_), shared_storage(shared_storage_), m(m_), n(n_) {}
/// Execute the epilogue.
CUTLASS_DEVICE void epilogue(Coord<3> const& block, Accumulators& accumulators) {
if (is_zero(params.functor.beta)) {
epilogue_with_or_without_beta<true>(block, accumulators);
} else {
epilogue_with_or_without_beta<false>(block, accumulators);
}
}
template <bool kBetaIsZero_>
CUTLASS_DEVICE void epilogue_with_or_without_beta(Coord<3> const& block,
Accumulators& accumulators) {
Coord<3> const bounds = cutlass::make_Coord(0, n, m);
// The functor.
Functor functor(params.functor);
// The C fragment.
typename GlobalLoadIteratorC::Fragment fragment_c;
// The transformed C fragment.
typename GlobalTransformerC::OutputFragment transformed_c;
CUTLASS_PRAGMA_UNROLL
for (int h = 0; h < Iterations::kH; ++h) {
// Compute pointer and predicate offsets for C and D global iterators.
int const pointer_offset =
((params.iterator_d.inc_h * (GlobalStoreIteratorD::Iterations::kH - 1) +
params.iterator_d.inc_advance) *
Iterations::kW +
params.stride_h) *
h;
int const predicate_offset =
((params.iterator_d.predicate_inc_h * (GlobalStoreIteratorD::Iterations::kH - 1) +
params.iterator_d.predicate_inc_advance) *
Iterations::kW +
Traits::Delta::kH) *
h;
// The iterator to load the elements of the C matrix.
GlobalLoadIteratorC global_load_iterator(
params.iterator_c, bounds, block, pointer_offset, predicate_offset);
// The transformer for C.
GlobalTransformerC transformer_c;
// The transformer for D.
GlobalTransformerD transformer_d;
// The iterator to store into the D matrix.
GlobalStoreIteratorD global_store_iterator(
params.iterator_d, bounds, block, pointer_offset, predicate_offset);
CUTLASS_PRAGMA_UNROLL
for (int w = 0; w < Iterations::kW; ++w) {
// Load the C matrix into fragment.
if (!kBetaIsZero_) {
iterator_load(global_load_iterator, fragment_c);
}
// Make sure we can write to shared memory.
shared_load_fence();
// Copy the accumulators to shared memory.
int const offset = (h * Iterations::kW + w) * SharedStoreIteratorD::Fragment::kElements;
SharedStoreTransformerD shared_store_transformer;
typename SharedStoreTransformerD::OutputFragment shared_store_transformed_d;
shared_store_transformer.transform(accumulators, offset, shared_store_transformed_d);
SharedStoreIteratorD shared_store_iterator(params.shared_store_iterator_d,
shared_storage.shared_stream.store);
shared_iterator_store(shared_store_iterator, shared_store_transformed_d);
// Make sure the data is in shared memory.
shared_store_fence();
// Copy the accumulators back to registers from shared memory.
SharedLoadIteratorD shared_load_iterator(params.shared_load_iterator_d,
shared_storage.shared_stream.load);
typename SharedLoadIteratorD::Fragment fetched_d;
shared_iterator_load(shared_load_iterator, fetched_d);
// Do the math.
typename GlobalTransformerD::InputFragment fragment_d;
if (kBetaIsZero_) {
functor.evaluate(fetched_d, fragment_d);
} else {
// Transform C fragment.
transformer_c.transform(fragment_c, transformed_c);
// Do the math.
functor.evaluate(fetched_d, transformed_c, fragment_d);
}
// Transform D fragment.
typename GlobalTransformerD::OutputFragment transformed_d;
transformer_d.transform(fragment_d, transformed_d);
// Copy the results to global memory.
iterator_store(global_store_iterator, transformed_d);
}
}
}
/// The memory fence for shared loads.
CUTLASS_DEVICE void shared_load_fence() { __syncthreads(); }
/// The memory fence for shared stores.
CUTLASS_DEVICE void shared_store_fence() { __syncthreads(); }
/// The params.
Params const& params;
/// The shared storage.
SharedStorage& shared_storage;
/// The dimensions of the GEMM.
Index m, n;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
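The kBetaIsZero_ specialization above pays off because the functor's math loses its second term when beta is zero, so the global load of C is skipped entirely. A minimal sketch of the per-element math, assuming a linear-scaling functor (the actual Functor is supplied through the traits and may differ):

// Per-element epilogue math, sketched for a linear-scaling functor (assumed shape).
template <typename Fragment, typename Scalar>
CUTLASS_DEVICE void evaluate_sketch(Scalar alpha, Scalar beta,
                                    Fragment const& accum, Fragment const& c, Fragment& d) {
  for (int i = 0; i < Fragment::kElements; ++i) {
    // When beta == 0, the kernel takes the kBetaIsZero_ path and never loads c;
    // the math then reduces to d[i] = alpha * accum[i].
    d[i] = alpha * accum[i] + beta * c[i];
  }
}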

View File

@@ -1,331 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of the GEMM epilogue.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/coord.h>
#include <cutlass/gemm/gemm_global_stream.h>
#include <cutlass/gemm/gemm_shared_stream.h>
#include <cutlass/gemm/linear_scaling.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/tile_iterator.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The output tile.
typename OutputTile_,
/// The accumulators.
typename Accumulators_,
/// The iterator to load C from global memory.
typename GlobalLoadIteratorC_,
/// The transformer for C.
typename GlobalTransformerC_,
/// The transformer for D.
typename GlobalTransformerD_,
/// The iterator to store D to global memory.
typename GlobalStoreIteratorD_,
/// The iterator to store D to shared memory.
typename SharedStoreIteratorD_,
/// The shared store transformer for D.
typename SharedStoreTransformerD_,
/// The iterator to load D from shared memory.
typename SharedLoadIteratorD_,
/// The number of iterations in the epilogue.
typename Iterations_,
/// The iterations strides.
typename Delta_,
/// The functor to be used in the epilogue.
typename Functor_,
/// The index.
typename Index_ = int>
struct GemmEpilogueTraits {
/// The output tile.
typedef OutputTile_ OutputTile;
/// The accumulators.
typedef Accumulators_ Accumulators;
/// The iterator for C in global memory.
typedef GlobalLoadIteratorC_ GlobalLoadIteratorC;
/// The transformer for C.
typedef GlobalTransformerC_ GlobalTransformerC;
/// The transformer for D.
typedef GlobalTransformerD_ GlobalTransformerD;
/// The iterator for D in global memory.
typedef GlobalStoreIteratorD_ GlobalStoreIteratorD;
/// The iterator to store D in shared memory.
typedef SharedStoreIteratorD_ SharedStoreIteratorD;
/// The shared store transformer for D.
typedef SharedStoreTransformerD_ SharedStoreTransformerD;
/// The iterator to load D from shared memory.
typedef SharedLoadIteratorD_ SharedLoadIteratorD;
/// The number of iterations.
typedef Iterations_ Iterations;
/// The iterations strides.
typedef Delta_ Delta;
/// The functor in charge of the math.
typedef Functor_ Functor;
/// The index.
typedef Index_ Index;
/// We do not support 3D or 4D shapes.
static_assert(Iterations::kD == 1 && Iterations::kC == 1, "Unsupported 3D/4D shapes");
/// The scalar.
typedef typename Functor::Scalar Scalar;
/// The scalar for C.
typedef typename GlobalLoadIteratorC::Scalar ScalarC;
/// The scalar for D.
typedef typename GlobalStoreIteratorD::Scalar ScalarD;
/// The params.
struct Params {
/// The strides for H and W in the different iterations of the epilogue.
Index stride_h, stride_w;
/// The params for the C iterator.
typename GlobalLoadIteratorC::Params iterator_c;
/// The params for the D global iterator.
typename GlobalStoreIteratorD::Params iterator_d;
/// The params for the D shared store iterator.
typename SharedStoreIteratorD::Params shared_store_iterator_d;
/// The params for the D shared load iterator.
typename SharedLoadIteratorD::Params shared_load_iterator_d;
/// The functor params.
typename Functor::Params functor;
/// Setup the params.
template <typename GemmDesc_>
CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
// The parameters for the functor.
int error_code = functor.initialize(desc);
if (error_code) {
return error_code;
}
// At the end of the H iteration, we jump over a number of columns.
this->stride_h = desc.ldd * Delta::kH;
// Nothing to do here.
this->stride_w = 0;
// Setup the params for the global memory iterator for C.
error_code = iterator_c.initialize(
reinterpret_cast<ScalarC const*>(desc.d_c), desc.ldc, desc.n, stride_w, Delta::kW);
if (error_code) {
return error_code;
}
// Setup the params for the global memory iterator for D.
return iterator_d.initialize(
reinterpret_cast<ScalarD*>(desc.d_d), desc.ldd, desc.n, stride_w, Delta::kW);
}
};
/// The shared memory storage to exchange data.
union StreamSharedStorage {
// The storage for the store iterator.
typename SharedStoreIteratorD::SharedStorage store;
// The storage for the load iterator.
typename SharedLoadIteratorD::SharedStorage load;
};
/// The shared memory to swizzle the data in the epilogue.
struct SharedStorage {
// The storage for the shared stream D.
StreamSharedStorage shared_stream;
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_, typename EpilogueFunctor_, typename Index_ = int>
struct GemmEpilogueTraitsHelper {
/// The scalar.
typedef typename EpilogueFunctor_::Scalar Scalar;
/// The output tile.
typedef typename GemmConfig_::OutputTile OutputTile;
/// The number of iterations in the epilogue.
typedef Shape<1,
GemmConfig_::MultiplyAdd::AccumulatorsPerThread::kH /
GemmConfig_::kAccumulatorsPerLdsB,
GemmConfig_::kAccumulatorsPerLdsB>
Iterations;
// The iteration strides in the H/W dimension.
typedef Shape<0,
GemmConfig_::kAccumulatorsPerLdsB*(
GemmConfig_::Warps::kH* GemmConfig_::MultiplyAdd::ThreadsPerWarp::kH - 1),
0>
Delta;
/// The functor to do the math in the epilogue.
typedef EpilogueFunctor_ Functor;
/// The traits class to build the iterator to store to shared memory for D.
typedef GemmSharedStoreTileDTraits<
// The pointer is float.
typename Functor::Scalar,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The number of scalars per STS.
GemmConfig_::kScalarsPerStsD,
// The skew -- 128 / sizeof(ScalarD) / kScalarsPerStsD is the number of threads involved in
// a single STS. We divide by 2 as our objective is to add a skew to the odd threads to
// avoid bank conflicts between odd and even threads.
128 / sizeof(typename GemmConfig_::ScalarD) / GemmConfig_::kScalarsPerStsD / 2 *
GemmConfig_::kScalarsPerStsD>
SharedStoreTileTraits;
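// Worked example of the skew above (assuming ScalarD = float, kScalarsPerStsD = 4):
// 128 / sizeof(float) / 4 = 8 threads cooperate on one STS; halving gives 4, and
// multiplying back by kScalarsPerStsD yields a skew of 16 scalars between the
// odd and even threads.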
/// The iterator to store D to shared memory.
typedef TileStoreIterator<SharedStoreTileTraits,
typename SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorD;
/// The shared store transformer for D.
typedef Copy<typename SharedStoreIteratorD::Fragment> SharedStoreTransformerD;
/// The traits class to build the iterator to load from shared memory for D.
typedef GemmSharedLoadTileDTraits<
// The pointer is float.
typename Functor::Scalar,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The number of columns of the output tile written by iteration.
GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsD,
// The skew.
SharedStoreTileTraits::kSkew>
SharedLoadTileTraits;
/// The iterator to load D from shared memory.
typedef TileLoadIterator<SharedLoadTileTraits,
typename SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorD;
/// The traits class to build the iterator to load data from global memory for C^N.
typedef GemmGlobalTileCdTraits<
// The pointer is float const.
typename GemmConfig_::ScalarC const,
// The tile has size (N / Iterations)xM in GEMM's terminology.
Shape<1,
GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// How many elements do we jump over at each iteration?
Iterations::kW,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgC>
GlobalLoadTileTraits;
/// The iterator to load C.
typedef GemmGlobalIteratorCd<GlobalLoadTileTraits, Index_> GlobalLoadIteratorC;
/// The transformer for C.
typedef Copy<typename GlobalLoadIteratorC::Fragment> GlobalTransformerC;
/// The traits class to build the iterator to store data to global memory for D^N.
typedef GemmGlobalTileCdTraits<
// The pointer is float.
typename GemmConfig_::ScalarD,
// The tile has size (N / Iterations)xM in GEMM's terminology.
Shape<1,
GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// How many elements do we jump over at each iteration?
Iterations::kW,
// The number of scalars per STG (STG.32 or STG.128, etc).
GemmConfig_::kScalarsPerStgD>
GlobalStoreTileTraits;
/// The iterator to store D.
typedef GemmGlobalIteratorCd<GlobalStoreTileTraits, Index_> GlobalStoreIteratorD;
/// The transformer for D.
typedef Copy<typename GlobalStoreIteratorD::Fragment> GlobalTransformerD;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The GEMM config.
typename GemmConfig_,
/// The epilogue functor to do the math in the epilogue.
typename EpilogueFunctor_,
/// The index.
typename Index_ = int,
/// The helper to create the traits class.
typename Helper_ = GemmEpilogueTraitsHelper<GemmConfig_, EpilogueFunctor_, Index_> >
struct SimplifiedGemmEpilogueTraits : public GemmEpilogueTraits<
// The output tile.
typename GemmConfig_::OutputTile,
// The accumulators.
typename GemmConfig_::Accumulators,
// The global iterator for C.
typename Helper_::GlobalLoadIteratorC,
// The transformer for C.
typename Helper_::GlobalTransformerC,
// The transformer for D.
typename Helper_::GlobalTransformerD,
// The global iterator for D.
typename Helper_::GlobalStoreIteratorD,
// The iterator to store D to shared memory.
typename Helper_::SharedStoreIteratorD,
// The shared store transformer for D.
typename Helper_::SharedStoreTransformerD,
// The iterator to load D from shared memory.
typename Helper_::SharedLoadIteratorD,
// The number of iterations.
typename Helper_::Iterations,
// The strides between iterations.
typename Helper_::Delta,
// The functor to be used in the epilogue.
EpilogueFunctor_,
// The index.
Index_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass

View File

@@ -1,175 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements efficient loading of the thread block-level tile from global memory and
storing to shared memory.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/iterator_access.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The load iterator.
typename LoadIterator_,
/// The store iterator to copy to shared memory.
typename StoreIterator_,
/// The transformer to be applied after the data has been copied from global memory.
typename Transformer_>
struct GlobalLoadStreamBase {
/// The load iterator.
typedef LoadIterator_ LoadIterator;
/// The transformer.
typedef Transformer_ Transformer;
/// The store iterator to write to shared memory.
typedef StoreIterator_ StoreIterator;
/// The fragment that is copied from shared memory.
typedef typename LoadIterator::Fragment FetchedFragment;
/// The fragment that is obtained after the transformation by the transformer.
typedef typename Transformer::OutputFragment TransformedFragment;
/// Make sure the fragments match.
static_assert((platform::is_same<FetchedFragment, typename Transformer::InputFragment>::value),
"");
/// The output fragment.
typedef TransformedFragment Fragment;
/// Make sure the transformed fragment is the same as the store fragment.
static_assert((platform::is_same<TransformedFragment, typename StoreIterator::Fragment>::value),
"");
/// The layout.
static MatrixLayout::Kind const kLayout = LoadIterator::kLayout;
/// The scalar type of the iterator.
typedef typename LoadIterator::Scalar Scalar;
/// The pointer.
typedef typename LoadIterator::Pointer Pointer;
/// The index.
typedef typename LoadIterator::Index Index;
/// The params.
struct Params {
// The load iterator.
typename LoadIterator::Params load_iterator;
// The store iterator.
typename StoreIterator::Params store_iterator;
/// Setup the params.
CUTLASS_HOST_DEVICE int initialize(Pointer pointer, Index ld) {
int error_code = load_iterator.initialize(pointer, ld);
if (error_code) {
return error_code;
}
return store_iterator.initialize();
}
};
/// The amount of storage in shared memory needed to store the tile.
typedef typename StoreIterator::SharedStorage SharedStoreStorage;
/// The storage in shared memory needed by that stream.
union SharedStorage {
// The load iterator.
typename LoadIterator::SharedStorage load_iterator;
// The store iterator.
SharedStoreStorage store_iterator;
};
/// Ctor.
CUTLASS_DEVICE GlobalLoadStreamBase(Params const& params,
SharedStorage& shared_storage,
Coord<3> const bounds,
Coord<3> const& block)
: load_iterator(params.load_iterator, bounds, block),
transformer(),
store_iterator(params.store_iterator, shared_storage.store_iterator)
{
fetched_fragment.clear();
}
/// Load the data from global memory to the fetch fragment.
CUTLASS_DEVICE void copy() { iterator_load(load_iterator, fetched_fragment); }
/// Commit the data.
CUTLASS_DEVICE void commit() {
transformer.transform(fetched_fragment, transformed_fragment);
iterator_store(store_iterator, transformed_fragment);
store_iterator.inc_stage();
}
/// Execute the residue code.
CUTLASS_DEVICE void residue(Index k, bool skip_clear = false) {
load_iterator.residue(k);
if (!skip_clear) {
fetched_fragment.clear();
}
}
/// The iterator.
LoadIterator load_iterator;
/// The fragment fetched from global memory.
FetchedFragment fetched_fragment;
/// The transformer.
Transformer transformer;
/// The fragment to convert the data after it has been fetched from global memory.
TransformedFragment transformed_fragment;
/// The store iterator.
StoreIterator store_iterator;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The load iterator.
typename LoadIterator_,
/// The store iterator to copy to shared memory.
typename StoreIterator_,
/// The transformer to be applied after the data has been copied from global memory.
typename Transformer_ = Copy<typename LoadIterator_::Fragment> >
struct GlobalLoadStream : public GlobalLoadStreamBase<LoadIterator_, StoreIterator_, Transformer_> {
/// The base class.
typedef GlobalLoadStreamBase<LoadIterator_, StoreIterator_, Transformer_> Base;
/// Ctor.
CUTLASS_DEVICE GlobalLoadStream(typename Base::Params const& params,
typename Base::SharedStorage& shared_storage,
Coord<3> const& bounds,
Coord<3> const& block)
: Base(params, shared_storage, bounds, block) {}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
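The copy()/commit() split above is what the software pipeline in gemm.h is built on: loads for the next K-slice are issued before the math on the current slice, and the shared-memory store is deferred until the data is needed. Schematically, one main-loop stage drives the stream in this order (a sketch of the call sequence from gemm.h, not new API):

// One pipelined main-loop stage, sketched from the call sites in gemm.h:
//   global_stream.copy();              // issue LDGs for the next K-slice (non-blocking)
//   ... multiply-accumulate on the current K-slice out of shared memory ...
//   global_stream.commit();            // transform the fetched fragment, STS to shared memory
//   Traits::shared_store_fence(true);  // __syncthreads(): publish the tile before it is read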

View File

@@ -1,478 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines iterators for efficiently loading and storing to global memory.
*/
#pragma once
#include <cutlass/coord.h>
#include <cutlass/util/platform.h>
#include <cutlass/gemm/gemm_operand.h>
#include <cutlass/matrix_traits.h>
#include <cutlass/predicate_vector.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/tile_iterator.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
// The following functor reshapes a tile of threads to match a tile of data. The idea is that when
// the user wants to build the iterator traits, he/she may want to specify the tile independently
// from the number of scalars loaded/stored per instruction. For example, in the row-major version
// with a tile of size 128x8, the user may want the iterator to work with 32x8 threads if each
// thread loads 1 scalar per LDG. If the user changes to 4 scalars per LDG, then the tile of
// threads has to change. The code below detects that and corrects the configuration
// automatically - it is a helper for when the user does not specify the right configuration.
template <typename Tile_, typename Threads_, bool = (Tile_::kW < Threads_::kW)>
struct ReshapeThreads {
typedef Threads_ Threads;
};
template <typename Tile_, typename Threads_>
struct ReshapeThreads<Tile_, Threads_, true> {
typedef Shape<Threads_::kD, Threads_::kH * Threads_::kW / Tile_::kW, Tile_::kW, 1> Threads;
};
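// Worked example: Tile::kW = 32 with Threads_ = Shape<1, 4, 64> triggers the
// specialization (32 < 64) and yields Threads = Shape<1, 4 * 64 / 32, 32, 1>
// = Shape<1, 8, 32, 1>: the same 256 threads, refolded so that no thread falls
// outside the tile's contiguous dimension.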
////////////////////////////////////////////////////////////////////////////////////////////////////
template <GemmOperand::Kind kOperand_,
MatrixLayout::Kind kLayout_,
typename Scalar_,
typename Tile_,
typename Threads_,
int kAccessSize_>
struct GemmGlobalTileTraits {
/// Identity of the operand
static GemmOperand::Kind const kOperand = kOperand_;
/// The layout.
static MatrixLayout::Kind const kLayout = kLayout_;
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The number of scalars per LDG/STG.
static int const kAccessSize = kAccessSize_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kGlobal;
/// The tile shape
typedef typename ReshapeTile<Tile_, kAccessSize_>::Tile Tile;
/// The threads shape
typedef typename ReshapeThreads<Tile, Threads_>::Threads Threads;
/// The relative offset between two elements in the H/W dimension in adjacent threads.
typedef Shape<1, 1, Tile::kC> ThreadsDelta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, Threads::kH, Threads::kW * kAccessSize> Delta;
/// Strides for immediate offset computation
typedef Shape<0, 0, Threads::kW * ThreadsDelta::kW, kAccessSize> ImmediateOffsetStrides;
/// The number of iterations needed to load/store the tile.
typedef Shape<1, Tile::kH / Threads::kH, Tile::kW / Threads::kW, Tile::kC / kAccessSize>
Iterations;
typedef GemmMultiplicandTraits<Tile, kOperand, kLayout> MultiplicandTraits;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
return make_Coord(0, thread_offset_h, thread_offset_w, 0);
}
};
};
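// Worked example of ThreadOffset above: with Threads::kW = 32 and
// ThreadsDelta = Shape<1, 1, Tile::kC>, threadIdx.x = 37 yields
// thread_offset_h = 37 / 32 = 1 and thread_offset_w = (37 % 32) * Tile::kC,
// keeping consecutive threads contiguous along W for coalesced accesses.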
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Tile_, typename Threads_, int kStrideH_, int kAccessSize_>
struct GemmGlobalTileCdTraits : public GemmGlobalTileTraits<GemmOperand::kC,
MatrixLayout::kColumnMajor,
Scalar_,
Tile_,
Threads_,
kAccessSize_> {
/// The base class.
typedef GemmGlobalTileTraits<GemmOperand::kC,
MatrixLayout::kColumnMajor,
Scalar_,
Tile_,
Threads_,
kAccessSize_>
Base;
/// The stride in the H dimension.
static int const kStrideH = kStrideH_;
/// Override the strides in each dimension between different loads/stores.
typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> Delta;
typedef typename Base::Iterations Iterations;
typedef typename Base::Threads Threads;
typedef typename Base::ThreadsDelta ThreadsDelta;
typedef typename Base::ImmediateOffsetStrides ImmediateOffsetStrides;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int thread_offset_h = threadIdx.x / Threads::kW * kStrideH * Iterations::kH;
int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
return make_Coord(0, thread_offset_h, thread_offset_w, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename TileTraits_, typename Index_ = int>
struct GemmGlobalIteratorAb
: public TileLoadIterator<TileTraits_,
typename TileTraits_::Scalar,
TileTraits_::MultiplicandTraits::kKstrided ? IteratorAdvance::kH
: IteratorAdvance::kW,
MemorySpace::kGlobal,
Index_> {
/// This class.
typedef GemmGlobalIteratorAb<TileTraits_, Index_> This_;
/// The base class.
typedef TileLoadIterator<TileTraits_,
typename TileTraits_::Scalar,
TileTraits_::MultiplicandTraits::kKstrided ? IteratorAdvance::kH
: IteratorAdvance::kW,
MemorySpace::kGlobal,
Index_>
Base;
/// The layout.
static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
/// Fragment type loaded by the iterator
typedef typename Base::Fragment Fragment;
/// The scalar.
typedef typename TileTraits_::Scalar Scalar;
/// The threads.
typedef typename TileTraits_::Threads Threads;
/// The index.
typedef Index_ Index;
/// The thread offset
typedef typename TileTraits_::ThreadOffset ThreadOffset;
/// Specifies in which dimension post-increment accesses advance.
static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
typedef cutlass::PredicateVector<ShapeCount<typename Base::Iterations>::kCount> PredicateVector;
/// Iterator parameters type
typedef typename Base::Params BaseParams;
struct Params : public BaseParams {
/// Initializes params to load a strip-mined tile, given pointer and stride_h.
CUTLASS_HOST_DEVICE int initialize(Scalar const* ptr, Index stride_h) {
Index inc_d = 0;
Index inc_advance = 0;
// Move by some columns for each iteration in the H dimension.
Index inc_h = Base::Delta::kH * stride_h;
// Move by some more columns in the number of iterations if the D dimension is > 1.
if (Base::Delta::kD > 0) {
inc_d = Base::Delta::kD * stride_h - (Base::Iterations::kH - 1) * inc_h;
}
// Move to the beginning of the next iteration.
if (kAdvance == IteratorAdvance::kH && Base::Delta::kD > 0) {
inc_advance = inc_d;
} else if (kAdvance == IteratorAdvance::kH) {
inc_advance = inc_h;
} else if (Base::Delta::kD > 0) {
inc_advance = (Base::Iterations::kW + 0) * ShapeCount<typename Base::Delta>::kWc -
(Base::Iterations::kH - 1) * inc_h -
(Base::Iterations::kD - 1) * Base::Delta::kD * stride_h;
} else {
inc_advance = (Base::Iterations::kW + 0) * ShapeCount<typename Base::Delta>::kWc -
(Base::Iterations::kH - 1) * inc_h;
}
Base::Params::initialize(ptr, 0, stride_h, 0, inc_d, inc_h, 0, inc_advance);
return 0;
}
};
/// Offset of an individual lane from the start of the tile
Coord<4> thread_offset;
/// The parameters
Params params;
CUTLASS_DEVICE void initialize_predicates(const Coord<3>& bounds, const Coord<3>& block) {
// Setup the masks to control loads.
predicates.fill(0);
int bounds_h, bounds_w;
if (kAdvance == IteratorAdvance::kH) {
bounds_w = bounds[2] - block[2];
bounds_h = bounds[1];
} else {
bounds_w = bounds[1];
bounds_h = bounds[2] - block[1];
}
// Fill in the bits of the predicate vector.
for (int d = 0; d < Base::Iterations::kD; ++d) {
for (int h = 0; h < Base::Iterations::kH; ++h) {
for (int w = 0; w < Base::Iterations::kW; ++w) {
for (int c = 0; c < Base::Iterations::kC; ++c) {
bool flag = w * Base::Delta::kW < bounds_w;
if (kAdvance == IteratorAdvance::kH) {
flag = flag && (h * Base::Delta::kH + d * Base::Delta::kD) < bounds_h;
} else {
flag = flag && (h * Base::Delta::kH) < bounds_h;
}
int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
predicates.set(bit, flag);
}
}
}
}
}
/// Ctor.
CUTLASS_DEVICE GemmGlobalIteratorAb(Params const& _params,
const Coord<3>& bounds,
const Coord<3>& block,
ThreadOffset thread_offset_func = ThreadOffset())
: params(_params) {
thread_offset = thread_offset_func();
// The column.
Index block_h = thread_offset[1];
// The contiguous dimension.
Index block_w = thread_offset[2];
// Add the blocks indices.
if (kAdvance == IteratorAdvance::kH) {
block_h += block[1];
block_w += block[2];
} else {
block_h += block[2];
block_w += block[1];
}
// Setup the pointer.
params.pointer += (block_h * params.stride_h + block_w);
// Initialize predicates
initialize_predicates(bounds, make_Coord(0, block_h, block_w));
}
/// Increment the pointer in the H dimension.
CUTLASS_DEVICE void inc_h() { params.pointer += params.inc_h; }
/// Increment the pointer in the D dimension.
CUTLASS_DEVICE void inc_d() { params.pointer += params.inc_d; }
/// Increment the pointer to move to the next iteration.
CUTLASS_DEVICE void inc_advance() { params.pointer += params.inc_advance; }
/// Returns the current pointer
CUTLASS_HOST_DEVICE
Scalar const* data() const { return params.pointer; }
/// That's the residue! Update the predicates.
CUTLASS_DEVICE void residue(Index k) {
// The coordinates of the thread.
Index block_h = thread_offset[1];
// The contiguous dimension.
Index block_w = thread_offset[2];
// Update the predicate vector.
for (int d = 0; d < Base::Iterations::kD; ++d) {
for (int h = 0; h < Base::Iterations::kH; ++h) {
for (int w = 0; w < Base::Iterations::kW; ++w) {
for (int c = 0; c < Base::Iterations::kC; ++c) {
Index offset = 0;
if (kAdvance == IteratorAdvance::kH) {
offset += block_h + h * Base::Delta::kH + d * Base::Delta::kD;
} else {
offset += block_w + w * Base::Delta::kW;
}
int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
if (offset >= k) {
predicates.set(bit, false);
}
}
}
}
}
}
/// Is the iterator valid?
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
int const bit = ComputeOffsetFromShape<typename Base::Iterations>::get(d, h, w, c);
return predicates[bit];
}
/// The predicates.
PredicateVector predicates;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename TileTraits_, typename Index_ = int>
struct GemmGlobalIteratorCd : public TileIteratorBase<TileTraits_,
typename TileTraits_::Scalar,
IteratorAdvance::kH,
MemorySpace::kGlobal,
Index_> {
/// This class.
typedef GemmGlobalIteratorCd<TileTraits_, Index_> This_;
/// The base class.
typedef TileIteratorBase<TileTraits_,
typename TileTraits_::Scalar,
IteratorAdvance::kH,
MemorySpace::kGlobal,
Index_>
Base;
/// The layout.
static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
/// The scalar.
typedef typename TileTraits_::Scalar Scalar;
/// The pointer.
typedef typename TileTraits_::Pointer Pointer;
/// The threads.
typedef typename TileTraits_::Threads Threads;
/// The index.
typedef Index_ Index;
/// The thread offset
typedef typename TileTraits_::ThreadOffset ThreadOffset;
/// The params.
struct Params {
/// The pointer.
Pointer pointer;
/// The stride in the H dimension to setup the thread in the block.
Index stride_h;
/// The strides to increment the pointer.
Index inc_advance, inc_h;
/// The strides to increment the predicate offset
Index predicate_inc_advance, predicate_inc_h;
/// The column offset to compute the predicate for the columns.
Index predicate_offset;
/// Setup the params.
CUTLASS_HOST_DEVICE int initialize(
Pointer pointer, Index ld, Index bound, Index epilogue_stride_w, Index epilogue_delta_w) {
// The pointer.
this->pointer = pointer;
// Each column of the matrix.
stride_h = TileTraits_::ThreadsDelta::kH * ld;
// Each thread outputs 1 column per iteration. The stride between columns is given by the
// number of scalars that are loaded per LDS for B.
inc_h = ld * TileTraits_::kStrideH;
inc_advance =
(ld - ld * TileTraits_::kStrideH * (Base::Iterations::kH - 1)) + epilogue_stride_w;
predicate_offset = bound;
predicate_inc_h = TileTraits_::kStrideH;
predicate_inc_advance =
-((TileTraits_::kStrideH * (Base::Iterations::kH - 1) - 1) + epilogue_delta_w);
return 0;
}
};
Params params;
/// Offset of an individual lane from the start of the tile
Coord<4> thread_offset;
/// Ctor.
CUTLASS_DEVICE GemmGlobalIteratorCd() {}
/// Ctor.
CUTLASS_DEVICE GemmGlobalIteratorCd(Params const& params,
const Coord<3>& bounds,
const Coord<3>& block,
int offset = 0,
int pred_offset = 0,
ThreadOffset thread_offset_func = ThreadOffset())
: params(params) {
thread_offset = thread_offset_func();
// Each warp works on a different column of the tile.
int const h = thread_offset[1] + block[1];
// Each lane writes a different element.
int const w = thread_offset[2] + block[2];
// Setup the pointer.
this->params.pointer += ((h * params.stride_h + w) + offset);
// Prepare the vector of predicates.
for (int i = 0; i < Base::Iterations::kW; ++i) {
predicates.set(i, w + i * Base::Delta::kW < bounds[2]);
}
this->params.predicate_offset -= (h + pred_offset);
}
/// Increment the pointer in the C dimension.
CUTLASS_DEVICE void inc_c() {}
/// Increment the pointer in the W dimension.
CUTLASS_DEVICE void inc_w() {}
/// Increment the pointer in the H dimension.
CUTLASS_DEVICE void inc_h() {
params.pointer += params.inc_h;
params.predicate_offset -= params.predicate_inc_h;
}
/// Increment the pointer in the D dimension.
CUTLASS_DEVICE void inc_d() {}
/// Increment the pointer to move to the next iteration.
CUTLASS_DEVICE void inc_advance() {
params.pointer += params.inc_advance;
this->params.predicate_offset -= params.predicate_inc_advance;
}
/// Test the validity of the iterator.
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
return predicates.at(w) && params.predicate_offset > 0;
}
/// Returns the raw pointer
CUTLASS_HOST_DEVICE
Pointer data() { return params.pointer; }
CUTLASS_HOST_DEVICE
Pointer const data() const { return params.pointer; }
/// The predicates for the row.
cutlass::PredicateVector<Base::Iterations::kW> predicates;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
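Both iterators above route every access through valid(), so tile edges are handled by masking rather than by per-edge branches. The load helpers in iterator_access.h walk the 4-D iteration space; schematically (a sketch, not the shipped helper):

// Schematic guarded load over Base::Iterations (sketch):
//   int n = 0;
//   for each (d, h, w, c) in Iterations:
//     if (iterator.valid(d, h, w, c))   // out-of-bounds lanes are masked off
//       fragment[n] = *iterator.data(); // pointer advanced via inc_w()/inc_h()/inc_d()
//     ++n;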

View File

@@ -1,141 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear
memory.
*/
#pragma once
#include <cutlass/matrix_traits.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/util/platform.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Helper to describe attributes of GEMM matrix operands
template <GemmOperand::Kind kOperand_, MatrixLayout::Kind kLayout_>
struct GemmOperandTraitsAb {
static const bool Congruous =
(kOperand_ == GemmOperand::kA ^ kLayout_ == MatrixLayout::kRowMajor);
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <GemmOperand::Kind kOperand_, typename Tile_>
struct GetExtent;
template <typename Tile_>
struct GetExtent<GemmOperand::kA, Tile_> {
static const int kExtent = Tile_::kW;
};
template <typename Tile_>
struct GetExtent<GemmOperand::kB, Tile_> {
static const int kExtent = Tile_::kH;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Determines the shape of a multiplicand tile in terms of strided (H) and contiguous (W)
/// dimensions
template <typename ThreadBlockTile_, GemmOperand::Kind Usage, MatrixLayout::Kind Layout>
struct GemmMultiplicandTraits {
// Only defined for A or B
static_assert(Usage == GemmOperand::kA || Usage == GemmOperand::kB,
"MultiplicandTileShape defined only for A or B operands.");
/// Shape of GEMM thread block tile (K, N, M)
typedef ThreadBlockTile_ ThreadBlockTile;
/// Identifies multiplicand
static GemmOperand::Kind const kUsage = Usage;
/// Layout of tile
static MatrixLayout::Kind const kLayout = Layout;
// True if K is the strided dimension
static bool const kKstrided = (kUsage == GemmOperand::kA ^ kLayout == MatrixLayout::kRowMajor);
/// Map the ThreadBlockShape onto (kH, kW) dimensions for A and B operand
typedef typename platform::conditional<
kKstrided,
Shape<1, ThreadBlockTile::kD, GetExtent<Usage, ThreadBlockTile>::kExtent>,
Shape<1, GetExtent<Usage, ThreadBlockTile>::kExtent, ThreadBlockTile::kD> >::type Shape;
};
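// Worked example: for A (kUsage == GemmOperand::kA) in column-major layout,
// kKstrided = true ^ false = true, so a threadblock tile (K, N, M) = (8, 128, 128)
// maps to Shape<1, ThreadBlockTile::kD, kExtent> = Shape<1, 8, 128>:
// K becomes the strided (H) dimension and M the contiguous (W) dimension.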
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Projects a coordinate (K, N, M) onto inner and outer dimensions defined for each
/// operand.
template <GemmOperand::Kind operand, bool Kstrided = true>
struct ProjectOperand;
/// Project A operand - (0, K, M)
template <bool Kstrided>
struct ProjectOperand<GemmOperand::kA, Kstrided> {
CUTLASS_HOST_DEVICE
static Coord<3> project(Coord<3> const &coord) {
if (Kstrided) {
return make_Coord(0, coord[0], coord[2]);
} else {
return make_Coord(0, coord[2], coord[0]);
}
}
};
/// Project B operand - (0, K, N)
template <bool Kstrided>
struct ProjectOperand<GemmOperand::kB, Kstrided> {
CUTLASS_HOST_DEVICE
static Coord<3> project(Coord<3> const &coord) {
if (Kstrided) {
return make_Coord(0, coord[0], coord[1]);
} else {
return make_Coord(0, coord[1], coord[0]);
}
}
};
/// Project C operand - (0, N, M)
template <>
struct ProjectOperand<GemmOperand::kC, true> {
CUTLASS_HOST_DEVICE
static Coord<3> project(Coord<3> const &coord) { return make_Coord(0, coord[1], coord[2]); }
};
/// Project D operand - (0, N, M)
template <>
struct ProjectOperand<GemmOperand::kD, true> {
CUTLASS_HOST_DEVICE
static Coord<3> project(Coord<3> const &coord) { return make_Coord(0, coord[1], coord[2]); }
};
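// Worked example: for coord (K, N, M) = (8, 128, 96):
//   ProjectOperand<GemmOperand::kA, true>::project  -> (0, 8, 96)   // K strided, M contiguous
//   ProjectOperand<GemmOperand::kB, false>::project -> (0, 128, 8)  // N strided, K contiguous
//   ProjectOperand<GemmOperand::kC, true>::project  -> (0, 128, 96) // N strided, M contiguous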
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass

View File

@@ -1,113 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines abstractions for managing loading and storing fragments to shared memory in the
efficient GEMM pipeline.
*/
#pragma once
#include <cutlass/gemm/gemm_shared_tile.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The load iterator.
typename Iterator_,
/// The transformer to be applied after the data has been copied from shared memory.
typename Transformer_ = Copy<typename Iterator_::Fragment> >
struct SharedLoadStream {
/// The load iterator.
typedef Iterator_ Iterator;
/// The transformer.
typedef Transformer_ Transformer;
/// The fragment that is copied from shared memory.
typedef typename Iterator::Fragment FetchedFragment;
/// The fragment that is obtained after the transformation by the transformer.
typedef typename Transformer::OutputFragment TransformedFragment;
/// Make sure the fragments match.
static_assert((platform::is_same<FetchedFragment, typename Transformer::InputFragment>::value),
"");
/// The output fragment.
typedef TransformedFragment Fragment;
/// The params.
struct Params {
/// The iterator params.
typename Iterator::Params iterator;
/// Setup the params.
CUTLASS_HOST_DEVICE int initialize() { return iterator.initialize(); }
};
/// The storage in shared memory needed by that stream.
typedef typename Iterator::Storage SharedStorage;
/// Ctor.
CUTLASS_DEVICE SharedLoadStream() {}
/// Ctor.
CUTLASS_DEVICE SharedLoadStream(Params const &params, SharedStorage &shared_storage) {
this->initialize(params, shared_storage);
}
/// Initialize the stream.
CUTLASS_DEVICE void initialize(Params const &params, SharedStorage &shared_storage) {
// The iterator.
iterator = Iterator(params.iterator, shared_storage);
// The transformer.
transformer = Transformer();
}
/// Load the data from shared memory to the fetch fragment.
CUTLASS_DEVICE void copy(FetchedFragment &fetched) { shared_iterator_load(iterator, fetched); }
/// Load the data from shared memory to the fetch fragment.
CUTLASS_DEVICE void copy(int d, FetchedFragment &fetched) {
shared_iterator_load(iterator, fetched, d);
}
/// Commit the data.
CUTLASS_DEVICE void commit(FetchedFragment &fetched, TransformedFragment &transformed) {
transformer.transform(fetched, transformed);
}
/// Increment the stage.
CUTLASS_DEVICE void inc_stage() { iterator.inc_stage(); }
/// The iterator.
Iterator iterator;
/// The transformer.
Transformer transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
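SharedLoadStream splits each step into a raw fetch (copy) and a transform into the fragment the math functor consumes (commit), which lets a double-buffered main loop prefetch step d+1 while consuming step d. A minimal sketch of that pattern, with hypothetical stand-in types and a fake shared load:

#include <cstdio>

struct Stream {
  float fetched[2][4];      // FetchedFragment double buffer
  float transformed[2][4];  // TransformedFragment double buffer
  void copy(int d, float (&dst)[4]) {
    for (int i = 0; i < 4; ++i) dst[i] = float(d * 4 + i);  // fake shared-memory load
  }
  void commit(float const (&src)[4], float (&dst)[4]) {
    for (int i = 0; i < 4; ++i) dst[i] = src[i];  // identity transform, like Copy<>
  }
};

int main() {
  Stream s;
  s.copy(0, s.fetched[0]);  // prologue: fetch step 0
  for (int step = 0; step < 4; ++step) {
    s.copy(step + 1, s.fetched[(step + 1) % 2]);              // prefetch the next step
    s.commit(s.fetched[step % 2], s.transformed[step % 2]);   // transform the current step
    std::printf("step %d ready: %g\n", step, s.transformed[step % 2][0]);
  }
  return 0;
}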

@@ -1,406 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines iterators for efficiently loading and storing tiles to and from shared memory.
*/
#pragma once
#include <cutlass/gemm/gemm_operand.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerSts_>
struct GemmSharedStoreTileAbTraits {
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The tile.
typedef typename ReshapeTile<Tile_, kScalarsPerSts_>::Tile Tile;
/// The threads.
typedef Threads_ Threads;
/// The strides to compute the base position of the thread.
typedef Shape<0, ShapeCount<Tile>::kWc, Tile::kC, kScalarsPerSts_> ThreadsStrides;
/// The skew.
static int const kSkew = 0;
/// The number of scalars per STS.
static int const kAccessSize = kScalarsPerSts_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of iterations needed to load/store the tile.
typedef Shape<1,
Tile::kH / Threads::kH,
Tile::kW / Threads::kW,
Tile::kC / Threads::kC / kAccessSize>
Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kAccessSize> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kAccessSize>
ImmediateOffsetStrides;
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerSts_, int kSkew_>
struct GemmSharedStoreWithSkewTileAbTraits {
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The tile without skews.
typedef typename ReshapeTile<Tile_, kScalarsPerSts_>::Tile TileWithoutSkew;
/// The tile.
typedef typename ReshapeTile<Shape<Tile_::kD, Tile_::kH, Tile_::kW + kSkew_>,
kScalarsPerSts_>::Tile Tile;
/// The threads.
typedef Threads_ Threads;
/// The skew.
static int const kSkew = kSkew_;
/// The number of scalars per STS.
static int const kAccessSize = kScalarsPerSts_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of iterations needed to load/store the tile.
typedef Shape<1, TileWithoutSkew::kH / Threads::kW, TileWithoutSkew::kW / Threads::kH> Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, ShapeCount<Tile>::kWc, Threads::kH * kAccessSize> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, ShapeCount<Tile>::kWc, Threads::kH * kAccessSize> ImmediateOffsetStrides;
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
return make_Coord(0, 0, offset, 0);
}
};
protected:
/// The strides to compute the base position of the thread.
typedef Shape<0, kScalarsPerSts_, ShapeCount<Tile>::kHwc / Threads::kW> ThreadsStrides;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
typename OutputTile_,
typename Warps_,
typename ThreadsPerWarp_,
typename InstructionShape_,
int kStages_,
int kScalarsPerLds_,
int kSkew_ = 0>
struct GemmSharedLoadTileATraits {
static GemmOperand::Kind const kOperand = GemmOperand::kA;
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The tile without skew.
typedef Shape<kStages_,
OutputTile_::kD / InstructionShape_::kD,
GetExtent<kOperand, OutputTile_>::kExtent * InstructionShape_::kD>
TileWithoutSkew_;
/// The tile with skew.
typedef Shape<kStages_, TileWithoutSkew_::kH, TileWithoutSkew_::kW + kSkew_> TileWithSkew;
/// The tile without skew after reshaping.
typedef typename ReshapeTile<TileWithoutSkew_, kScalarsPerLds_>::Tile TileWithoutSkew;
/// The tile.
typedef typename ReshapeTile<TileWithSkew, kScalarsPerLds_>::Tile Tile;
/// The number of warps.
typedef Warps_ Warps;
/// The threads in a warp.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of scalars per LDS.
static int const kAccessSize = kScalarsPerLds_;
/// The skew.
static int const kSkew = kSkew_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of warps.
static int const kWarps = GetExtent<kOperand, Warps>::kExtent;
/// The number of threads in one dimension of the warp.
static int const kThreadsPerWarp = GetExtent<kOperand, ThreadsPerWarp>::kExtent;
/// The number of iterations needed to load/store the tile.
typedef Shape<1, 1, TileWithoutSkew::kW / kWarps / kThreadsPerWarp /* / kScalarsPerLds*/>
Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0>
ImmediateOffsetStrides;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// Extract the warp.
int const warp = threadIdx.x / kWarpSize % Warps::kW;
// Compute the row offset for each thread
int const lane = (threadIdx.x & 0x0e) / 2;
// The offset.
int const offset = (warp * ThreadsPerWarp::kW + lane) * kAccessSize;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
typename OutputTile_,
typename Warps_,
typename ThreadsPerWarp_,
typename InstructionShape_,
int kStages_,
int kScalarsPerLds_,
int kSkew_ = 0>
struct GemmSharedLoadTileBTraits {
static GemmOperand::Kind const kOperand = GemmOperand::kB;
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The tile without skew.
typedef Shape<kStages_,
OutputTile_::kD / InstructionShape_::kD,
GetExtent<kOperand, OutputTile_>::kExtent * InstructionShape_::kD>
TileWithoutSkew_;
/// The tile with skew.
typedef Shape<kStages_, TileWithoutSkew_::kH, TileWithoutSkew_::kW + kSkew_> TileWithSkew;
/// The tile without skew after reshaping.
typedef typename ReshapeTile<TileWithoutSkew_, kScalarsPerLds_>::Tile TileWithoutSkew;
/// The tile.
typedef typename ReshapeTile<TileWithSkew, kScalarsPerLds_>::Tile Tile;
/// The number of warps.
typedef Warps_ Warps;
/// The threads in a warp.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of scalars per LDS.
static int const kAccessSize = kScalarsPerLds_;
/// The skew.
static int const kSkew = kSkew_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of warps.
static int const kWarps = GetExtent<kOperand, Warps>::kExtent;
/// The number of threads in one dimension of the warp.
static int const kThreadsPerWarp = GetExtent<kOperand, ThreadsPerWarp>::kExtent;
/// The number of iterations needed to load/store the tile.
typedef Shape<1, 1, TileWithoutSkew::kW / kWarps / kThreadsPerWarp /* / kAccessSize*/> Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<TileWithSkew::kW, 0, kWarps * kThreadsPerWarp * kAccessSize, 0>
ImmediateOffsetStrides;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// The position of the warp.
int const warp = threadIdx.x / (Warps::kW * kWarpSize);
// Compute the column offset for each thread
int const lane = (threadIdx.x & 0x10) / 8 + (threadIdx.x & 0x01);
// The offset.
int const offset = (warp * ThreadsPerWarp::kH + lane) * kAccessSize;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
typename OutputTile_,
typename Warps_,
typename ThreadsPerWarp_,
int kScalarsPerSts_,
int kSkew_ = 0>
struct GemmSharedStoreTileDTraits {
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The dimension of the output tile.
typedef OutputTile_ OutputTile;
/// The warps in the tile.
typedef Warps_ Warps;
/// The threads in the warps.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of scalars per STS.
static int const kAccessSize = kScalarsPerSts_;
/// The skew.
static int const kSkew = kSkew_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of scalars per thread.
static int const kScalarsPerThread = OutputTile_::kW / Warps::kW / ThreadsPerWarp::kW;
/// The number of threads.
static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
/// The number of scalars per row. We build a tile with 2 rows (to avoid bank conflicts).
static int const kScalarsPerRow = kThreads / 2 * kScalarsPerThread + kSkew;
/// The tile.
typedef Shape<1, 2, kScalarsPerRow / kAccessSize, kAccessSize> Tile;
/// The number of iterations needed to store the tile.
typedef Shape<1, 1, kScalarsPerThread / kAccessSize> Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, 0, Warps::kW * ThreadsPerWarp::kW * kAccessSize> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, 0, Warps::kW * ThreadsPerWarp::kW * kAccessSize> ImmediateOffsetStrides;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// We issue STS.128 in the epilogue to store the accumulators to shared memory. When we use
// STS.128, we have to guarantee that threads in groups of 8 do not have bank conflicts (i.e.,
// they write to different banks).
// Odd threads go to the second half of shared memory.
int const row = threadIdx.x & 0x01;
int const warp_id = (threadIdx.x >> 5);
int const warp_row = (warp_id % Warps::kW);
int const warp_col = (warp_id / Warps::kW);
int hi_halfwarp_offset = OutputTile::kW * ((threadIdx.x >> 4) & 1);
int lo_halfwarp_offset = (((threadIdx.x >> 1) & 0x7) + warp_row * ThreadsPerWarp::kW);
int col = kAccessSize * lo_halfwarp_offset +
warp_col * (ThreadsPerWarp::kH / 2) * OutputTile::kW + hi_halfwarp_offset;
int offset = row * kScalarsPerRow + col;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
typename OutputTile_,
typename Warps_,
typename ThreadsPerWarp_,
int kTileH_,
int kScalarsPerLds_,
int kSkew_ = 0>
struct GemmSharedLoadTileDTraits {
/// The scalar.
typedef typename platform::remove_const<Scalar_>::type Scalar;
/// The pointer.
typedef Scalar_* Pointer;
/// The dimension of the output tile.
typedef OutputTile_ OutputTile;
/// The warps in the tile.
typedef Warps_ Warps;
/// The threads in the warps.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of scalars per LDS.
static int const kAccessSize = kScalarsPerLds_;
/// The skew.
static int const kSkew = kSkew_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The number of scalars per thread.
static int const kScalarsPerThread = OutputTile_::kW / Warps::kW / ThreadsPerWarp::kW;
/// The number of threads.
static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
/// The number of scalars per row. We build a tile with 2 rows (to avoid bank conflicts).
static int const kScalarsPerRow = kThreads / 2 * kScalarsPerThread + kSkew;
/// The tile.
typedef Shape<1, 2, kScalarsPerRow / kAccessSize, kAccessSize> Tile;
// Compute the number of iterations per warp in the Tile::kH dimension.
static int const kIterationsInHPerWarp = kTileH_ / ShapeCount<Warps>::kCount;
// As shown above, the shared memory tile is composed of 2 rows and each row is made of
// kScalarsPerRow scalars. A warp is expected to read from the 1st row, then move to the 2nd row
// and go back to the 1st row. To model that scheme we define the Iterations shape as
// Shape<X, 2, ...>. However, in some cases, we have only 1 iteration per warp. In that case, we
// must define the shape as Shape<1, 1, ...>. The following code does that.
static int const kIterationsH = kIterationsInHPerWarp == 1 ? 1 : 2;
// As soon as we know kIterationsH, it is trivial to compute kIterationsD:
static int const kIterationsD = kIterationsInHPerWarp / kIterationsH;
/// The number of iterations needed to store the tile.
typedef Shape<kIterationsD, kIterationsH, OutputTile::kW / kWarpSize / kAccessSize> Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<OutputTile::kW, kScalarsPerRow, kWarpSize * kAccessSize> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<OutputTile::kW, kScalarsPerRow, kWarpSize * kAccessSize> ImmediateOffsetStrides;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// Each warp works on a different column.
int const h = threadIdx.x / kWarpSize;
// Compute the row.
int const w = (threadIdx.x & (kWarpSize - 1)) * kAccessSize;
int offset = 0;
if (Iterations::kH == 1) {
int const row = h & 0x1;
int const col = h / 2;
offset = row * ShapeCount<Tile>::kWc + col * OutputTile::kW * Iterations::kD + w;
} else {
offset = h * OutputTile::kW * Iterations::kD + w;
}
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
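The shared-load thread offsets above are easiest to see numerically. A host-side simulation of GemmSharedLoadTileATraits::ThreadOffset with illustrative (hypothetical) constants, Warps::kW = 1, ThreadsPerWarp::kW = 8, kAccessSize = 8:

#include <cstdio>

int main() {
  int const kWarpSize = 32, kWarpsW = 1, kThreadsPerWarpW = 8, kAccessSize = 8;
  for (int tid = 0; tid < 32; ++tid) {
    int const warp = tid / kWarpSize % kWarpsW;
    int const lane = (tid & 0x0e) / 2;  // bits [3:1] of the thread ID only
    int const offset = (warp * kThreadsPerWarpW + lane) * kAccessSize;
    // Bits 0 and 4 are masked out, so pairs of threads (and the two
    // half-warps) land on the same shared-memory offset.
    std::printf("tid %2d -> lane %d, offset %3d\n", tid, lane, offset);
  }
  return 0;
}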

@@ -1,747 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of complete GEMM computation.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/gemm/clear_accumulators.h>
#include <cutlass/gemm/gemm_global_stream.h>
#include <cutlass/gemm/gemm_operand.h>
#include <cutlass/gemm/gemm_shared_stream.h>
#include <cutlass/gemm/identity_block_swizzle.h>
#include <cutlass/matrix_traits.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/tile_iterator.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The scalar type for A.
typename ScalarA_,
/// The scalar type for B.
typename ScalarB_,
/// The scalar type for C.
typename ScalarC_,
/// The scalar type for D.
typename ScalarD_,
/// The output tile size for the GEMM KxNxM.
typename OutputTile_,
/// The functor to do the math.
typename MultiplyAdd_,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_,
/// The number of scalars per STS for A.
int kScalarsPerStsA_,
/// The number of scalars per LDS for A.
int kScalarsPerLdsA_,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_,
/// The number of scalars per STS for B.
int kScalarsPerStsB_,
/// The number of scalars per LDS for B.
int kScalarsPerLdsB_,
/// The number of scalars per LDG for C and STG for D.
int kScalarsPerLdgCAndStgD_,
/// The number of scalars per STS for D.
int kScalarsPerStsD_,
/// The number of scalars per LDS for D.
int kScalarsPerLdsD_,
/// The number of stages in shared memory to do single/double/triple-buffering.
int kStages_>
struct GemmConfig {
//
/// The scalar for A.
typedef ScalarA_ ScalarA;
/// The scalar for B.
typedef ScalarB_ ScalarB;
/// The scalar for C.
typedef ScalarC_ ScalarC;
/// The scalar for D.
typedef ScalarD_ ScalarD;
/// The tile.
typedef OutputTile_ OutputTile;
/// The functor to do D = A*B + C.
typedef MultiplyAdd_ MultiplyAdd;
/// The shape of the instruction.
typedef typename MultiplyAdd::InstructionShape InstructionShape;
/// The number of accumulators per warp.
typedef typename MultiplyAdd::AccumulatorsPerWarp AccumulatorsPerWarp;
/// The accumulators.
typedef typename MultiplyAdd::Accumulators Accumulators;
/// The number of warps.
typedef typename ShapeDiv<OutputTile, AccumulatorsPerWarp>::Shape Warps;
/// The default warp size (32 threads per warp).
static int const kWarpSize = cutlass::kWarpSize;
/// The number of threads.
static int const kThreads = ShapeCount<Warps>::kCount * kWarpSize;
/// The number of scalars per LDG/STS/LDS for A.
static int const kScalarsPerLdgA = kScalarsPerLdgA_;
static int const kScalarsPerStsA = kScalarsPerStsA_;
static int const kScalarsPerLdsA = kScalarsPerLdsA_;
/// The number of scalars per LDG/STS/LDS for B.
static int const kScalarsPerLdgB = kScalarsPerLdgB_;
static int const kScalarsPerStsB = kScalarsPerStsB_;
static int const kScalarsPerLdsB = kScalarsPerLdsB_;
/// The number of scalars per LDG for C.
static int const kScalarsPerLdgC = kScalarsPerLdgCAndStgD_;
/// The number of scalars per STS/LDS/STG for D.
static int const kScalarsPerStgD = kScalarsPerLdgCAndStgD_;
static int const kScalarsPerStsD = kScalarsPerStsD_;
static int const kScalarsPerLdsD = kScalarsPerLdsD_;
/// The number of accumulators that are going to be fed from one LDS A/B.
static int const kAccumulatorsPerLdsA = kScalarsPerLdsA / InstructionShape::kD;
static int const kAccumulatorsPerLdsB = kScalarsPerLdsB / InstructionShape::kD;
/// The number of stages in shared memory to implement double, triple, or deeper buffering.
static int const kStages = kStages_;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind, typename GemmConfig_>
struct GemmTileTraitsHelperA {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarA Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
/// The traits class to build the iterator to load data from global memory for A^N.
typedef GemmGlobalTileTraits<
// That's A.
GemmOperand::kA,
// A is column-major.
MatrixLayout::kColumnMajor,
// The pointer is float const.
Scalar const,
// The tile has size KxM in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgA>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for A^N.
typedef GemmSharedStoreTileAbTraits<
// The pointer is float.
MultiplyAddScalar,
// The tile has size KxM in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsA>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for A^N.
typedef GemmSharedLoadTileATraits<
// The pointer is float const.
MultiplyAddScalar const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsA,
// The skew.
0>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarA Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
/// The traits class to build the iterator to load data from global memory for A^T.
typedef GemmGlobalTileTraits<
// That's A.
GemmOperand::kA,
// A is row-major.
MatrixLayout::kRowMajor,
// The pointer is float const.
Scalar const,
// The tile has size MxK in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgA>
GlobalTileTraits;
/// The number of scalars in 4B.
static int const kScalarsIn4B = sizeof(MultiplyAddScalar) > 4 ? 1 : 4 / sizeof(MultiplyAddScalar);
/// The traits class to build the iterator to store data to shared memory for A^T.
typedef GemmSharedStoreWithSkewTileAbTraits<
// The pointer is float.
MultiplyAddScalar,
// The tile has size KxM in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS.
GemmConfig_::kScalarsPerStsA,
// The skew to avoid bank conflicts added in the tile W dimension.
128 / sizeof(MultiplyAddScalar) / GemmConfig_::kScalarsPerStsA /
GlobalTileTraits::Threads::kW * kScalarsIn4B>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for A^T.
typedef GemmSharedLoadTileATraits<
// The pointer is float const.
MultiplyAddScalar const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsA,
// The skew.
SharedStoreTileTraits::kSkew>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind, typename GemmConfig_>
struct GemmTileTraitsHelperB {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarB Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
/// The traits class to build the iterator to load data from global memory for B^N.
typedef GemmGlobalTileTraits<
// That's B.
GemmOperand::kB,
// B is column-major.
MatrixLayout::kColumnMajor,
// The pointer is float const.
Scalar const,
// The tile has size MxK in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgB>
GlobalTileTraits;
/// The number of scalars in 4B.
static int const kScalarsIn4B = sizeof(MultiplyAddScalar) > 4 ? 1 : 4 / sizeof(MultiplyAddScalar);
/// The traits class to build the iterator to store data to shared memory for B^N.
typedef GemmSharedStoreWithSkewTileAbTraits<
// The pointer is float.
MultiplyAddScalar,
// The tile has size KxN in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS.
GemmConfig_::kScalarsPerStsB,
// The skew to avoid bank conflicts added in the tile W dimension.
128 / sizeof(MultiplyAddScalar) / GemmConfig_::kScalarsPerStsB /
GlobalTileTraits::Threads::kW * kScalarsIn4B>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for B^N.
typedef GemmSharedLoadTileBTraits<
// The pointer is float const.
MultiplyAddScalar const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsB,
// The skew.
SharedStoreTileTraits::kSkew>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarB Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
/// The traits class to build the iterator to load data from global memory for B^T.
typedef GemmGlobalTileTraits<
// That's B.
GemmOperand::kB,
// B is row-major.
MatrixLayout::kRowMajor,
// The pointer is float const.
Scalar const,
// The tile has size KxN in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kH>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgB>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for B^T.
typedef GemmSharedStoreTileAbTraits<
// The pointer is float.
MultiplyAddScalar,
// The tile has size KxN in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsB>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for B^T.
typedef GemmSharedLoadTileBTraits<
// The pointer is float const.
MultiplyAddScalar const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsB,
// The skew.
0>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The GEMM configuration.
typename GemmConfig_,
/// The stream to load A from global memory to shared memory.
typename GlobalLoadStreamA_,
/// The stream to load B from global memory to shared memory.
typename GlobalLoadStreamB_,
/// The stream to load A from shared memory.
typename SharedLoadStreamA_,
/// The stream to load B from shared memory.
typename SharedLoadStreamB_,
/// The epilogue.
typename Epilogue_,
/// The block swizzle to reorganize the grid.
typename BlockSwizzle_ = IdentityBlockSwizzle,
/// The index.
typename Index_ = int,
/// The tool used to clear accumulators.
typename ClearAccumulators_ = ClearAccumulators<typename GemmConfig_::Accumulators::Scalar> >
struct GemmTraits {
/// The configuration.
typedef GemmConfig_ GemmConfig;
/// The output tile.
typedef typename GemmConfig::OutputTile OutputTile;
/// The stream to load A from global memory to shared memory.
typedef GlobalLoadStreamA_ GlobalLoadStreamA;
/// The layout of A.
static MatrixLayout::Kind const kLayoutA = GlobalLoadStreamA::kLayout;
/// The scalar for A.
typedef typename GlobalLoadStreamA_::Scalar ScalarA;
/// The stream to load B from global memory to shared memory.
typedef GlobalLoadStreamB_ GlobalLoadStreamB;
/// The layout of B.
static MatrixLayout::Kind const kLayoutB = GlobalLoadStreamB::kLayout;
/// The scalar for B.
typedef typename GlobalLoadStreamB_::Scalar ScalarB;
/// The iterator for A to load from shared memory.
typedef SharedLoadStreamA_ SharedLoadStreamA;
/// The iterator for B to load from shared memory.
typedef SharedLoadStreamB_ SharedLoadStreamB;
/// The shared storage for A.
typedef typename GlobalLoadStreamA::SharedStoreStorage SharedStoreStorageA;
// Make sure we did not mess up the size of the storage.
static_assert(sizeof(SharedStoreStorageA) == sizeof(typename SharedLoadStreamA::SharedStorage),
"");
/// The shared storage for B.
typedef typename GlobalLoadStreamB::SharedStoreStorage SharedStoreStorageB;
// Make sure we did not mess up the size of the storage.
static_assert(sizeof(SharedStoreStorageB) == sizeof(typename SharedLoadStreamB::SharedStorage),
"");
/// The multiply-add functor.
typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
/// The epilogue.
typedef Epilogue_ Epilogue;
/// The scalars in the epilogue.
typedef typename Epilogue::ScalarC ScalarC;
typedef typename Epilogue::ScalarD ScalarD;
/// The block swizzle to reorganize the grid.
typedef BlockSwizzle_ BlockSwizzle;
/// The index.
typedef Index_ Index;
/// Clear the accumulators.
typedef ClearAccumulators_ ClearAccumulators;
/// The params.
struct Params {
/// The dimensions of the GEMM.
Index m, n, k;
/// The params for the A stream.
typename GlobalLoadStreamA::Params global_stream_a;
/// The params for the B stream.
typename GlobalLoadStreamB::Params global_stream_b;
/// The params for the A stream from shared memory.
typename SharedLoadStreamA::Params shared_stream_a;
/// The params for the B stream from shared memory.
typename SharedLoadStreamB::Params shared_stream_b;
/// The params for the epilogue.
typename Epilogue::Params epilogue;
/// Initialize the parameters.
template <typename GemmDesc_>
CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
// Set the problem size.
this->m = desc.m;
this->n = desc.n;
this->k = desc.k;
// Initialize the iterator for A.
int error_code =
global_stream_a.initialize(reinterpret_cast<ScalarA const*>(desc.d_a), desc.lda);
if (error_code) {
return error_code;
}
// Initialize the iterator for B.
error_code = global_stream_b.initialize(reinterpret_cast<ScalarB const*>(desc.d_b), desc.ldb);
if (error_code) {
return error_code;
}
// The epilogue.
return epilogue.initialize(desc);
}
};
// The shared storage used by both the global and shared streams of one operand.
template <typename GlobalLoadStream_, typename SharedLoadStream_>
union StreamSharedStorage {
// The storage needed by the global stream.
typename GlobalLoadStream_::SharedStorage global;
// The storage needed by the shared stream.
typename SharedLoadStream_::SharedStorage shared;
};
// The storage for the main loop + prologue.
struct MainLoopSharedStorage {
// The storage to shuffle the A matrix in shared memory.
StreamSharedStorage<GlobalLoadStreamA, SharedLoadStreamA> stream_a;
// The storage to shuffle the B matrix in shared memory.
StreamSharedStorage<GlobalLoadStreamB, SharedLoadStreamB> stream_b;
// The storage to clear the accumulators if needed.
typename ClearAccumulators::SharedStorage clear;
};
/// The storage in shared memory.
union SharedStorage {
// The storage for the main loop.
MainLoopSharedStorage main_loop;
// The storage for the epilogue.
typename Epilogue::SharedStorage epilogue;
};
/// Assemble the global load streams for A/B.
struct GlobalLoadStream {
/// Ctor.
CUTLASS_DEVICE GlobalLoadStream(Params const& params,
SharedStorage& shared_storage,
dim3 const& block)
: stream_a(params.global_stream_a,
shared_storage.main_loop.stream_a.global,
cutlass::make_Coord(0, params.k, params.m),
cutlass::make_Coord(0, 0, block.x)),
stream_b(params.global_stream_b,
shared_storage.main_loop.stream_b.global,
cutlass::make_Coord(0, params.k, params.n),
make_Coord(0, 0, block.y)) {}
/// Trigger the loads from global memory.
CUTLASS_DEVICE void copy() {
stream_a.copy();
stream_b.copy();
}
/// Commit the data.
CUTLASS_DEVICE void commit() {
stream_a.commit();
stream_b.commit();
}
/// Execute the residue code.
CUTLASS_DEVICE void residue(Index k, bool skip_clear = false) {
stream_a.residue(k, skip_clear);
stream_b.residue(k, skip_clear);
}
/// The stream for A.
GlobalLoadStreamA stream_a;
/// The stream for B.
GlobalLoadStreamB stream_b;
};
/// Assemble the shared load stream for A/B.
struct SharedLoadStream {
/// Ctor.
CUTLASS_DEVICE SharedLoadStream(Params const& params, SharedStorage& shared_storage) {
stream_a.initialize(params.shared_stream_a, shared_storage.main_loop.stream_a.shared);
stream_b.initialize(params.shared_stream_b, shared_storage.main_loop.stream_b.shared);
}
/// Trigger the copies from shared memory to registers.
CUTLASS_DEVICE void copy(int step) {
stream_a.copy(step, fetched_a[step % 2]);
stream_b.copy(step, fetched_b[step % 2]);
}
/// Commit the data.
CUTLASS_DEVICE void commit(int step) {
stream_a.commit(fetched_a[step % 2], transformed_a[step % 2]);
stream_b.commit(fetched_b[step % 2], transformed_b[step % 2]);
}
/// The fragment A.
CUTLASS_DEVICE typename SharedLoadStreamA::Fragment const& fragment_a(int step) const {
return transformed_a[step % 2];
}
/// The fragment B.
CUTLASS_DEVICE typename SharedLoadStreamB::Fragment const& fragment_b(int step) const {
return transformed_b[step % 2];
}
/// Increment the stage.
CUTLASS_DEVICE void inc_stage() {
stream_a.inc_stage();
stream_b.inc_stage();
}
/// The stream for A.
SharedLoadStreamA stream_a;
/// The fragments to fetch A.
typename SharedLoadStreamA::FetchedFragment fetched_a[2];
/// The fragments to transform A.
typename SharedLoadStreamA::TransformedFragment transformed_a[2];
/// The stream for B.
SharedLoadStreamB stream_b;
/// The fragments to fetch B.
typename SharedLoadStreamB::FetchedFragment fetched_b[2];
/// The fragments to transform B.
typename SharedLoadStreamB::TransformedFragment transformed_b[2];
};
/// The memory fence for shared loads.
static CUTLASS_DEVICE void shared_load_fence(bool in_loop) {
if (SharedLoadStreamA::Iterator::kRequiresLoadFence ||
SharedLoadStreamB::Iterator::kRequiresLoadFence) {
__syncthreads();
}
}
/// The memory fence for shared stores.
static CUTLASS_DEVICE void shared_store_fence(bool in_loop) { __syncthreads(); }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmTileTraitsHelperA_, typename GemmTileTraitsHelperB_, typename Index_>
struct SimplifiedGemmTraitsHelper {
/// The global iterator to load A from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA_::GlobalTileTraits, Index_>
GlobalLoadIteratorA;
/// The data converter for A before storing to shared memory.
typedef Copy<typename GlobalLoadIteratorA::Fragment> GlobalTransformerA;
/// The iterator to store A to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperA_::SharedStoreTileTraits,
typename GemmTileTraitsHelperA_::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorA;
/// The stream to load A from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
GlobalLoadStreamA;
/// The global iterator to load B from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB_::GlobalTileTraits, Index_>
GlobalLoadIteratorB;
/// The data converter for B before storing to shared memory.
typedef Copy<typename GlobalLoadIteratorB::Fragment> GlobalTransformerB;
/// The iterator to store B to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperB_::SharedStoreTileTraits,
typename GemmTileTraitsHelperB_::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorB;
/// The stream to load B from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
GlobalLoadStreamB;
/// The iterator to load A from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperA_::SharedLoadTileTraits,
typename GemmTileTraitsHelperA_::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorA;
/// The stream to load A from shared memory.
typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
/// The iterator to load B from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperB_::SharedLoadTileTraits,
typename GemmTileTraitsHelperB_::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorB;
/// The stream to load B from shared memory.
typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The config for the GEMM.
typename GemmConfig_,
/// The epilogue.
typename Epilogue_,
/// The index.
typename Index_ = int,
// The configuration for the A matrix.
typename GemmTileTraitsHelperA_ = GemmTileTraitsHelperA<kLayoutA_, GemmConfig_>,
// The configuration for the B matrix.
typename GemmTileTraitsHelperB_ = GemmTileTraitsHelperB<kLayoutB_, GemmConfig_>,
// The helper class to create the streams and iterators.
typename Helper_ =
SimplifiedGemmTraitsHelper<GemmTileTraitsHelperA_, GemmTileTraitsHelperB_, Index_> >
struct SimplifiedGemmTraits : public GemmTraits<
// The config.
GemmConfig_,
// The stream to load A from global memory to shared memory.
typename Helper_::GlobalLoadStreamA,
// The stream to load B from global memory to shared memory.
typename Helper_::GlobalLoadStreamB,
// The stream to load A from shared memory.
typename Helper_::SharedLoadStreamA,
// The stream to load B from shared memory.
typename Helper_::SharedLoadStreamB,
// The epilogue.
Epilogue_,
// The block swizzle to reorganize the grid.
IdentityBlockSwizzle,
// The index.
Index_,
// The tool used to clear accumulators.
ClearAccumulators<typename GemmConfig_::Accumulators::Element> > {
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
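GemmTraits overlays the storage of each operand's global-to-shared store stage and shared-to-register load stage in a union, and the static_asserts above pin their sizes together. A minimal sketch of that idea, with hypothetical stage types standing in for the real iterator storage:

#include <cstdio>

struct GlobalStoreStage { float buf[2][128]; };  // double-buffered store view
struct SharedLoadStage  { float buf[256]; };     // flat load view of the same bytes

union StreamSharedStorage {
  GlobalStoreStage global;  // used while staging data into shared memory
  SharedLoadStage shared;   // used while feeding the math functor
};

static_assert(sizeof(GlobalStoreStage) == sizeof(SharedLoadStage),
              "store/load views of the shared tile must match");

int main() {
  std::printf("union size = %zu bytes\n", sizeof(StreamSharedStorage));
  return 0;
}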

@@ -1,90 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Tile traits used to construct global tile iterator for HGEMM. This is intended to
partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate
memory accesses larger than 16 bits.
*/
#pragma once
#include <cutlass/coord.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/matrix_traits.h>
#include <cutlass/reshape_tile.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <GemmOperand::Kind kOperand_,
MatrixLayout::Kind kLayout_,
typename Scalar_,
typename Tile_,
typename Threads_,
int kAccessSize_>
struct HgemmCrosswiseGlobalTileTraits : public GemmGlobalTileTraits<
// Which GEMM operand?
kOperand_,
// The layout.
kLayout_,
// The scalar.
Scalar_,
// The tile.
Tile_,
// The threads.
Threads_,
// The number of scalars per LDG/STG.
kAccessSize_> {
/// The base class.
typedef GemmGlobalTileTraits<kOperand_, kLayout_, Scalar_, Tile_, Threads_, kAccessSize_> Base;
/// The threads.
typedef typename Base::Threads Threads;
/// The threads strides.
typedef Shape<1, 2, Base::Tile::kC> ThreadsDelta;
/// The strides in each dimension between different loads/stores.
typedef Shape<Base::Threads::kH * 2, 1, Base::Threads::kW, Base::kAccessSize> Delta;
/// The number of iterations needed to load/store the tile.
typedef Shape<Base::Tile::kH / Base::Threads::kH / 2,
2,
Base::Tile::kW / Base::Threads::kW,
Base::Tile::kC / Base::kAccessSize>
Iterations;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
return make_Coord(0, thread_offset_h, thread_offset_w, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
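A host-side simulation of HgemmCrosswiseGlobalTileTraits::ThreadOffset with illustrative (hypothetical) constants, Threads = 4x8 and ThreadsDelta = (2, 1); each thread owns a two-row strip in H, which is why Iterations carries the extra factor-of-2 dimension:

#include <cstdio>

int main() {
  int const kThreadsW = 8, kDeltaH = 2, kDeltaW = 1;
  for (int tid = 0; tid < 32; ++tid) {
    int const h = tid / kThreadsW * kDeltaH;  // two H rows per thread
    int const w = tid % kThreadsW * kDeltaW;
    std::printf("tid %2d -> (h=%d, w=%d)\n", tid, h, w);
  }
  return 0;
}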

@@ -1,104 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Specialization implementing multiply-add operation on half-precision floating point
fragments.
*/
#pragma once
#include <cutlass/fragment.h>
#include <cutlass/gemm/thread_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Template performing matrix multiply-add operation within a thread
template <typename AccumulatorsPerThread_, typename ThreadsPerWarp_>
struct ThreadMultiplyAdd<AccumulatorsPerThread_, ThreadsPerWarp_, half, half, half> {
/// The shape of the instruction.
typedef Shape<1, 1, 2, 1> InstructionShape;
/// The number of accumulators per thread.
typedef AccumulatorsPerThread_ AccumulatorsPerThread;
/// The number of threads per warp.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of accumulators per warp.
typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
/// The type for A.
typedef half ScalarA;
/// The fragment for A.
typedef Fragment<ScalarA, AccumulatorsPerThread::kW> FragmentA;
/// The type for B.
typedef half ScalarB;
/// The fragment for B.
typedef Fragment<ScalarB, AccumulatorsPerThread::kH> FragmentB;
/// The type for C and D.
typedef half ScalarC;
/// The accumulators.
typedef Fragment<half, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW> Accumulators;
/// Make sure there's an even number of elements in both dimensions.
static_assert(AccumulatorsPerThread::kH % 2 == 0, "Invalid size");
static_assert(AccumulatorsPerThread::kW % 2 == 0, "Invalid size");
/// Ctor.
CUTLASS_DEVICE ThreadMultiplyAdd() {}
/// Multiply : d = a*b + c.
CUTLASS_DEVICE void multiply_add(FragmentA const& a,
FragmentB const& b,
Accumulators const& c,
Accumulators& d) {
#if defined(__CUDACC__) && __CUDA_ARCH__ >= 530
// The inputs.
__half2 const* a_half2 = reinterpret_cast<__half2 const*>(&a[0]);
__half2 const* b_half2 = reinterpret_cast<__half2 const*>(&b[0]);
__half2 const* c_half2 = reinterpret_cast<__half2 const*>(&c[0]);
// The output.
__half2* d_half2 = reinterpret_cast<__half2*>(&d[0]);
for (int j = 0; j < AccumulatorsPerThread::kH / 2; ++j) {
for (int i = 0; i < AccumulatorsPerThread::kW / 2; ++i) {
// The offsets in the output fragment.
int const k0 = (2 * j + 0) * (AccumulatorsPerThread::kW / 2) + i;
int const k1 = (2 * j + 1) * (AccumulatorsPerThread::kW / 2) + i;
// Compute the product a[i] * b[j].H0_H0.
d_half2[k0] = __hfma2(a_half2[i], __low2half2(b_half2[j]), c_half2[k0]);
// Compute the product a[i] * b[j].H1_H1.
d_half2[k1] = __hfma2(a_half2[i], __high2half2(b_half2[j]), c_half2[k1]);
}
}
#endif
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
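The paired __hfma2 calls above produce two W-adjacent accumulators at a time, broadcasting the low and high halves of each packed B value with __low2half2/__high2half2. A minimal CUDA sketch of that idiom, assuming a recent toolkit and an sm_53+ device (values are illustrative):

#include <cuda_fp16.h>
#include <cstdio>

__global__ void paired_fma(__half2 const *a, __half2 const *b, __half2 *d) {
  d[0] = __hfma2(a[0], __low2half2(b[0]), d[0]);   // a * b.lo -> row 2j+0
  d[1] = __hfma2(a[0], __high2half2(b[0]), d[1]);  // a * b.hi -> row 2j+1
}

int main() {
  __half2 *a, *b, *d;
  cudaMallocManaged(&a, sizeof(__half2));
  cudaMallocManaged(&b, sizeof(__half2));
  cudaMallocManaged(&d, 2 * sizeof(__half2));
  *a = __floats2half2_rn(1.0f, 2.0f);    // a values along W
  *b = __floats2half2_rn(10.0f, 20.0f);  // b values along H
  d[0] = d[1] = __floats2half2_rn(0.0f, 0.0f);
  paired_fma<<<1, 1>>>(a, b, d);
  cudaDeviceSynchronize();
  // Expect row0 = (10, 20) and row1 = (20, 40).
  std::printf("row0 = (%g, %g)  row1 = (%g, %g)\n",
              __low2float(d[0]), __high2float(d[0]),
              __low2float(d[1]), __high2float(d[1]));
  return 0;
}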

@@ -1,94 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Transposes a tile of 16b elements. Used by HGEMM to construct a K-strided layout in
shared memory for multiplicands.
*/
#pragma once
#include <cuda_fp16.h>
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GlobalIterator_>
struct HgemmSwizzle {
/// The global iterator.
typedef GlobalIterator_ GlobalIterator;
/// The source fragment.
typedef typename GlobalIterator::Fragment Fragment;
/// The shape of the source fragment.
typedef typename GlobalIterator::FragmentShape FragmentShape;
/// The input fragment.
typedef Fragment InputFragment;
/// The output fragment.
typedef Fragment OutputFragment;
/// The src/dst must be half fragments.
static_assert((platform::is_same<typename Fragment::Element, half>::value), "Works on half");
/// The number of elements must be a multiple of 2.
static_assert(FragmentShape::kH == 2 && ShapeCount<FragmentShape>::kWc == 2, "Not multiple of 2");
/// Ctor.
CUTLASS_DEVICE HgemmSwizzle() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(Fragment const& src, Fragment& dst) {
// Expose src/dst as int arrays.
int const* src_int = reinterpret_cast<int const*>(&src[0]);
int* dst_int = reinterpret_cast<int*>(&dst[0]);
// Transpose the data.
for (int d = 0; d < FragmentShape::kD; ++d) {
// The indices to read two consecutive "rows".
int const i0 = 2 * d + 0;
int const i1 = 2 * d + 1;
int a0 = src_int[i0];
int a1 = src_int[i1];
int b0, b1;
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b0) : "r"(a0), "r"(a1));
asm volatile("prmt.b32 %0, %1, %2, 0x7632;" : "=r"(b1) : "r"(a0), "r"(a1));
// The indices to store with "strides".
int const j0 = 0 * (ShapeCount<FragmentShape>::kDhw / 2) + d;
int const j1 = 1 * (ShapeCount<FragmentShape>::kDhw / 2) + d;
dst_int[j0] = b0;
dst_int[j1] = b1;
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
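The two prmt.b32 selectors are what make this a transpose: 0x5410 gathers the low 16-bit halves of the two packed words and 0x7632 gathers the high halves. A host-side model of the byte permute (ignoring the selector's sign-replication mode bit, which these selectors do not use):

#include <cstdint>
#include <cstdio>

uint32_t prmt(uint32_t a, uint32_t b, uint32_t sel) {
  uint8_t pool[8];  // 8-byte pool: a holds bytes 0-3, b holds bytes 4-7
  for (int i = 0; i < 4; ++i) pool[i] = uint8_t(a >> (8 * i));
  for (int i = 0; i < 4; ++i) pool[4 + i] = uint8_t(b >> (8 * i));
  uint32_t r = 0;
  for (int i = 0; i < 4; ++i)  // each selector nibble picks one pool byte
    r |= uint32_t(pool[(sel >> (4 * i)) & 0x7]) << (8 * i);
  return r;
}

int main() {
  uint32_t const a0 = 0x11110000u;  // row 0: halfs 0x0000, 0x1111
  uint32_t const a1 = 0x33332222u;  // row 1: halfs 0x2222, 0x3333
  std::printf("b0 = 0x%08x (expect 0x22220000)\n", prmt(a0, a1, 0x5410));
  std::printf("b1 = 0x%08x (expect 0x33331111)\n", prmt(a0, a1, 0x7632));
  return 0;
}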

@@ -1,391 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of half-precision GEMM computation.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/gemm_epilogue.h>
#include <cutlass/gemm/gemm_epilogue_traits.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/gemm/gemm_shared_tile.h>
#include <cutlass/gemm/gemm_traits.h>
#include <cutlass/gemm/hgemm_global_tile.h>
#include <cutlass/gemm/hgemm_multiply_add.h>
#include <cutlass/gemm/hgemm_swizzle.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The tile size for the GEMM KxNxM.
typename OutputTile_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_ = 2,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_ = 2>
struct HgemmConfig
: public GemmConfig<
/// The scalar type for A.
half,
/// The scalar type for B.
half,
/// The scalar type for C.
half,
/// The scalar type for D.
half,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, half, half, half>,
/// The number of scalars per LDG for A.
kScalarsPerLdgA_,
/// The number of scalars per STS for A.
kScalarsPerLdgA_,
/// The number of scalars per LDS for A.
8,
/// The number of scalars per LDG for B.
kScalarsPerLdgB_,
/// The number of scalars per STS for B.
kScalarsPerLdgB_,
/// The number of scalars per LDS for B.
8,
/// The number of scalars per LDG for C and STG for D.
2,
/// The number of scalars per STS for D.
8,
/// The number of scalars per LDS for D.
2,
/// The number of stages in shared memory.
2> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
struct HgemmTransformerA {};
template <typename Iterator_>
struct HgemmTransformerA<MatrixLayout::kColumnMajor, Iterator_> {
typedef Convert<typename Iterator_::Fragment, typename Iterator_::Fragment> Transformer;
};
template <typename Iterator_>
struct HgemmTransformerA<MatrixLayout::kRowMajor, Iterator_> {
typedef HgemmSwizzle<Iterator_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
struct HgemmTransformerB {};
template <typename Iterator_>
struct HgemmTransformerB<MatrixLayout::kRowMajor, Iterator_> {
typedef Convert<typename Iterator_::Fragment, typename Iterator_::Fragment> Transformer;
};
template <typename Iterator_>
struct HgemmTransformerB<MatrixLayout::kColumnMajor, Iterator_> {
typedef HgemmSwizzle<Iterator_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct HgemmTileTraitsHelperA : public GemmTileTraitsHelperA<kLayout_, GemmConfig_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct HgemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_>
: public GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> Base;
/// The traits class to build the iterator to load data from global memory for A^T.
typedef HgemmCrosswiseGlobalTileTraits<
GemmOperand::kA,
// The layout.
MatrixLayout::kRowMajor,
// The pointer.
half const,
// The tile has size MxK in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc)
GemmConfig_::kScalarsPerLdgA>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for A^T.
typedef GemmSharedStoreWithSkewTileAbTraits<
// The pointer.
half,
// The tile has size KxM in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kW * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as warps x 32(the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
2,
// The skew to avoid bank conflicts added in the tile W dimension.
128 / sizeof(half) / GlobalTileTraits::Threads::kW / 2>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for A^T.
typedef GemmSharedLoadTileATraits<
// The pointer.
half const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
8,
// The skew.
SharedStoreTileTraits::kSkew>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct HgemmTileTraitsHelperB : public GemmTileTraitsHelperB<kLayout_, GemmConfig_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct HgemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_>
: public GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> Base;
/// The traits class to build the iterator to load data from global memory for B^N.
typedef HgemmCrosswiseGlobalTileTraits<
GemmOperand::kB,
// The layout.
MatrixLayout::kColumnMajor,
// The pointer.
half const,
// The tile has size KxN in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc)
GemmConfig_::kScalarsPerLdgB>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for B^N.
typedef GemmSharedStoreWithSkewTileAbTraits<
// The pointer.
half,
// The tile has size KxN in GEMM's terminology.
Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD / GemmConfig_::InstructionShape::kD,
GemmConfig_::OutputTile::kH * GemmConfig_::InstructionShape::kD>,
// The threads are distributed as (threads / K) x K (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
2,
// The skew to avoid bank conflicts added in the tile W dimension.
128 / sizeof(half) / GlobalTileTraits::Threads::kW / 2>
SharedStoreTileTraits;
/// The traits class to build the iterator to load from shared memory for B^N.
typedef GemmSharedLoadTileBTraits<
// The pointer.
half const,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The number of threads per warp.
typename GemmConfig_::MultiplyAdd::ThreadsPerWarp,
// The shape of the FMA instruction.
typename GemmConfig_::InstructionShape,
// The number of stages.
GemmConfig_::kStages,
// The number of scalars per LDS.
8,
// The skew.
SharedStoreTileTraits::kSkew>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
/// The number of halfs loaded in one LDG for A.
int kScalarsPerLdgA_ = 2,
/// The number of halfs loaded in one LDG for B.
int kScalarsPerLdgB_ = 2,
/// The index.
typename Index_ = int>
struct HgemmTraitsHelper {
/// The HGEMM config.
typedef HgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>
GemmConfig;
/// The GEMM config for A.
typedef HgemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
/// The GEMM config for B.
typedef HgemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
/// The iterator to load A from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
GlobalLoadIteratorA;
/// The default transformer for A.
typedef typename HgemmTransformerA<GemmTileTraitsHelperA::kLayout,
GlobalLoadIteratorA>::Transformer GlobalTransformerA;
/// The iterator to store A to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorA;
/// The stream to load A from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
GlobalLoadStreamA;
/// The iterator to load B from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
GlobalLoadIteratorB;
// The default transformer for B.
typedef typename HgemmTransformerB<GemmTileTraitsHelperB::kLayout,
GlobalLoadIteratorB>::Transformer GlobalTransformerB;
/// The iterator to store B to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorB;
/// The stream to load B from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
GlobalLoadStreamB;
/// The iterator to load A from shared memory
typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorA;
/// The stream to load A from shared memory.
typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
/// The iterator to load B from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorB;
/// The stream to load B from shared memory.
typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
/// The functor to do the multiply-add in the main loop.
typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
/// The object to clear accumulators.
typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
/// The traits class for the epilogue.
typedef SimplifiedGemmEpilogueTraits<GemmConfig, EpilogueFunctor_, Index_> GemmEpilogueTraits;
/// The epilogue.
typedef GemmEpilogue<GemmEpilogueTraits> Epilogue;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_ = Shape<8, 128, 128>,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_ = LinearScaling<half>,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<8, 8, 16>,
/// The number of halfs loaded in one LDG for A.
int kScalarsPerLdgA_ = 2,
/// The number of halfs loaded in one LDG for B.
int kScalarsPerLdgB_ = 2,
/// The index.
typename Index_ = int,
/// The helper class.
typename Helper_ = HgemmTraitsHelper<kLayoutA_,
kLayoutB_,
OutputTile_,
EpilogueFunctor_,
AccumulatorsPerThread_,
kScalarsPerLdgA_,
kScalarsPerLdgB_,
Index_> >
struct HgemmTraits : public GemmTraits<
// The config.
typename Helper_::GemmConfig,
// The stream to load A from global memory to shared memory.
typename Helper_::GlobalLoadStreamA,
// The stream to load B from global memory to shared memory.
typename Helper_::GlobalLoadStreamB,
// The stream to load A from shared memory.
typename Helper_::SharedLoadStreamA,
// The stream to load B from shared memory.
typename Helper_::SharedLoadStreamB,
// The epilogue.
typename Helper_::Epilogue,
// The block swizzle to reorganize the grid.
IdentityBlockSwizzle,
// The index.
Index_,
// The tool used to clear accumulators.
typename Helper_::ClearAccumulators> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
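Pinning down the two operand layouts is enough to instantiate these traits; every other parameter defaults as declared above. A hypothetical usage sketch, assuming only the template parameters shown in this file:

// Hypothetical instantiation: column-major A and B with the default
// 8x128x128 output tile and linear-scaling epilogue.
typedef cutlass::gemm::HgemmTraits<
    cutlass::MatrixLayout::kColumnMajor,   // layout of A
    cutlass::MatrixLayout::kColumnMajor >  // layout of B
    HgemmTraitsColumnMajor;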

View File

@@ -1,48 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines functors for mapping blockIdx to partitions of the GEMM computation.
Currently, we only implement an identity mapping.
*/
#pragma once
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
struct IdentityBlockSwizzle {
/// Ctor.
CUTLASS_DEVICE IdentityBlockSwizzle() {}
/// Swizzle the block index.
CUTLASS_DEVICE dim3 swizzle() { return blockIdx; }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass

View File

@@ -1,320 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and
floating-point output matrix formats.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/fragment.h>
#include <cutlass/gemm/gemm_global_stream.h>
#include <cutlass/gemm/gemm_shared_stream.h>
#include <cutlass/gemm/igemm_global_tile.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/tile_iterator.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <int kElements_>
struct IgemmFloatToInt8Converter {
/// The input fragment.
typedef Fragment<float, kElements_> InputFragment;
/// The output fragment.
typedef Fragment<int8_t, kElements_> OutputFragment;
// We are packing 4 floats into int32 registers so we need kElements to be a multiple of 4.
static_assert(kElements_ % 4 == 0, "kElements must be multiple of 4");
/// Ctor.
CUTLASS_DEVICE IgemmFloatToInt8Converter() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
transform(src, 0, dst);
}
/// Transform a fragment.
template <typename Fragment_>
CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
// The inputs.
float4 const* src_f4 = reinterpret_cast<float4 const*>(&src[0]);
// The outputs.
int* dst_int = reinterpret_cast<int*>(&dst[0]);
// Iterate over the floats and pack them together to produce ints.
for (int i = 0; i < kElements_ / 4; ++i) {
// Read the float4.
float4 f4 = src_f4[i];
// Clamp the 4 elements of the floats to the [-128, +127] range.
float x = fmaxf(-128.f, fminf(127.f, f4.x));
float y = fmaxf(-128.f, fminf(127.f, f4.y));
float z = fmaxf(-128.f, fminf(127.f, f4.z));
float w = fmaxf(-128.f, fminf(127.f, f4.w));
// Convert to integers.
int ix = (int)x;
int iy = (int)y;
int iz = (int)z;
int iw = (int)w;
// Extract the lower bytes to build an int32 with 4 int8.
asm volatile("prmt.b32 %0, %0, %1, 0x1140;" : "+r"(ix) : "r"(iy));
asm volatile("prmt.b32 %0, %0, %1, 0x1140;" : "+r"(iz) : "r"(iw));
asm volatile("prmt.b32 %0, %0, %1, 0x5410;" : "+r"(ix) : "r"(iz));
// Store the int.
dst_int[i] = ix;
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename InputScalar_, typename OutputFragment_>
struct IgemmGlobalStoreTransformer {
typedef Convert<Fragment<InputScalar_, OutputFragment_::kElements>, OutputFragment_> Transformer;
};
template <int kElements_>
struct IgemmGlobalStoreTransformer<float, Fragment<int8_t, kElements_> > {
typedef IgemmFloatToInt8Converter<kElements_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <int kElements_>
struct IgemmInt8ToFloatConverter {
/// The input fragment.
typedef Fragment<int8_t, kElements_> InputFragment;
/// The output fragment.
typedef Fragment<float, kElements_> OutputFragment;
// We are unpacking 4 int8s from int32.
static_assert(kElements_ % 4 == 0, "kElements must be multiple of 4");
/// Ctor.
CUTLASS_DEVICE IgemmInt8ToFloatConverter() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(InputFragment const& src, OutputFragment& dst) {
transform(src, 0, dst);
}
/// Transform a fragment.
template <typename Fragment_>
CUTLASS_DEVICE void transform(Fragment_ const& src, int offset, OutputFragment& dst) {
// The inputs.
int const* src_int = reinterpret_cast<int const*>(&src[0]);
// The outputs.
float4* dst_f4 = reinterpret_cast<float4*>(&dst[0]);
// Iterate over the int8 and unpack them together to produce floats.
for (int i = 0; i < kElements_ / 4; ++i) {
// Read the packed int into iw; ix, iy and iz receive the extracted bytes below.
int ix, iy, iz, iw = src_int[i];
// Extract the 4 bytes.
asm volatile("prmt.b32 %0, 0x0, %1, 0x4440;" : "=r"(ix) : "r"(iw));
asm volatile("prmt.b32 %0, 0x0, %1, 0x4441;" : "=r"(iy) : "r"(iw));
asm volatile("prmt.b32 %0, 0x0, %1, 0x4442;" : "=r"(iz) : "r"(iw));
asm volatile("prmt.b32 %0, 0x0, %1, 0x4443;" : "=r"(iw) : "r"(iw));
// The floats.
float fx, fy, fz, fw;
// Convert to floats (make sure we generate I2F.F32.S8).
asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fx) : "r"(ix));
asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fy) : "r"(iy));
asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fz) : "r"(iz));
asm volatile("cvt.rn.f32.s8 %0, %1;" : "=f"(fw) : "r"(iw));
// Store the float4.
dst_f4[i] = make_float4(fx, fy, fz, fw);
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
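Taken together, the two converters clamp, truncate, and pack four floats into one 32b word on the way out, then sign-extend the four bytes back to floats on the way in. A portable round-trip sketch (hypothetical standalone code that mirrors the data movement with shifts in place of the prmt sequences):

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Clamp to [-128, 127], truncate, and keep only the low byte of each int.
static uint32_t pack4(const float f[4]) {
  uint32_t packed = 0;
  for (int i = 0; i < 4; ++i) {
    float clamped = std::max(-128.f, std::min(127.f, f[i]));
    packed |= (uint32_t(int32_t(clamped)) & 0xff) << (8 * i);
  }
  return packed;
}

// Extract each byte and sign-extend it back to a float.
static void unpack4(uint32_t packed, float f[4]) {
  for (int i = 0; i < 4; ++i)
    f[i] = float(int8_t((packed >> (8 * i)) & 0xff));
}

int main() {
  float in[4] = {1.f, -2.f, 300.f, -300.f}, out[4];
  unpack4(pack4(in), out);
  printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 1 -2 127 -128
  return 0;
}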
template <typename InputFragment_, typename OutputScalar_>
struct IgemmGlobalLoadTransformer {
typedef Convert<InputFragment_, Fragment<OutputScalar_, InputFragment_::kElements> > Transformer;
};
template <int kElements_>
struct IgemmGlobalLoadTransformer<Fragment<int8_t, kElements_>, float> {
typedef IgemmInt8ToFloatConverter<kElements_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename InputScalar_, typename OutputFragment_>
struct IgemmSharedStoreTransformer {
typedef Convert<Fragment<InputScalar_, OutputFragment_::kElements>, OutputFragment_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename IgemmConfig_, typename EpilogueFunctor_, typename Index_>
struct IgemmEpilogueTraitsHelper
: public GemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> {
/// The base class.
typedef GemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> Base;
/// The config.
typedef IgemmConfig_ IgemmConfig;
/// The scalar type of the epilogue.
typedef typename Base::Scalar Scalar;
/// The iterations.
typedef typename Base::Iterations Iterations;
/// The iterations strides.
typedef typename Base::Delta Delta;
/// The traits class for the iterator.
typedef typename Base::GlobalLoadTileTraits GlobalLoadTileTraits;
/// The iterator to load C from global memory.
typedef GemmGlobalIteratorCd<GlobalLoadTileTraits> GlobalLoadIteratorC;
/// The fragment that needs to be produced by the load iterator.
typedef typename GlobalLoadIteratorC::Fragment GlobalFragmentC;
/// The transformer from loaded data to math fragment.
typedef
typename IgemmGlobalLoadTransformer<GlobalFragmentC, Scalar>::Transformer GlobalTransformerC;
/// The traits class for the iterator.
typedef typename Base::GlobalStoreTileTraits GlobalStoreTileTraits;
/// The iterator to store D to global memory.
typedef GemmGlobalIteratorCd<GlobalStoreTileTraits> GlobalStoreIteratorD;
/// The fragment that needs to be passed to that store iterator.
typedef typename GlobalStoreIteratorD::Fragment GlobalFragmentD;
/// The transformer from accumulators to shared memory fragments.
typedef
typename IgemmGlobalStoreTransformer<Scalar, GlobalFragmentD>::Transformer GlobalTransformerD;
/// The traits class for the shared iterator to store D to shared memory.
typedef typename Base::SharedStoreTileTraits SharedStoreTileTraits;
/// The shared iterator to store D to shared memory.
typedef TileStoreIterator<SharedStoreTileTraits,
typename SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kGlobal>
SharedStoreIteratorD;
/// The fragment that needs to be passed to that store iterator.
typedef typename SharedStoreIteratorD::Fragment SharedStoreFragmentD;
/// The transformer from accumulators to shared memory fragments.
typedef typename IgemmSharedStoreTransformer<typename IgemmConfig::Accumulators::Element,
SharedStoreFragmentD>::Transformer
SharedStoreTransformerD;
/// The traits class for the shared iterator to load D from shared memory.
typedef typename Base::SharedLoadTileTraits SharedLoadTileTraits;
/// The shared iterator to load D from shared memory.
typedef TileLoadIterator<SharedLoadTileTraits,
typename SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorD;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The config.
typename IgemmConfig_,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_,
/// The index.
typename Index_ = int,
/// The helper class to assemble the traits.
typename Helper_ = IgemmEpilogueTraitsHelper<IgemmConfig_, EpilogueFunctor_, Index_> >
struct IgemmEpilogueTraits : public GemmEpilogueTraits<
// The output tile.
typename IgemmConfig_::OutputTile,
// The accumulators.
typename IgemmConfig_::Accumulators,
// The global iterator for C.
typename Helper_::GlobalLoadIteratorC,
// The transformer for C.
typename Helper_::GlobalTransformerC,
// The transformer for D.
typename Helper_::GlobalTransformerD,
// The global iterator for D.
typename Helper_::GlobalStoreIteratorD,
// The iterator to store D to shared memory.
typename Helper_::SharedStoreIteratorD,
// The shared store transformer for D.
typename Helper_::SharedStoreTransformerD,
// The iterator to load D from shared memory.
typename Helper_::SharedLoadIteratorD,
// The iterations.
typename Helper_::Iterations,
// The strides between iterations.
typename Helper_::Delta,
// The functor to be used in the epilogue.
EpilogueFunctor_,
// The index.
Index_> {
/// Do we output in int8?
static bool const kInt8Output =
platform::is_same<typename IgemmConfig_::ScalarC, int8_t>::value != 0;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmEpilogueTraits_, bool = GemmEpilogueTraits_::kInt8Output>
struct IgemmEpilogue : public GemmEpilogue<GemmEpilogueTraits_> {
/// The base class.
typedef GemmEpilogue<GemmEpilogueTraits_> Base;
/// Ctor.
CUTLASS_DEVICE IgemmEpilogue(typename Base::Params const& params_,
typename Base::SharedStorage& shared_storage_,
typename Base::Index m_,
typename Base::Index n_)
: Base(params_, shared_storage_, m_, n_) {}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmEpilogueTraits_>
struct IgemmEpilogue<GemmEpilogueTraits_, true> : public GemmEpilogue<GemmEpilogueTraits_> {
/// The base class.
typedef GemmEpilogue<GemmEpilogueTraits_> Base;
/// Ctor.
CUTLASS_DEVICE IgemmEpilogue(typename Base::Params const& params_,
typename Base::SharedStorage& shared_storage_,
typename Base::Index m_,
typename Base::Index n_)
: Base(params_, shared_storage_, m_, n_) {}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass

View File

@@ -1,95 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements tile iterators to partition the thread block tile into 2D subtiles and
efficiently load each. Applies permute transformation to construct 'interleaved K-strided'
data layout in which 4-element dot products from the same K index are arranged in consecutive
locations within shared memory.
Supports efficient loads from shared memory to target the DP4A instruction.
*/
#pragma once
#include <cutlass/coord.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/matrix_traits.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <GemmOperand::Kind kOperand_,
MatrixLayout::Kind kLayout_,
typename Scalar_,
typename Tile_,
typename Threads_,
int kAccessSize_>
struct IgemmContiguousGlobalTileTraits : public GemmGlobalTileTraits<
// Which GEMM operand?
kOperand_,
// The layout.
kLayout_,
// The scalar.
Scalar_,
// The tile.
Tile_,
// The threads.
Threads_,
// The number of scalars per LDG/STG.
kAccessSize_> {
/// The base class.
typedef GemmGlobalTileTraits<kOperand_, kLayout_, Scalar_, Tile_, Threads_, kAccessSize_> Base;
/// The threads.
typedef typename Base::Threads Threads;
/// The strides in each dimension between different loads/stores.
typedef Shape<Base::Threads::kH * 4, 1, Base::Threads::kW, Base::kAccessSize> Delta;
/// The number of iterations needed to load/store the tile.
typedef Shape<Base::Tile::kH / Base::Threads::kH / 4,
4,
Base::Tile::kW / Base::Threads::kW,
Base::Tile::kC / Base::kAccessSize>
Iterations;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int thread_offset_h = threadIdx.x / Threads::kW * ThreadsDelta::kH;
int thread_offset_w = threadIdx.x % Threads::kW * ThreadsDelta::kW;
return make_Coord(0, thread_offset_h, thread_offset_w, 0);
}
};
public:
/// The threads strides.
typedef Shape<1, 4, Base::Tile::kC> ThreadsDelta;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
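As a concrete check of the Iterations arithmetic above, consider a hypothetical configuration; the numbers below are illustrative only, not taken from a shipped configuration.

// Hypothetical shapes, only to exercise the arithmetic above.
constexpr int kTileH = 32, kTileW = 16, kTileC = 4;  // reshaped tile extents
constexpr int kThreadsH = 4, kThreadsW = 16;         // thread arrangement
constexpr int kAccess = 4;                           // int8 scalars per LDG
// Iterations = <H / ThreadsH / 4, 4, W / ThreadsW, C / Access>
static_assert(kTileH / kThreadsH / 4 == 2, "two groups of four H-steps");
static_assert(kTileW / kThreadsW == 1 && kTileC / kAccess == 1, "one W/C step");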

View File

@@ -1,89 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements matrix multiply accumulate operation of 8-bit integer data using DP4A
instruction.
*/
#pragma once
#include <cutlass/fragment.h>
#include <cutlass/gemm/thread_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Template performing matrix multiply-add operation within a thread
template <typename AccumulatorsPerThread_, typename ThreadsPerWarp_>
struct ThreadMultiplyAdd<AccumulatorsPerThread_, ThreadsPerWarp_, int8_t, int8_t, int> {
/// The shape of the instruction.
typedef Shape<4, 1, 1> InstructionShape;
/// The number of accumulators per thread.
typedef AccumulatorsPerThread_ AccumulatorsPerThread;
/// The number of threads per warp.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of accumulators per warp.
typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
/// The type for A.
typedef int8_t ScalarA;
/// The fragment for A.
typedef Fragment<ScalarA, AccumulatorsPerThread::kW * 4> FragmentA;
/// The type for B.
typedef int8_t ScalarB;
/// The fragment for B.
typedef Fragment<ScalarB, AccumulatorsPerThread::kH * 4> FragmentB;
/// The type for C and D.
typedef int ScalarC;
/// The accumulators.
typedef Fragment<ScalarC, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW> Accumulators;
/// Ctor.
CUTLASS_DEVICE ThreadMultiplyAdd() {}
/// Multiply : d = a*b + c.
CUTLASS_DEVICE void multiply_add(FragmentA const& a,
FragmentB const& b,
Accumulators const& c,
Accumulators& d) {
// The inputs.
int const* a_int = reinterpret_cast<int const*>(&a[0]);
int const* b_int = reinterpret_cast<int const*>(&b[0]);
for (int j = 0; j < AccumulatorsPerThread::kH; ++j) {
for (int i = 0; i < AccumulatorsPerThread::kW; ++i) {
asm volatile("dp4a.s32.s32 %0, %1, %2, %3;"
: "=r"(d[j * AccumulatorsPerThread::kW + i])
: "r"(a_int[i]), "r"(b_int[j]), "r"(c[j * AccumulatorsPerThread::kW + i]));
}
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
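Each dp4a.s32.s32 above folds four int8 products into one 32-bit accumulator. A scalar reference, as a hypothetical standalone sketch:

#include <cstdint>
#include <cstdio>

// d = c + sum of the four signed 8-bit lane products of a and b.
static int32_t dp4a_s32(int32_t a, int32_t b, int32_t c) {
  int32_t d = c;
  for (int i = 0; i < 4; ++i) {
    int8_t ai = int8_t((uint32_t(a) >> (8 * i)) & 0xff);
    int8_t bi = int8_t((uint32_t(b) >> (8 * i)) & 0xff);
    d += int32_t(ai) * int32_t(bi);
  }
  return d;
}

int main() {
  // a = {1, 2, 3, 4}, b = {5, 6, 7, 8} packed little-endian: dot = 70.
  int32_t a = 0x04030201, b = 0x08070605;
  printf("%d\n", dp4a_s32(a, b, 0));  // prints 70
  return 0;
}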

View File

@@ -1,115 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Transposes a fragment of data containing packed 8-bit integer elements.
*/
#pragma once
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GlobalIterator_>
struct IgemmSwizzle {
/// The global iterator.
typedef GlobalIterator_ GlobalIterator;
/// The source fragment.
typedef typename GlobalIterator::Fragment Fragment;
/// The shape of the source fragment.
typedef typename GlobalIterator::FragmentShape FragmentShape;
/// The input fragment.
typedef Fragment InputFragment;
/// The destination fragment.
typedef Fragment OutputFragment;
/// The src/dst must be int8 fragments.
static_assert((platform::is_same<typename Fragment::Element, int8_t>::value), "Works on int8");
/// The number of elements must be a multiple of 4.
static_assert(FragmentShape::kH % 4 == 0 && ShapeCount<FragmentShape>::kWc % 4 == 0,
"Not multiple of 4");
/// Ctor.
CUTLASS_DEVICE IgemmSwizzle() {}
/// Transform a fragment.
CUTLASS_DEVICE void transform(Fragment const& src, Fragment& dst) {
// Expose src/dst as int arrays.
int const* src_int = reinterpret_cast<int const*>(&src[0]);
int* dst_int = reinterpret_cast<int*>(&dst[0]);
// Transpose the data.
for (int d = 0; d < FragmentShape::kD; ++d) {
for (int h = 0; h < FragmentShape::kH / 4; ++h) {
for (int w = 0; w < ShapeCount<FragmentShape>::kWc / 4; ++w) {
int const i0 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
(4 * h + 0) * (ShapeCount<FragmentShape>::kWc / 4) + w;
int const i1 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
(4 * h + 1) * (ShapeCount<FragmentShape>::kWc / 4) + w;
int const i2 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
(4 * h + 2) * (ShapeCount<FragmentShape>::kWc / 4) + w;
int const i3 = d * (ShapeCount<FragmentShape>::kHwc / 4) +
(4 * h + 3) * (ShapeCount<FragmentShape>::kWc / 4) + w;
int a0 = src_int[i0];
int a1 = src_int[i1];
int a2 = src_int[i2];
int a3 = src_int[i3];
int b0, b1, b2, b3, c0;
asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(b0) : "r"(a0), "r"(a1));
asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(c0) : "r"(a2), "r"(a3));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b0) : "r"(b0), "r"(c0));
asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(b1) : "r"(a0), "r"(a1));
asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(c0) : "r"(a2), "r"(a3));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b1) : "r"(b1), "r"(c0));
asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(b2) : "r"(a0), "r"(a1));
asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(c0) : "r"(a2), "r"(a3));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b2) : "r"(b2), "r"(c0));
asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(b3) : "r"(a0), "r"(a1));
asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(c0) : "r"(a2), "r"(a3));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(b3) : "r"(b3), "r"(c0));
dst_int[i0] = b0;
dst_int[i1] = b1;
dst_int[i2] = b2;
dst_int[i3] = b3;
}
}
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
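Each inner step above gathers byte k of the four source words into word k, i.e. a 4x4 byte transpose, so the four int8 values of one dot product land contiguously. A portable sketch (hypothetical standalone code using shifts in place of the prmt sequence):

#include <cstdint>
#include <cstdio>

// Transpose a 4x4 byte matrix stored as four little-endian 32-bit words.
static void transpose4x4(const uint32_t a[4], uint32_t b[4]) {
  for (int r = 0; r < 4; ++r) {
    uint32_t out = 0;
    // Byte r of input word c becomes byte c of output word r.
    for (int c = 0; c < 4; ++c) out |= ((a[c] >> (8 * r)) & 0xff) << (8 * c);
    b[r] = out;
  }
}

int main() {
  uint32_t a[4] = {0x33221100u, 0x77665544u, 0xbbaa9988u, 0xffeeddccu}, b[4];
  transpose4x4(a, b);
  for (int i = 0; i < 4; ++i) printf("0x%08x\n", b[i]);
  // expect 0xcc884400, 0xdd995511, 0xeeaa6622, 0xffbb7733
  return 0;
}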

View File

@@ -1,393 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed
to be packed 8-bit integers, accumulators are assumed to be 32b signed integers, and output
formats vary.
*/
#pragma once
#include <cutlass/convert.h>
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/gemm_epilogue.h>
#include <cutlass/gemm/gemm_epilogue_traits.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/gemm/gemm_shared_tile.h>
#include <cutlass/gemm/gemm_traits.h>
#include <cutlass/gemm/igemm_epilogue.h>
#include <cutlass/gemm/igemm_global_tile.h>
#include <cutlass/gemm/igemm_multiply_add.h>
#include <cutlass/gemm/igemm_swizzle.h>
#include <cutlass/reshape_tile.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The tile size for the GEMM KxNxM.
typename OutputTile_,
/// The output type.
typename ScalarD_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_>
struct IgemmConfig
: public GemmConfig<
/// The scalar type for A.
int8_t,
/// The scalar type for B.
int8_t,
/// The scalar type for C.
ScalarD_,
/// The scalar type for D.
ScalarD_,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, int8_t, int8_t, int>,
/// The number of scalars per LDG for A.
4,
/// The number of scalars per STS for A.
4,
/// The number of scalars per LDS for A.
16,
/// The number of scalars per LDG for B.
4,
/// The number of scalars per STS for B.
4,
/// The number of scalars per LDS for B.
16,
/// The number of scalars per LDG for C and STG for D.
1,
/// The number of scalars per STS for D.
4,
/// The number of scalars per LDS for D.
1,
/// The number of stages in shared memory.
2> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename OutputTile_, typename AccumulatorsPerThread_>
struct IgemmConfig<OutputTile_, int8_t, AccumulatorsPerThread_>
: public GemmConfig<
/// The scalar type for A.
int8_t,
/// The scalar type for B.
int8_t,
/// The scalar type for C.
int8_t,
/// The scalar type for D.
int8_t,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, int8_t, int8_t, int>,
/// The number of scalars per LDG for A.
4,
/// The number of scalars per STS for A.
4,
/// The number of scalars per LDS for A.
16,
/// The number of scalars per LDG for B.
4,
/// The number of scalars per STS for B.
4,
/// The number of scalars per LDS for B.
16,
/// The number of scalars per LDG for C and STG for D.
4,
/// The number of scalars per STS for D.
4,
/// The number of scalars per LDS for D.
4,
/// The number of stages in shared memory.
2> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct IgemmTileTraitsHelperA : public GemmTileTraitsHelperA<kLayout_, GemmConfig_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct IgemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_>
: public GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> Base;
/// The number of scalars per LDG/STS/LDS for A.
static int const kScalarsPerStsA = 16;
/// The traits class to build the iterator to load data from global memory for A^N.
typedef IgemmContiguousGlobalTileTraits<
GemmOperand::kA,
// The layout.
MatrixLayout::kColumnMajor,
// The pointer is int8_t const.
int8_t const,
// The tile has size KxM in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
4>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for A^N.
typedef GemmSharedStoreTileAbTraits<
// The pointer is int8_t.
int8_t,
// The tile has size KxM in GEMM's terminology.
Shape<GemmConfig_::kStages, GemmConfig_::OutputTile::kD / 4, GemmConfig_::OutputTile::kW * 4>,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
kScalarsPerStsA>
SharedStoreTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct IgemmTileTraitsHelperB : public GemmTileTraitsHelperB<kLayout_, GemmConfig_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct IgemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_>
: public GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> Base;
/// The number of scalars per LDG/STS/LDS for B.
static int const kScalarsPerStsB = 16;
/// The traits class to build the iterator to load data from global memory for B^T.
typedef IgemmContiguousGlobalTileTraits<
GemmOperand::kB,
// The layout.
MatrixLayout::kRowMajor,
// The pointer is int8_t const.
int8_t const,
// The tile has size KxN in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kH>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
4>
GlobalTileTraits;
/// The traits class to build the iterator to store data to shared memory for B^T.
typedef GemmSharedStoreTileAbTraits<
// The pointer is int8_t.
int8_t,
// The tile has size KxN in GEMM's terminology.
Shape<GemmConfig_::kStages, GemmConfig_::OutputTile::kD / 4, GemmConfig_::OutputTile::kH * 4>,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
kScalarsPerStsB>
SharedStoreTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
struct IgemmTransformerA {};
template <typename Iterator_>
struct IgemmTransformerA<MatrixLayout::kRowMajor, Iterator_> {
typedef Copy<typename Iterator_::Fragment> Transformer;
};
template <typename Iterator_>
struct IgemmTransformerA<MatrixLayout::kColumnMajor, Iterator_> {
typedef IgemmSwizzle<Iterator_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename Iterator_>
struct IgemmTransformerB {};
template <typename Iterator_>
struct IgemmTransformerB<MatrixLayout::kColumnMajor, Iterator_> {
typedef Copy<typename Iterator_::Fragment> Transformer;
};
template <typename Iterator_>
struct IgemmTransformerB<MatrixLayout::kRowMajor, Iterator_> {
typedef IgemmSwizzle<Iterator_> Transformer;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_,
/// The output type.
typename ScalarD_,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
/// The index.
typename Index_ = int>
struct IgemmTraitsHelper {
/// The IGEMM config.
typedef IgemmConfig<OutputTile_, ScalarD_, AccumulatorsPerThread_> GemmConfig;
/// The GEMM config for A.
typedef IgemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
/// The GEMM config for B.
typedef IgemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
/// The iterator to load A from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
GlobalLoadIteratorA;
/// The default transformer for A.
typedef typename IgemmTransformerA<GemmTileTraitsHelperA::kLayout,
GlobalLoadIteratorA>::Transformer GlobalTransformerA;
/// The iterator to store A to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorA;
/// The stream to load A from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
GlobalLoadStreamA;
/// The iterator to load B from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
GlobalLoadIteratorB;
// The default transformer for B.
typedef typename IgemmTransformerB<GemmTileTraitsHelperB::kLayout,
GlobalLoadIteratorB>::Transformer GlobalTransformerB;
/// The iterator to store B to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorB;
/// The stream to load B from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
GlobalLoadStreamB;
/// The iterator to load A from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorA;
/// The stream to load A from shared memory.
typedef SharedLoadStream<SharedLoadIteratorA, Copy<typename SharedLoadIteratorA::Fragment> >
SharedLoadStreamA;
/// The iterator to load B from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorB;
/// The stream to load B from shared memory.
typedef SharedLoadStream<SharedLoadIteratorB, Copy<typename SharedLoadIteratorB::Fragment> >
SharedLoadStreamB;
/// The multiply-add functor.
typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
/// The object to clear accumulators.
typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
/// The epilogue.
typedef IgemmEpilogue<IgemmEpilogueTraits<GemmConfig, EpilogueFunctor_> > Epilogue;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename ScalarD_>
struct IgemmEpilogueScalar {
typedef float Scalar;
};
template <>
struct IgemmEpilogueScalar<int> {
typedef int Scalar;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_ = Shape<32, 128, 128>,
/// The output type.
typename ScalarD_ = int,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_ = LinearScaling<typename IgemmEpilogueScalar<ScalarD_>::Scalar>,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<32, 8, 8>,
/// The index.
typename Index_ = int,
/// The helper class.
typename Helper_ = IgemmTraitsHelper<kLayoutA_,
kLayoutB_,
OutputTile_,
ScalarD_,
EpilogueFunctor_,
AccumulatorsPerThread_,
Index_> >
struct IgemmTraits : public GemmTraits<
// The config.
typename Helper_::GemmConfig,
// The stream to load A from global memory to shared memory.
typename Helper_::GlobalLoadStreamA,
// The stream to load B from global memory to shared memory.
typename Helper_::GlobalLoadStreamB,
// The stream to load A from shared memory.
typename Helper_::SharedLoadStreamA,
// The stream to load B from shared memory.
typename Helper_::SharedLoadStreamB,
// The epilogue.
typename Helper_::Epilogue,
// The block swizzle to reorganize the grid.
IdentityBlockSwizzle,
// The index.
Index_,
// The tool used to clear accumulators.
typename Helper_::ClearAccumulators> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass

View File

@@ -1,86 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements the BLAS linear scaling function alpha*AB + beta*C
*/
#pragma once
#include <cutlass/fragment_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Functor to compute linear combination of fragments
template <typename Scalar_, typename FragmentMultiplyAdd_ = FragmentMultiplyAdd<Scalar_> >
struct LinearScaling {
/// The scalar.
typedef Scalar_ Scalar;
/// The adapter.
typedef FragmentMultiplyAdd_ FragmentMultiplyAdd;
/// The parameters.
struct Params {
/// The alpha/beta scaling params.
Scalar alpha, beta;
/// Initialize the parameters.
template <typename GemmDesc_>
CUTLASS_HOST_DEVICE int initialize(GemmDesc_ const& desc) {
alpha = desc.alpha;
beta = desc.beta;
return 0;
}
};
/// Ctor.
CUTLASS_DEVICE LinearScaling(Params const& params) : alpha(params.alpha), beta(params.beta) {}
/// Evaluate the functor.
template <typename Fragment_>
CUTLASS_DEVICE void evaluate(Fragment_ const& accum, Fragment_& output) {
FragmentMultiplyAdd mad;
mad.multiply(alpha, accum, output);
}
/// Evaluate the functor.
template <typename Fragment_>
CUTLASS_DEVICE void evaluate(Fragment_ const& accum, Fragment_ const& old, Fragment_& output) {
FragmentMultiplyAdd mad;
Fragment_ tmp;
mad.multiply(beta, old, tmp);
mad.multiply_add(alpha, accum, tmp, output);
}
/// The alpha/beta scaling factors.
Scalar alpha, beta;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
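In scalar terms, the two evaluate() overloads compute output = alpha * accum and output = alpha * accum + beta * old elementwise over the fragment. A plain-C++ model of the arithmetic (a sketch only; the real functor operates on whole fragments through FragmentMultiplyAdd):

// Scalar model of LinearScaling::evaluate.
float evaluate(float alpha, float accum) {
  return alpha * accum;                    // mad.multiply(alpha, accum, output)
}
float evaluate(float alpha, float beta, float accum, float old_c) {
  float tmp = beta * old_c;                // mad.multiply(beta, old, tmp)
  return alpha * accum + tmp;              // mad.multiply_add(alpha, accum, tmp, output)
}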

View File

@ -1,127 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of single-precision GEMM.
*/
#pragma once
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/gemm_epilogue.h>
#include <cutlass/gemm/gemm_epilogue_traits.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/gemm/gemm_shared_tile.h>
#include <cutlass/gemm/gemm_traits.h>
#include <cutlass/gemm/thread_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The tile size for the GEMM KxNxM.
typename OutputTile_,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_ = 1,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_ = 1>
struct SgemmConfig
: public GemmConfig<
/// The scalar type for A.
float,
/// The scalar type for B.
float,
/// The scalar type for C.
float,
/// The scalar type for D.
float,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
ThreadMultiplyAdd<AccumulatorsPerThread_, Shape<1, 4, 8>, float, float, float>,
/// The number of scalars per LDG for A.
kScalarsPerLdgA_,
/// The number of scalars per STS for A.
kScalarsPerLdgA_,
/// The number of scalars per LDS for A.
4,
/// The number of scalars per LDG for B.
kScalarsPerLdgB_,
/// The number of scalars per STS for B.
kScalarsPerLdgB_,
/// The number of scalars per LDS for B.
4,
/// The number of scalars per LDG for C and STG for D.
1,
/// The number of scalars per STS for D.
4,
/// The number of scalars per LDS for D.
1,
/// The number of stages in shared memory.
2> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_ = Shape<8, 128, 128>,
/// The functor to use in the epilogue.
typename EpilogueFunctor_ = LinearScaling<float>,
/// The number of accumulators per thread.
typename AccumulatorsPerThread_ = Shape<8, 8, 8>,
/// The number of floats loaded in one LDG for A.
int kScalarsPerLdgA_ = 1,
/// The number of floats loaded in one LDG for B.
int kScalarsPerLdgB_ = 1,
/// The index.
typename Index_ = int,
/// The SGEMM config.
typename GemmConfig_ =
SgemmConfig<OutputTile_, AccumulatorsPerThread_, kScalarsPerLdgA_, kScalarsPerLdgB_>,
/// The traits class for the epilogue.
typename GemmEpilogueTraits_ =
SimplifiedGemmEpilogueTraits<GemmConfig_, EpilogueFunctor_, Index_> >
struct SgemmTraits : public SimplifiedGemmTraits<
// The layout for A.
kLayoutA_,
// The layout for B.
kLayoutB_,
// The config.
GemmConfig_,
// The epilogue.
GemmEpilogue<GemmEpilogueTraits_>,
// The index.
Index_> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
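For reference, this is how SgemmTraits is typically consumed on the host side. A minimal sketch, assuming the Gemm<Traits_> wrapper and the Params::initialize overload used by the basic_gemm example of this era (verify the exact signature in cutlass/gemm/gemm.h):

#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/sgemm_traits.h>

// Column-major A and B, 128x128 output tiles, K-blocks of 8 (the defaults above).
typedef cutlass::gemm::SgemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                   cutlass::MatrixLayout::kColumnMajor,
                                   cutlass::Shape<8, 128, 128> >
    ConcreteSgemmTraits;

cudaError_t run_sgemm(int m, int n, int k, float alpha, float const* A, int lda,
                      float const* B, int ldb, float beta, float* C, int ldc) {
  typedef cutlass::gemm::Gemm<ConcreteSgemmTraits> Gemm;
  typename Gemm::Params params;
  // D aliases C, as in the classic example.
  int result = params.initialize(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc, C, ldc);
  if (result) {
    return cudaErrorInvalidValue;
  }
  Gemm::launch(params);
  return cudaGetLastError();
}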

View File

@ -1,84 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Template implementing matrix multiply-add operations on fragments.
*/
#pragma once
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Template performing matrix multiply-add operation within a thread
template <typename AccumulatorsPerThread_,
typename ThreadsPerWarp_,
typename ScalarA_,
typename ScalarB_,
typename ScalarC_>
struct ThreadMultiplyAdd {
/// The shape of the instruction.
typedef Shape<1, 1, 1, 1> InstructionShape;
/// The number of accumulators per thread.
typedef AccumulatorsPerThread_ AccumulatorsPerThread;
/// The number of threads per warp.
typedef ThreadsPerWarp_ ThreadsPerWarp;
/// The number of accumulators per warp.
typedef typename ShapeMul<AccumulatorsPerThread, ThreadsPerWarp>::Shape AccumulatorsPerWarp;
/// The type for A.
typedef ScalarA_ ScalarA;
/// The fragment for A.
typedef Fragment<ScalarA, AccumulatorsPerThread::kW> FragmentA;
/// The type for B.
typedef ScalarB_ ScalarB;
/// The fragment for B.
typedef Fragment<ScalarB, AccumulatorsPerThread::kH> FragmentB;
/// The type for C and D.
typedef ScalarC_ ScalarC;
/// The accumulators.
typedef Fragment<ScalarC, AccumulatorsPerThread::kH * AccumulatorsPerThread::kW, 16> Accumulators;
/// Ctor.
CUTLASS_DEVICE ThreadMultiplyAdd() {}
/// Multiply-add : d = a*b + c.
CUTLASS_DEVICE void multiply_add(FragmentA const& a,
FragmentB const& b,
Accumulators const& c,
Accumulators& d) {
for (int j = 0; j < AccumulatorsPerThread::kH; ++j) {
for (int i = 0; i < AccumulatorsPerThread::kW; ++i) {
d[j * AccumulatorsPerThread::kW + i] = a[i] * b[j] + c[j * AccumulatorsPerThread::kW + i];
}
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
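The loop nest in multiply_add is a per-thread outer product: a kW-vector of A against a kH-vector of B, accumulated into a kH x kW register tile laid out row-by-row at index j * kW + i. A host-side model with the default Shape<8, 8, 8> accumulator sizes, just to pin down the indexing:

// Host model of ThreadMultiplyAdd::multiply_add for AccumulatorsPerThread kH = kW = 8.
void thread_multiply_add(float const a[8], float const b[8],
                         float const c[64], float d[64]) {
  for (int j = 0; j < 8; ++j) {        // rows of the per-thread tile (H)
    for (int i = 0; i < 8; ++i) {      // columns of the per-thread tile (W)
      d[j * 8 + i] = a[i] * b[j] + c[j * 8 + i];  // rank-1 update, then accumulate
    }
  }
}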

View File

@ -1,161 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of WMMA GEMM's epilogue phase.
*/
#pragma once
#include <cutlass/wmma_matrix.h>
#ifdef CUTLASS_USE_WMMA_API
#include <cutlass/convert.h>
#include <cutlass/coord.h>
#include <cutlass/gemm/gemm_global_stream.h>
#include <cutlass/gemm/gemm_shared_stream.h>
#include <cutlass/gemm/linear_scaling.h>
#include <cutlass/gemm/wmma_gemm_global_tile.h>
#include <cutlass/gemm/wmma_gemm_shared_tile.h>
#include <cutlass/reshape_tile.h>
#include <cutlass/tile_iterator.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_, typename EpilogueFunctor_, typename Index_ = int>
struct WmmaGemmEpilogueTraitsHelper {
/// The scalar.
typedef typename EpilogueFunctor_::Scalar Scalar;
/// The output tile.
typedef typename GemmConfig_::OutputTile OutputTile;
/// The number of WMMAs in the H dimension.
static int const kWmmasPerH =
GemmConfig_::AccumulatorsPerWarp::kH / GemmConfig_::InstructionShape::kH;
/// The number of iterations in the epilogue. That's the number of "horizontal" WMMAs.
typedef Shape<1, 1, kWmmasPerH> Iterations;
/// The iteration strides in the H/W dimensions.
typedef Shape<0, 0, 0> Delta;
/// The functor to do the math in the epilogue.
typedef EpilogueFunctor_ Functor;
/// The traits class to build the iterator to store to shared memory for D.
typedef WmmaGemmSharedStoreTileDTraits<
// The output layout.
MatrixLayout::kColumnMajor,
// The pointer is float.
typename Functor::Scalar,
// The output tile size.
typename GemmConfig_::OutputTile,
// The number of warps.
typename GemmConfig_::Warps,
// The shape of the instruction.
typename GemmConfig_::InstructionShape>
SharedStoreTileTraits;
typedef WmmaMatrix<GemmOperand::kC,
MatrixLayout::kColumnMajor,
Scalar,
typename GemmConfig_::InstructionShape>
WmmaMatrix;
/// The iterator to store D to shared memory.
typedef TileStoreIterator<SharedStoreTileTraits,
typename SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared,
Index_,
WmmaMatrix,
IteratorFragment::kWmmaMatrix>
SharedStoreIteratorD;
/// The shared store transformer for D.
typedef Copy<typename SharedStoreIteratorD::Fragment> SharedStoreTransformerD;
/// The traits class to build the iterator to load from shared memory for D.
typedef WmmaGemmSharedLoadTileDTraits<
// The pointer.
typename Functor::Scalar,
// The tile size.
typename SharedStoreIteratorD::Tile,
// The number of threads.
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDS.
GemmConfig_::kScalarsPerLdsD>
SharedLoadTileTraits;
/// The iterator to load D from shared memory.
typedef TileLoadIterator<SharedLoadTileTraits,
typename SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedLoadIteratorD;
/// The traits class to build the iterator to load data from global memory for C^N.
typedef WmmaGemmGlobalIteratorCdTraits<
// The pointer is float const.
typename GemmConfig_::ScalarC const,
// The tile has size (N / Iterations)xM in GEMM's terminology.
Shape<1,
GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgC>
GlobalLoadTileTraits;
/// The iterator to load C.
typedef WmmaGemmGlobalIteratorCd<GlobalLoadTileTraits, Index_> GlobalLoadIteratorC;
/// The transformer for C.
typedef Copy<typename GlobalLoadIteratorC::Fragment> GlobalTransformerC;
/// The traits class to build the iterator to store data to global memory for D^N.
typedef WmmaGemmGlobalIteratorCdTraits<
// The pointer is float.
typename GemmConfig_::ScalarD,
// The tile has size (N / Iterations)xM in GEMM's terminology.
Shape<1,
GemmConfig_::OutputTile::kH / ShapeCount<Iterations>::kCount,
GemmConfig_::OutputTile::kW>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, ShapeCount<typename GemmConfig_::Warps>::kCount, GemmConfig_::kWarpSize>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerStgD>
GlobalStoreTileTraits;
/// The iterator to store D.
typedef WmmaGemmGlobalIteratorCd<GlobalStoreTileTraits, Index_> GlobalStoreIteratorD;
/// The transformer for D.
typedef Copy<typename GlobalStoreIteratorD::Fragment> GlobalTransformerD;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
#endif // defined CUTLASS_USE_WMMA_API
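To make kWmmasPerH and Iterations concrete, a worked example with the default shapes used by WmmaGemmTraits later in this change (AccumulatorsPerWarp = Shape<64, 32, 64>, InstructionShape = Shape<16, 16, 16>; these values are assumed from those defaults):

// kWmmasPerH = AccumulatorsPerWarp::kH / InstructionShape::kH = 32 / 16 = 2,
// so the epilogue visits two "horizontal" WMMA tiles per warp:
// Iterations = Shape<1, 1, 2>, with Delta = Shape<0, 0, 0> (no extra stride).
static int const kWmmasPerHExample = 32 / 16;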

View File

@ -1,203 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines tile iterator traits for loading a thread block-level tile from global memory.
*/
#pragma once
#include <cutlass/gemm/gemm_global_tile.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Tile_, typename Threads_, int kAccessSize_>
struct WmmaGemmGlobalIteratorCdTraits : public GemmGlobalTileTraits<GemmOperand::kC,
MatrixLayout::kColumnMajor,
Scalar_,
Tile_,
Threads_,
kAccessSize_> {
/// The base class.
typedef GemmGlobalTileTraits<GemmOperand::kC,
MatrixLayout::kColumnMajor,
Scalar_,
Tile_,
Threads_,
kAccessSize_>
Base;
/// Override the strides in each dimension between different loads/stores.
typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> Delta;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int thread_offset_h = threadIdx.x / Base::Threads::kW;
int thread_offset_w = threadIdx.x % Base::Threads::kW * Base::ThreadsDelta::kW;
return make_Coord(0, thread_offset_h, thread_offset_w, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename TileTraits_, typename Index_ = int>
struct WmmaGemmGlobalIteratorCd : public TileIteratorBase<TileTraits_,
typename TileTraits_::Scalar,
IteratorAdvance::kH,
MemorySpace::kGlobal,
Index_> {
/// This class.
typedef WmmaGemmGlobalIteratorCd<TileTraits_, Index_> This_;
/// The traits.
typedef TileTraits_ Traits;
/// The base class.
typedef TileIteratorBase<Traits,
typename TileTraits_::Scalar,
IteratorAdvance::kH,
MemorySpace::kGlobal,
Index_>
Base;
/// Override the strides in each dimension between different loads/stores.
typedef Shape<0, 0, Base::Delta::kW, Base::Delta::kC> ImmediateOffsetStrides;
/// The layout.
static MatrixLayout::Kind const kLayout = TileTraits_::kLayout;
/// The scalar.
typedef typename TileTraits_::Scalar Scalar;
/// The pointer.
typedef typename TileTraits_::Pointer Pointer;
/// The threads.
typedef typename TileTraits_::Threads Threads;
/// The index.
typedef Index_ Index;
/// The thread offset functor.
typedef typename TileTraits_::ThreadOffset ThreadOffset;
/// The params.
struct Params {
/// The pointer.
Pointer pointer;
/// The stride in the H dimension to setup the thread in the block.
Index stride_h;
/// The strides to increment the pointer.
Index inc_h, inc_advance;
/// The column offset to compute the predicate for the columns.
Index predicate_offset;
/// The strides to increment the predicate offset.
Index predicate_inc_h, predicate_inc_advance;
/// Set up the params.
CUTLASS_HOST_DEVICE int initialize(
Pointer pointer, Index ld, Index n, Index epilogue_stride_w, Index epilogue_delta_w) {
// The pointer.
this->pointer = pointer;
// Set up the base stride. One "group of threads" per column.
stride_h = ld;
// Each thread outputs 1 column per iteration.
inc_h = ld * TileTraits_::Threads::kH;
inc_advance = inc_h + epilogue_stride_w;
predicate_offset = n;
predicate_inc_h = TileTraits_::Threads::kH;
predicate_inc_advance = predicate_inc_h + epilogue_delta_w;
// It worked.
return 0;
}
};
Params params;
Coord<4> thread_offset;
/// Ctor.
CUTLASS_DEVICE WmmaGemmGlobalIteratorCd() {}
/// Ctor.
CUTLASS_DEVICE WmmaGemmGlobalIteratorCd(Params const& params,
const Coord<3>& bounds,
const Coord<3>& block,
int const pointer_offset = 0,
int const pred_offset = 0,
ThreadOffset thread_offset_func = ThreadOffset())
: params(params) {
thread_offset = thread_offset_func();
// Each warp works on a different column of the tile.
int const h = thread_offset[1] + block[1];
// Each lane writes a different element.
int const w = thread_offset[2] + block[2];
// Set up the pointer.
this->params.pointer += ((h * params.stride_h + w) + pointer_offset);
// Prepare the vector of predicates.
for (int i = 0; i < Base::Iterations::kW; ++i) {
predicates.set(i, w + i * Base::Delta::kW < bounds[2]);
}
this->params.predicate_offset -= (h + pred_offset);
}
/// Increment the pointer in the C dimension.
CUTLASS_DEVICE void inc_c() {}
/// Increment the pointer in the W dimension.
CUTLASS_DEVICE void inc_w() {}
/// Increment the pointer in the H dimension.
CUTLASS_DEVICE void inc_h() {
params.pointer += params.inc_h;
params.predicate_offset -= params.predicate_inc_h;
}
/// Increment the pointer in the D dimension.
CUTLASS_DEVICE void inc_d() {}
/// Increment the pointer to move to the next iteration.
CUTLASS_DEVICE void inc_advance() {
params.pointer += params.inc_advance;
params.predicate_offset -= params.predicate_inc_advance;
}
/// Test the predicate.
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const {
return predicates.at(w) && params.predicate_offset > 0;
}
/// Returns the raw pointer
CUTLASS_HOST_DEVICE
Pointer data() { return params.pointer; }
CUTLASS_HOST_DEVICE
Pointer const data() const { return params.pointer; }
/// The predicates for the row.
cutlass::PredicateVector<Base::Iterations::kW> predicates;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
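The iterator's bounds handling splits into a per-column predicate vector, fixed at construction, and a row counter (predicate_offset) that starts at n and is decremented as the iterator advances in H. A scalar model of valid() under those assumptions:

// Scalar model of WmmaGemmGlobalIteratorCd's valid() (parameter names hypothetical).
// Column predicate i is fixed at construction; the row check reuses the
// predicate_offset counter that inc_h()/inc_advance() decrement.
bool valid(int i, int w, int delta_w, int bound_w, int predicate_offset) {
  bool column_ok = w + i * delta_w < bound_w;  // predicates.set(i, ...)
  bool row_ok = predicate_offset > 0;
  return column_ok && row_ok;
}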

View File

@ -1,108 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Implements the warp-level matrix multiply-accumulate operation using the CUDA WMMA API.
*/
#pragma once
#include <cutlass/wmma_matrix.h>
#ifdef CUTLASS_USE_WMMA_API
#include <cutlass/fragment.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MatrixLayout::Kind kLayoutA_,
typename ScalarA_,
MatrixLayout::Kind kLayoutB_,
typename ScalarB_,
MatrixLayout::Kind kLayoutC_,
typename ScalarC_,
typename AccumulatorsPerWarp_,
typename InstructionShape_>
struct WmmaGemmMultiplyAdd {
/// The shape of the instruction.
typedef InstructionShape_ InstructionShape;
/// The number of threads per warp. That's a dummy configuration.
typedef Shape<1, InstructionShape_::kH, InstructionShape_::kW> ThreadsPerWarp;
/// The dimensions.
typedef AccumulatorsPerWarp_ AccumulatorsPerWarp;
/// The type for A.
typedef ScalarA_ ScalarA;
/// The type for B.
typedef ScalarB_ ScalarB;
/// The type for C and D.
typedef ScalarC_ ScalarC;
/// The number of iterations.
typedef typename ShapeDiv<AccumulatorsPerWarp, InstructionShape>::Shape Iterations;
/// The element for A.
typedef WmmaMatrix<GemmOperand::kA, kLayoutA_, ScalarA, InstructionShape> ElementA;
/// The fragment for A.
typedef Fragment<ElementA, Iterations::kW> FragmentA;
/// The element for B.
typedef WmmaMatrix<GemmOperand::kB, kLayoutB_, ScalarB, InstructionShape> ElementB;
/// The fragment for B.
typedef Fragment<ElementB, Iterations::kH> FragmentB;
/// The element for C.
typedef WmmaMatrix<GemmOperand::kC, kLayoutC_, ScalarC, InstructionShape> ElementC;
/// The fragment for C.
typedef Fragment<ElementC, Iterations::kH * Iterations::kW> Accumulators;
/// Ctor.
CUTLASS_DEVICE WmmaGemmMultiplyAdd() {}
/// Multiply-add : d = a*b + c.
CUTLASS_DEVICE void multiply_add(FragmentA const& a,
FragmentB const& b,
Accumulators const& c,
Accumulators& d) {
for (int j = 0; j < Iterations::kH; ++j) {
for (int i = 0; i < Iterations::kW; ++i) {
// The input elements.
ElementA const& elt_a = a[i];
ElementB const& elt_b = b[j];
ElementC const& elt_c = c[j * Iterations::kW + i];
// The output element.
ElementC& elt_d = d[j * Iterations::kW + i];
// The wmma instruction.
nvcuda::wmma::mma_sync(elt_d, elt_a, elt_b, elt_c);
}
}
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
#endif // defined CUTLASS_USE_WMMA_API
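Each inner iteration above issues one nvcuda::wmma::mma_sync on fragments of the instruction shape. For orientation, a self-contained warp-wide kernel using the same API directly (standard CUDA 9+ on sm_70 or newer, independent of the CUTLASS wrappers):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 product: C = A * B (half in, float out).
__global__ void wmma_16x16x16(half const* a, half const* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
  wmma::fill_fragment(fc, 0.0f);
  wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
  wmma::load_matrix_sync(fb, b, 16);
  wmma::mma_sync(fc, fa, fb, fc);      // the same instruction issued per (i, j) above
  wmma::store_matrix_sync(c, fc, 16, wmma::mem_col_major);
}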

View File

@ -1,240 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines iterator traits for efficiently loading and storing fragments to and from shared
memory, specialized for WMMA GEMM.
*/
#pragma once
#include <cutlass/wmma_matrix.h>
#ifdef CUTLASS_USE_WMMA_API
#include <cutlass/gemm/gemm_operand.h>
#include <cutlass/reshape_tile.h>
namespace cutlass {
namespace gemm {
template <class>
struct Debug {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MatrixLayout::Kind kLayout_,
typename Scalar_,
typename Tile_,
typename Warps_,
int kWarpStride_,
typename Iterations_,
typename Delta_,
typename WmmaShape_>
struct WmmaGemmSharedLoadTileATraits {
/// The operand.
static GemmOperand::Kind const kOperand = GemmOperand::kA;
/// The layout.
static MatrixLayout::Kind const kLayout = kLayout_;
/// The scalar.
typedef Scalar_ Scalar;
/// The pointer.
typedef Scalar const* Pointer;
/// The access size
static int const kAccessSize = 1;
/// The tile with skew.
typedef Tile_ Tile;
/// The number of warps.
typedef Warps_ Warps;
/// The warp strides.
static int const kWarpStride = kWarpStride_;
/// The number of iterations.
typedef Iterations_ Iterations;
/// The strides between iterations.
typedef Delta_ Delta;
/// The strides between iterations.
typedef Delta_ ImmediateOffsetStrides;
/// The shape of the WMMA instruction.
typedef WmmaShape_ WmmaShape;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// ThreadOffset
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// The warp id.
int const warp = threadIdx.x / kWarpSize;
// The offset.
int const offset = warp % Warps::kW * kWarpStride;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MatrixLayout::Kind kLayout_,
typename Scalar_,
typename Tile_,
typename Warps_,
int kWarpStride_,
typename Iterations_,
typename Delta_,
typename WmmaShape_>
struct WmmaGemmSharedLoadTileBTraits {
/// The operand.
static GemmOperand::Kind const kOperand = GemmOperand::kB;
/// The layout.
static MatrixLayout::Kind const kLayout = kLayout_;
/// The scalar.
typedef Scalar_ Scalar;
/// The pointer.
typedef Scalar const* Pointer;
/// The access size
static int const kAccessSize = 1;
/// The tile with skew.
typedef Tile_ Tile;
/// The number of warps.
typedef Warps_ Warps;
/// The warp strides.
static int const kWarpStride = kWarpStride_;
/// The number of iterations.
typedef Iterations_ Iterations;
/// The strides between iterations.
typedef Delta_ Delta;
/// The strides between iterations.
typedef Delta_ ImmediateOffsetStrides;
/// The shape of the WMMA instruction.
typedef WmmaShape_ WmmaShape;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// ThreadOffset
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// The warp id.
int const warp = threadIdx.x / kWarpSize;
// The offset.
int const offset = warp / Warps::kW * kWarpStride;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MatrixLayout::Kind kLayout_,
typename Scalar_,
typename OutputTile_,
typename Warps_,
typename WmmaShape_,
int kSkew_ = 0>
struct WmmaGemmSharedStoreTileDTraits {
/// The operand.
static GemmOperand::Kind const kOperand = GemmOperand::kC;
/// The layout.
static MatrixLayout::Kind const kLayout = kLayout_;
/// The scalar.
typedef Scalar_ Scalar;
/// The access size
static int const kAccessSize = 1;
/// The pointer.
typedef Scalar* Pointer;
/// The number of warps.
typedef Warps_ Warps;
/// The shape of the WMMA instruction.
typedef WmmaShape_ WmmaShape;
/// The skew.
static int const kSkew = kSkew_;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The tile with skew.
typedef Shape<1, Warps_::kH * WmmaShape_::kH, OutputTile_::kW + kSkew_> Tile;
/// The number of iterations needed to store the tile.
typedef Shape<1, 1, OutputTile_::kW / Warps::kW / WmmaShape_::kW> Iterations;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, 0, Warps::kW * WmmaShape_::kW, 0> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, 0, Warps::kW * WmmaShape_::kW, 0> ImmediateOffsetStrides;
/// ThreadOffset
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// The warp id.
int const warp = threadIdx.x / kWarpSize;
// The starting column.
int const h = warp / Warps::kW * WmmaShape::kH;
// The w.
int const w = warp % Warps::kW * WmmaShape::kW;
// The offset.
int const offset = h * Tile::kW + w;
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, typename Tile_, typename Threads_, int kScalarsPerLds_>
struct WmmaGemmSharedLoadTileDTraits {
/// The scalar.
typedef Scalar_ Scalar;
/// The pointer.
typedef Scalar const* Pointer;
/// The access size
static int const kAccessSize = kScalarsPerLds_;
/// The tile.
typedef typename ReshapeTile<Tile_, kScalarsPerLds_>::Tile Tile;
/// The threads.
typedef typename ReshapeThreads<Tile, Threads_>::Threads Threads;
/// The threads strides.
typedef Shape<1, Tile::kW * Tile::kC, Tile::kC> ThreadsStrides;
/// The memory space.
static MemorySpace::Kind const kMemorySpace = MemorySpace::kShared;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kScalarsPerLds_> Delta;
/// The strides in each dimension between different loads/stores.
typedef Shape<0, Threads::kH * ShapeCount<Tile>::kWc, Threads::kW * kScalarsPerLds_>
ImmediateOffsetStrides;
/// The number of iterations needed to load/store the tile.
typedef Shape<1, Tile::kH / Threads::kH, Tile::kW / Threads::kW, Tile::kC / kScalarsPerLds_>
Iterations;
/// ThreadOffset
struct ThreadOffset {
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
// The offset.
int const offset = ComputeThreadOffsetFromStrides<Threads, ThreadsStrides>::get();
return make_Coord(0, 0, offset, 0);
}
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
#endif // defined CUTLASS_USE_WMMA_API
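The ThreadOffset of WmmaGemmSharedStoreTileDTraits maps each warp to the top-left corner of its WMMA tile inside the (possibly skewed) shared-memory tile. A scalar model with assumed concrete values (Warps = Shape<1, 2, 4>, WmmaShape = 16x16x16, Tile::kW = OutputTile::kW + kSkew = 128 with the default skew of 0):

// warp 0..7 -> (h, w) corner -> linear offset h * Tile::kW + w.
int store_d_offset(int tid) {
  int const kWarpSize = 32, kWarpsW = 4, kWmmaH = 16, kWmmaW = 16, kTileW = 128;
  int warp = tid / kWarpSize;
  int h = warp / kWarpsW * kWmmaH;   // warps 0-3 write rows 0-15, warps 4-7 rows 16-31
  int w = warp % kWarpsW * kWmmaW;   // columns 0, 16, 32, 48 within the tile
  return h * kTileW + w;
}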

View File

@ -1,574 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
 * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines structural properties of GEMM targeting the WMMA API in CUDA.
*/
#pragma once
#include <cutlass/wmma_matrix.h>
#ifdef CUTLASS_USE_WMMA_API
#include <cutlass/convert.h>
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/gemm_epilogue.h>
#include <cutlass/gemm/gemm_epilogue_traits.h>
#include <cutlass/gemm/gemm_global_tile.h>
#include <cutlass/gemm/gemm_shared_tile.h>
#include <cutlass/gemm/gemm_traits.h>
#include <cutlass/gemm/wmma_gemm_epilogue_traits.h>
#include <cutlass/gemm/wmma_gemm_global_tile.h>
#include <cutlass/gemm/wmma_gemm_multiply_add.h>
namespace cutlass {
namespace gemm {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The tile size for the GEMM KxNxM.
typename OutputTile_,
/// The output type.
typename ScalarC_,
/// The accumulator type.
typename Accumulator_,
/// The number of accumulators per warp.
typename AccumulatorsPerWarp_,
/// The shape of the WMMA instruction.
typename InstructionShape_,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_>
struct WmmaGemmConfig : public GemmConfig<
/// The scalar type for A.
half,
/// The scalar type for B.
half,
/// The scalar type for C.
ScalarC_,
/// The scalar type for D.
ScalarC_,
/// The tile size for the GEMM KxNxM.
OutputTile_,
/// The functor to do the math in the main loop.
WmmaGemmMultiplyAdd<kLayoutA_,
half,
kLayoutB_,
half,
MatrixLayout::kColumnMajor,
Accumulator_,
AccumulatorsPerWarp_,
InstructionShape_>,
/// The number of scalars per LDG for A.
kScalarsPerLdgA_,
/// The number of scalars per STS for A.
kScalarsPerLdgA_,
/// The number of scalars per LDS for A.
8,
/// The number of scalars per LDG for B.
kScalarsPerLdgB_,
/// The number of scalars per STS for B.
kScalarsPerLdgB_,
/// The number of scalars per LDS for B.
8,
/// The number of scalars per LDG for C and STG for D.
16 / sizeof(ScalarC_),
/// The number of scalars per STS for D.
16 / sizeof(ScalarC_),
/// The number of scalars per LDS for D.
16 / sizeof(ScalarC_),
/// The number of stages in shared memory.
1> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct WmmaGemmTileTraitsHelperA {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct WmmaGemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_>
: public GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperA<MatrixLayout::kColumnMajor, GemmConfig_> Base;
/// The skew.
static int const kSkew = 16 / sizeof(typename Base::MultiplyAddScalar);
/// The shared tile size.
typedef Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD,
GemmConfig_::OutputTile::kW + kSkew>
Tile;
/// WMMA matrix
typedef WmmaMatrix<GemmOperand::kA,
MatrixLayout::kColumnMajor,
typename Base::MultiplyAddScalar,
typename GemmConfig_::InstructionShape>
WmmaMatrix;
/// The traits class to build the iterator to store data to shared memory for A^N.
typedef GemmSharedStoreTileAbTraits<
// The pointer.
typename Base::MultiplyAddScalar,
// The tile has size KxM in GEMM's terminology.
Tile,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename Base::GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsA>
SharedStoreTileTraits;
/// The number of elements loaded in one LDG.
static int const kScalarsPerW = GemmConfig_::InstructionShape::kW * GemmConfig_::Warps::kW;
/// The number of scalars loaded per iteration.
static int const kScalarsPerIteration = Tile::kW * GemmConfig_::InstructionShape::kD;
/// The traits class to build the iterator to load from shared memory for A.
typedef WmmaGemmSharedLoadTileATraits<
// The layout of the matrix.
MatrixLayout::kColumnMajor,
// The pointer.
typename Base::MultiplyAddScalar,
// The output tile size.
Tile,
// The number of warps.
typename GemmConfig_::Warps,
// The strides between warps.
GemmConfig_::InstructionShape::kW,
// The number of iterations to load the data.
Shape<1, 1, GemmConfig_::OutputTile::kW / kScalarsPerW>,
// The stride between iterations.
Shape<kScalarsPerIteration, 0, kScalarsPerW, 0>,
// The shape of the instruction.
typename GemmConfig_::InstructionShape>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct WmmaGemmTileTraitsHelperA<MatrixLayout::kRowMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kRowMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarA Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarA MultiplyAddScalar;
/// WMMA matrix
typedef WmmaMatrix<GemmOperand::kA,
MatrixLayout::kRowMajor,
MultiplyAddScalar,
typename GemmConfig_::InstructionShape>
WmmaMatrix;
/// The traits class to build the iterator to load data from global memory for A^T.
typedef GemmGlobalTileTraits<
// That's A.
GemmOperand::kA,
// A is row-major.
MatrixLayout::kRowMajor,
// The pointer is float const.
Scalar const,
// The tile has size KxM in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kW, GemmConfig_::OutputTile::kD>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgA>
GlobalTileTraits;
/// The skew.
static int const kSkew = 16 / sizeof(MultiplyAddScalar);
/// The tile.
typedef Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kW,
GemmConfig_::OutputTile::kD + kSkew>
Tile;
/// The traits class to build the iterator to store data to shared memory for A^N.
typedef GemmSharedStoreTileAbTraits<
// The pointer.
MultiplyAddScalar,
// The tile has size KxM in GEMM's terminology.
Tile,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsA>
SharedStoreTileTraits;
/// The number of elements loaded in one LDG.
static int const kScalarsPerW = GemmConfig_::InstructionShape::kW * GemmConfig_::Warps::kW;
/// The traits class to build the iterator to load from shared memory for A.
typedef WmmaGemmSharedLoadTileATraits<
// The layout of the matrix.
MatrixLayout::kRowMajor,
// The pointer.
MultiplyAddScalar,
// The tile in shared memory.
Tile,
// The number of warps.
typename GemmConfig_::Warps,
// The strides between warps.
GemmConfig_::InstructionShape::kW * Tile::kW,
// The number of iterations to load the data.
Shape<1, 1, GemmConfig_::OutputTile::kW / kScalarsPerW>,
// The stride between iterations.
Shape<GemmConfig_::InstructionShape::kD, 0, kScalarsPerW * Tile::kW>,
// The shape of the instruction.
typename GemmConfig_::InstructionShape>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <enum MatrixLayout::Kind kLayout_, typename GemmConfig_>
struct WmmaGemmTileTraitsHelperB {};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct WmmaGemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_>
: public GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> {
/// The base config.
typedef GemmTileTraitsHelperB<MatrixLayout::kRowMajor, GemmConfig_> Base;
/// The skew.
static int const kSkew = 16 / sizeof(typename Base::MultiplyAddScalar);
/// The shared tile size.
typedef Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kD,
GemmConfig_::OutputTile::kH + kSkew>
Tile;
/// WMMA matrix
typedef WmmaMatrix<GemmOperand::kB,
MatrixLayout::kRowMajor,
typename Base::MultiplyAddScalar,
typename GemmConfig_::InstructionShape>
WmmaMatrix;
/// The traits class to build the iterator to store data to shared memory for B^T.
typedef GemmSharedStoreTileAbTraits<
// The pointer.
typename Base::MultiplyAddScalar,
// The tile has size KxN in GEMM's terminology.
Tile,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename Base::GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsB>
SharedStoreTileTraits;
/// The number of elements loaded in one LDG.
static int const kScalarsPerW = GemmConfig_::InstructionShape::kH * GemmConfig_::Warps::kH;
/// The number of scalars loaded per iteration.
static int const kScalarsPerIteration = Tile::kW * GemmConfig_::InstructionShape::kD;
/// The traits class to build the iterator to load from shared memory for B.
typedef WmmaGemmSharedLoadTileBTraits<
// The layout of the matrix.
MatrixLayout::kRowMajor,
// The pointer.
typename Base::MultiplyAddScalar,
// The output tile size.
Tile,
// The number of warps.
typename GemmConfig_::Warps,
// The strides between warps.
GemmConfig_::InstructionShape::kH,
// The number of iterations to load the data.
Shape<1, 1, GemmConfig_::OutputTile::kH / kScalarsPerW>,
// The stride between iterations.
Shape<kScalarsPerIteration, 0, kScalarsPerW, 0>,
// The shape of the instruction.
typename GemmConfig_::InstructionShape>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename GemmConfig_>
struct WmmaGemmTileTraitsHelperB<MatrixLayout::kColumnMajor, GemmConfig_> {
/// The layout.
static MatrixLayout::Kind const kLayout = MatrixLayout::kColumnMajor;
/// The input scalar.
typedef typename GemmConfig_::ScalarB Scalar;
/// The scalar stored in shared memory.
typedef typename GemmConfig_::MultiplyAdd::ScalarB MultiplyAddScalar;
/// WMMA matrix
typedef WmmaMatrix<GemmOperand::kB,
MatrixLayout::kColumnMajor,
MultiplyAddScalar,
typename GemmConfig_::InstructionShape>
WmmaMatrix;
/// The traits class to build the iterator to load data from global memory for B^N.
typedef GemmGlobalTileTraits<
// That's B.
GemmOperand::kB,
// B is column-major.
MatrixLayout::kColumnMajor,
// The pointer is float const.
Scalar const,
// The tile has size KxN in GEMM's terminology.
Shape<1, GemmConfig_::OutputTile::kH, GemmConfig_::OutputTile::kD>,
// The threads are distributed as warps x 32 (the traits may reorganize).
Shape<1, GemmConfig_::kThreads / GemmConfig_::OutputTile::kD, GemmConfig_::OutputTile::kD>,
// The number of scalars per LDG (LDG.32 or LDG.128, etc).
GemmConfig_::kScalarsPerLdgB>
GlobalTileTraits;
/// The skew.
static int const kSkew = 16 / sizeof(MultiplyAddScalar);
/// The tile.
typedef Shape<GemmConfig_::kStages,
GemmConfig_::OutputTile::kH,
GemmConfig_::OutputTile::kD + kSkew>
Tile;
/// The traits class to build the iterator to store data to shared memory for B^N.
typedef GemmSharedStoreTileAbTraits<
// The pointer.
MultiplyAddScalar,
// The tile has size KxN in GEMM's terminology.
Tile,
// The threads are distributed as warps x 32 (the traits may reorganize).
typename GlobalTileTraits::Threads,
// The number of scalars per STS (STS.32 or STS.128, etc).
GemmConfig_::kScalarsPerStsB>
SharedStoreTileTraits;
/// The number of elements loaded in one LDG.
static int const kScalarsPerW = GemmConfig_::InstructionShape::kH * GemmConfig_::Warps::kH;
/// The traits class to build the iterator to load from shared memory for B.
typedef WmmaGemmSharedLoadTileBTraits<
// The layout of the matrix.
MatrixLayout::kColumnMajor,
// The pointer.
MultiplyAddScalar,
// The tile in shared memory.
Tile,
// The number of warps.
typename GemmConfig_::Warps,
// The strides between warps.
GemmConfig_::InstructionShape::kH * Tile::kW,
// The number of iterations to load the data.
Shape<1, 1, GemmConfig_::OutputTile::kH / kScalarsPerW>,
// The stride between iterations.
Shape<GemmConfig_::InstructionShape::kD, 0, kScalarsPerW * Tile::kW>,
// The shape of the instruction.
typename GemmConfig_::InstructionShape>
SharedLoadTileTraits;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The output tile.
typename OutputTile_,
/// The output type.
typename ScalarC_,
/// The accumulator type.
typename Accumulator_,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_,
/// The number of accumulators per warp.
typename AccumulatorsPerWarp_,
/// The shape of the WMMA instruction.
typename InstructionShape_,
/// The number of halfs loaded in one LDG for A.
int kScalarsPerLdgA_,
/// The number of halfs loaded in one LDG for B.
int kScalarsPerLdgB_,
/// The index.
typename Index_>
struct WmmaGemmTraitsHelper {
/// The WMMA GEMM config.
typedef WmmaGemmConfig<kLayoutA_,
kLayoutB_,
OutputTile_,
ScalarC_,
Accumulator_,
AccumulatorsPerWarp_,
InstructionShape_,
kScalarsPerLdgA_,
kScalarsPerLdgB_>
GemmConfig;
/// The GEMM config for A.
typedef WmmaGemmTileTraitsHelperA<kLayoutA_, GemmConfig> GemmTileTraitsHelperA;
/// The GEMM config for B.
typedef WmmaGemmTileTraitsHelperB<kLayoutB_, GemmConfig> GemmTileTraitsHelperB;
/// The iterator to load A from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperA::GlobalTileTraits, Index_>
GlobalLoadIteratorA;
/// The default transformer for A.
typedef Copy<typename GlobalLoadIteratorA::Fragment> GlobalTransformerA;
/// The iterator to store A to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperA::SharedStoreTileTraits,
typename GemmTileTraitsHelperA::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorA;
/// The stream to load A from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorA, SharedStoreIteratorA, GlobalTransformerA>
GlobalLoadStreamA;
/// The iterator to load B from global memory.
typedef GemmGlobalIteratorAb<typename GemmTileTraitsHelperB::GlobalTileTraits, Index_>
GlobalLoadIteratorB;
/// The default transformer for B.
typedef Copy<typename GlobalLoadIteratorB::Fragment> GlobalTransformerB;
/// The iterator to store B to shared memory.
typedef TileStoreIterator<typename GemmTileTraitsHelperB::SharedStoreTileTraits,
typename GemmTileTraitsHelperB::SharedStoreTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared>
SharedStoreIteratorB;
/// The stream to load B from global memory to shared memory.
typedef GlobalLoadStream<GlobalLoadIteratorB, SharedStoreIteratorB, GlobalTransformerB>
GlobalLoadStreamB;
/// The iterator to load A from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperA::SharedLoadTileTraits,
typename GemmTileTraitsHelperA::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared,
Index_,
typename GemmTileTraitsHelperA::WmmaMatrix,
IteratorFragment::kWmmaMatrix>
SharedLoadIteratorA;
/// The stream to load A from shared memory.
typedef SharedLoadStream<SharedLoadIteratorA> SharedLoadStreamA;
/// The iterator to load B from shared memory.
typedef TileLoadIterator<typename GemmTileTraitsHelperB::SharedLoadTileTraits,
typename GemmTileTraitsHelperB::SharedLoadTileTraits::Scalar,
IteratorAdvance::kH,
MemorySpace::kShared,
Index_,
typename GemmTileTraitsHelperB::WmmaMatrix,
IteratorFragment::kWmmaMatrix>
SharedLoadIteratorB;
/// The stream to load B from shared memory.
typedef SharedLoadStream<SharedLoadIteratorB> SharedLoadStreamB;
/// The functor to do the multiply-add in the main loop.
typedef typename GemmConfig::MultiplyAdd MultiplyAdd;
/// The object to clear accumulators.
typedef ClearAccumulators<typename MultiplyAdd::ScalarC> ClearAccumulators;
/// The helper to create the epilogue traits.
typedef WmmaGemmEpilogueTraitsHelper<GemmConfig, EpilogueFunctor_, Index_> EpilogueTraitsHelper;
/// The traits class for the epilogue.
typedef SimplifiedGemmEpilogueTraits<GemmConfig, EpilogueFunctor_, Index_, EpilogueTraitsHelper>
GemmEpilogueTraits;
/// The epilogue.
typedef GemmEpilogue<GemmEpilogueTraits> Epilogue;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename OutputTile_, typename DefaultShape_ = Shape<64, 32, 64> >
struct WmmaGemmAccumulatorsPerWarp {
typedef typename ShapeMin<OutputTile_, DefaultShape_>::Shape Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <
/// The layout for A.
MatrixLayout::Kind kLayoutA_,
/// The layout for B.
MatrixLayout::Kind kLayoutB_,
/// The tile size for the GEMM KxNxM.
typename OutputTile_ = Shape<64, 128, 128>,
/// The output type.
typename ScalarC_ = float,
/// The functor to do the math in the epilogue.
typename EpilogueFunctor_ = LinearScaling<ScalarC_>,
/// The accumulator type.
typename Accumulator_ = ScalarC_,
/// The number of accumulators per warp.
typename AccumulatorsPerWarp_ = typename WmmaGemmAccumulatorsPerWarp<OutputTile_>::Shape,
/// The shape of the WMMA instruction.
typename InstructionShape_ = Shape<16, 16, 16>,
/// The number of scalars per LDG for A.
int kScalarsPerLdgA_ = 8,
/// The number of scalars per LDG for B.
int kScalarsPerLdgB_ = 8,
/// The index.
typename Index_ = int,
/// The helper class.
typename Helper_ = WmmaGemmTraitsHelper<kLayoutA_,
kLayoutB_,
OutputTile_,
ScalarC_,
Accumulator_,
EpilogueFunctor_,
AccumulatorsPerWarp_,
InstructionShape_,
kScalarsPerLdgA_,
kScalarsPerLdgB_,
Index_> >
struct WmmaGemmTraits : public GemmTraits<
// The config.
typename Helper_::GemmConfig,
// The stream to load A from global memory to shared memory.
typename Helper_::GlobalLoadStreamA,
// The stream to load B from global memory to shared memory.
typename Helper_::GlobalLoadStreamB,
// The stream to load A from shared memory.
typename Helper_::SharedLoadStreamA,
// The stream to load B from shared memory.
typename Helper_::SharedLoadStreamB,
// The epilogue.
typename Helper_::Epilogue,
// The block swizzle to reorganize the grid.
IdentityBlockSwizzle,
// The index.
Index_,
// The tool used to clear accumulators.
typename Helper_::ClearAccumulators> {};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace gemm
} // namespace cutlass
#endif // defined CUTLASS_USE_WMMA_API
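Everything above funnels into the single WmmaGemmTraits typedef, so user code only has to name the traits and fill in problem parameters. A hedged sketch of that flow, assuming the CUTLASS 1.x Gemm<Traits> device wrapper with its Params::initialize/launch interface and half-precision operands (the header paths, launcher name, and scalar types below are illustrative assumptions, not taken from this file):

#include <cuda_fp16.h>
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/wmma_gemm_traits.h>

// Column-major A, row-major B; every other parameter keeps its default
// (64x128x128 output tile, float accumulators, LinearScaling epilogue).
typedef cutlass::gemm::WmmaGemmTraits<cutlass::MatrixLayout::kColumnMajor,
                                      cutlass::MatrixLayout::kRowMajor>
    WmmaTraits;
typedef cutlass::gemm::Gemm<WmmaTraits> WmmaGemm;

// Hypothetical host-side launcher: build Params, then launch the kernel.
cudaError_t run_wmma_gemm(int m, int n, int k,
                          half const *A, int lda,
                          half const *B, int ldb,
                          float *C, int ldc) {
  typename WmmaGemm::Params params;
  if (params.initialize(m, n, k, 1.0f, A, lda, B, ldb, 0.0f, C, ldc, C, ldc)) {
    return cudaErrorInvalidValue;  // the problem shape was rejected
  }
  WmmaGemm::launch(params);
  return cudaGetLastError();
}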

View File

@@ -1,325 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Free functions for loading and storing to implementations of tile iterator concepts.
*/
#pragma once
#include <cutlass/fragment_load_store.h>
#include <cutlass/load_store.h>
#include <cutlass/predicate_vector.h>
#include <cutlass/shape.h>
namespace cutlass {
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Loads a fragment from an input iterator
template <typename InputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_load(InputIterator &iterator, Fragment &fragment) {
typename InputIterator::FragmentIterator frag_iterator(fragment);
for (int d = 0; d < InputIterator::Iterations::kD; ++d) {
for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
if (iterator.valid(d, h, w, c)) {
int const offset =
ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
0, 0, w, c);
Load<typename Fragment::Element, InputIterator::Tile::kC, InputIterator::kMemorySpace>::
load(reinterpret_cast<typename InputIterator::AccessType &>(
frag_iterator.at(d, h, w, c)),
iterator.data(),
offset);
}
}
if (w < InputIterator::Iterations::kW - 1) {
iterator.inc_w();
}
}
if (h < InputIterator::Iterations::kH - 1) {
iterator.inc_h();
}
}
if (d < InputIterator::Iterations::kD - 1) {
iterator.inc_d();
}
}
iterator.inc_advance();
}
/// Loads a fragment from a shared memory input iterator
template <typename InputIterator, typename Fragment>
CUTLASS_DEVICE void shared_iterator_load(InputIterator &iterator, Fragment &fragment) {
typename InputIterator::FragmentIterator frag_iterator(fragment);
for (int d = 0; d < InputIterator::Iterations::kD; ++d) {
for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
int const offset =
ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
d, h, w, c);
FragmentLoad<InputIterator::kIteratorFragment,
InputIterator::Tile::kC,
typename InputIterator::Scalar,
InputIterator::kMemorySpace,
typename InputIterator::FragmentElement,
InputIterator::Tile::kW>::load(frag_iterator.at(d, h, w, c),
iterator.data(),
offset);
}
}
}
}
}
/// Loads a fragment from a shared memory input iterator
template <typename InputIterator, typename Fragment>
CUTLASS_DEVICE void shared_iterator_load(InputIterator &iterator, Fragment &fragment, int d) {
typename InputIterator::FragmentIterator frag_iterator(fragment);
for (int h = 0; h < InputIterator::Iterations::kH; ++h) {
for (int w = 0; w < InputIterator::Iterations::kW; ++w) {
for (int c = 0; c < InputIterator::Iterations::kC; ++c) {
int const offset =
ComputeOffsetFromStrides<typename InputIterator::ImmediateOffsetStrides>::get(
d, h, w, c);
FragmentLoad<InputIterator::kIteratorFragment,
InputIterator::Tile::kC,
typename InputIterator::Scalar,
InputIterator::kMemorySpace,
typename InputIterator::FragmentElement,
InputIterator::Tile::kW>::load(frag_iterator.at(0, h, w, c),
iterator.data(),
offset);
}
}
}
}
/// Loads a fragment from an input iterator, masked by a predicate iterator
template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
Fragment &fragment,
typename InputIterator::Index offset,
ConstPredicateAdapter predicate_adapter) {
for (int d = 0; d < InputIterator::Iterations::kD; ++d, iterator.inc_d()) {
for (int h = 0; h < InputIterator::Iterations::kH; ++h, iterator.inc_h()) {
for (int w = 0; w < InputIterator::Iterations::kW; ++w, iterator.inc_w()) {
if (predicate_adapter.at(d, h, w, 0)) {
int idx = InputIterator::Tile::kC *
(w + InputIterator::Iterations::kW * (h + InputIterator::Iterations::kH * d));
Load<typename Fragment::Element, InputIterator::Tile::kC, InputIterator::kMemorySpace>::
load(reinterpret_cast<typename InputIterator::AccessType &>(fragment[idx]),
iterator.data(),
offset);
}
}
}
}
}
/// Loads a fragment from an input iterator
template <typename InputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
Fragment &fragment,
typename InputIterator::Index offset = 0) {
TrivialPredicateTileAdapter pred;
iterator_load_post_increment(iterator, fragment, offset, pred);
}
/// Loads a fragment from an input iterator
template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_load_post_increment(InputIterator &iterator,
Fragment &fragment,
ConstPredicateAdapter pred_it) {
iterator_load_post_increment(iterator, fragment, 0, pred_it);
}
template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &_iterator,
Fragment &fragment,
typename InputIterator::Index offset,
ConstPredicateAdapter predicate_adapter) {
InputIterator iterator(_iterator);
iterator_load_post_increment(iterator, fragment, offset, predicate_adapter);
}
/// Loads a fragment from an input iterator
template <typename InputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &iterator,
Fragment &fragment,
typename InputIterator::Index offset = 0) {
TrivialPredicateTileAdapter pred;
iterator_load(iterator, fragment, offset, pred);
}
/// Loads a fragment from an input iterator
template <typename InputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_load(InputIterator const &iterator,
Fragment &fragment,
ConstPredicateAdapter pred_it) {
iterator_load(iterator, fragment, 0, pred_it);
}
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Stores a fragment to an output iterator
template <typename OutputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_store(OutputIterator &iterator, Fragment &fragment) {
typename OutputIterator::FragmentIterator frag_iterator(fragment);
for (int d = 0; d < OutputIterator::Iterations::kD; ++d) {
for (int h = 0; h < OutputIterator::Iterations::kH; ++h) {
for (int w = 0; w < OutputIterator::Iterations::kW; ++w) {
if (iterator.valid(d, h, w, 0)) {
int const offset =
ComputeOffsetFromStrides<typename OutputIterator::ImmediateOffsetStrides>::get(
d, h, w, 0);
Store<typename Fragment::Element,
OutputIterator::Tile::kC,
OutputIterator::kMemorySpace>::
store(reinterpret_cast<typename OutputIterator::AccessType &>(
frag_iterator.at(d, h, w, 0)),
iterator.data(),
offset);
}
if (w < OutputIterator::Iterations::kW - 1) {
iterator.inc_w();
}
}
if (h < OutputIterator::Iterations::kH - 1) {
iterator.inc_h();
}
}
if (d < OutputIterator::Iterations::kD - 1) {
iterator.inc_d();
}
}
iterator.inc_advance();
}
/// Stores a fragment to a shared memory output iterator
template <typename OutputIterator, typename Fragment>
CUTLASS_DEVICE void shared_iterator_store(OutputIterator &iterator, Fragment const &fragment) {
typename OutputIterator::FragmentConstIterator frag_iterator(fragment);
for (int d = 0; d < OutputIterator::Iterations::kD; ++d) {
for (int h = 0; h < OutputIterator::Iterations::kH; ++h) {
for (int w = 0; w < OutputIterator::Iterations::kW; ++w) {
for (int c = 0; c < OutputIterator::Iterations::kC; ++c) {
int const offset =
ComputeOffsetFromStrides<typename OutputIterator::ImmediateOffsetStrides>::get(
d, h, w, c);
FragmentStore<OutputIterator::kIteratorFragment,
OutputIterator::Tile::kC,
typename OutputIterator::Scalar,
OutputIterator::kMemorySpace,
typename OutputIterator::FragmentElement,
OutputIterator::Tile::kW>::store(frag_iterator.at(d, h, w, c),
iterator.data(),
offset);
}
}
}
}
}
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Stores a fragment to an output iterator, masked by a predicate iterator
template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
Fragment const &fragment,
typename OutputIterator::Index offset,
ConstPredicateAdapter predicate_adapter) {
for (int d = 0; d < OutputIterator::Iterations::kD; ++d, iterator.inc_d()) {
for (int h = 0; h < OutputIterator::Iterations::kH; ++h, iterator.inc_h()) {
for (int w = 0; w < OutputIterator::Iterations::kW; ++w, iterator.inc_w()) {
if (predicate_adapter.at(d, h, w, 0)) {
int idx = OutputIterator::Tile::kC *
(w + OutputIterator::Iterations::kW * (h + OutputIterator::Iterations::kH * d));
Store<typename Fragment::Element,
OutputIterator::Tile::kC,
OutputIterator::kMemorySpace>::
store(reinterpret_cast<typename OutputIterator::AccessType const &>(fragment[idx]),
iterator.data(),
offset);
}
}
}
}
}
/// Stores a fragment to an output iterator
template <typename OutputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
Fragment const &fragment,
typename OutputIterator::Index offset = 0) {
TrivialPredicateTileAdapter pred;
iterator_store_post_increment(iterator, fragment, offset, pred);
}
/// Stores a fragment to an output iterator
template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_store_post_increment(OutputIterator &iterator,
Fragment const &fragment,
ConstPredicateAdapter pred_it) {
iterator_store_post_increment(iterator, fragment, 0, pred_it);
}
/// Stores a fragment to an output iterator, masked by a predicate iterator
template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &_iterator,
Fragment const &fragment,
typename OutputIterator::Index offset,
ConstPredicateAdapter predicate_adapter) {
OutputIterator iterator(_iterator);
iterator_store_post_increment(iterator, fragment, offset, predicate_adapter);
}
/// Stores a fragment to an output iterator
template <typename OutputIterator, typename Fragment>
CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &iterator,
Fragment const &fragment,
typename OutputIterator::Index offset = 0) {
TrivialPredicateTileAdapter pred;
iterator_store(iterator, fragment, offset, pred);
}
/// Stores a fragment to an output iterator
template <typename OutputIterator, typename Fragment, typename ConstPredicateAdapter>
CUTLASS_HOST_DEVICE void iterator_store(OutputIterator const &iterator,
Fragment const &fragment,
ConstPredicateAdapter pred_it) {
iterator_store(iterator, fragment, 0, pred_it);
}
///////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
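Note the indexing convention the post-increment variants use: the fragment is addressed with a flattened index in which C varies fastest, then W, then H, then D. A standalone sketch of just that arithmetic with made-up iteration extents (plain C++, mirroring the idx expression above rather than any CUTLASS API):

#include <cassert>

// Mirror of the fragment indexing in iterator_load_post_increment and
// iterator_store_post_increment: idx = kC * (w + kW * (h + kH * d)).
enum { kD = 2, kH = 2, kW = 4, kC = 4 };

int fragment_index(int d, int h, int w) {
  return kC * (w + kW * (h + kH * d));
}

int main() {
  // Each (d, h, w) access owns a contiguous run of kC scalars.
  assert(fragment_index(0, 0, 0) == 0);
  assert(fragment_index(0, 0, 1) == 4);   // one W step: kC scalars
  assert(fragment_index(0, 1, 0) == 16);  // one H step: kW * kC scalars
  assert(fragment_index(1, 0, 0) == 32);  // one D step: kH * kW * kC scalars
  return 0;
}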

View File

@@ -1,199 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines abstractions for efficiently loading and storing vectors to memory.
*/
#pragma once
#include <cutlass/vector.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Enum to specify which memory space data resides in.
*/
struct MemorySpace {
enum Kind {
kGeneric, // Data accessed through pointer dereferencing
kShared, // Data resides in shared memory
kGlobal // Data resides in global memory
};
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
int Lanes_,
MemorySpace::Kind Memory_,
bool = (Lanes_ > 1),
size_t = (sizeof(Scalar_) * Lanes_)>
struct Load {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
dst = reinterpret_cast<AccessType const*>(&pointer[offset])[0];
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Load<Scalar_, Lanes_, Memory_, true, 4> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
dst.registers[0] = reinterpret_cast<uint32_t const*>(&pointer[offset])[0];
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Load<Scalar_, Lanes_, Memory_, true, 8> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
uint2 tmp = reinterpret_cast<uint2 const*>(&pointer[offset])[0];
dst.registers[0] = tmp.x;
dst.registers[1] = tmp.y;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MemorySpace::Kind Memory_>
struct Load<double, 2, Memory_, true, 16> {
/// The output type.
typedef typename Vectorize<double, 2>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& dst, double const* pointer, int offset) {
double2 tmp = reinterpret_cast<double2 const*>(&pointer[offset])[0];
dst[0] = tmp.x;
dst[1] = tmp.y;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Load<Scalar_, Lanes_, Memory_, true, 16> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The load function.
static CUTLASS_DEVICE void load(AccessType& dst, Scalar_ const* pointer, int offset) {
uint4 tmp = reinterpret_cast<uint4 const*>(&pointer[offset])[0];
dst.registers[0] = tmp.x;
dst.registers[1] = tmp.y;
dst.registers[2] = tmp.z;
dst.registers[3] = tmp.w;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_,
int Lanes_,
MemorySpace::Kind Memory_,
bool = (Lanes_ > 1),
size_t = (sizeof(Scalar_) * Lanes_)>
struct Store {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
pointer[offset] = src;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Store<Scalar_, Lanes_, Memory_, true, 4> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
uint32_t* addr = reinterpret_cast<uint32_t*>(&pointer[offset]);
addr[0] = src.registers[0];
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Store<Scalar_, Lanes_, Memory_, true, 8> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
uint2* addr = reinterpret_cast<uint2*>(&pointer[offset]);
addr[0] = make_uint2(src.registers[0], src.registers[1]);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <MemorySpace::Kind Memory_>
struct Store<double, 2, Memory_, true, 16> {
/// The output type.
typedef typename Vectorize<double, 2>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& src, double* pointer, int offset) {
double2* addr = reinterpret_cast<double2*>(&pointer[offset]);
addr[0] = make_double2(src[0], src[1]);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int Lanes_, MemorySpace::Kind Memory_>
struct Store<Scalar_, Lanes_, Memory_, true, 16> {
/// The output type.
typedef typename Vectorize<Scalar_, Lanes_>::Type AccessType;
/// The store function.
static CUTLASS_DEVICE void store(AccessType const& src, Scalar_* pointer, int offset) {
uint4* addr = reinterpret_cast<uint4*>(&pointer[offset]);
addr[0] = make_uint4(src.registers[0], src.registers[1], src.registers[2], src.registers[3]);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
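The size-specialized variants exist so that a multi-lane access compiles to one wide transaction (a 32-, 64-, or 128-bit load or store) instead of per-scalar traffic. A standalone CUDA sketch of the 16-byte case, reinterpreting through uint4 exactly as the specializations above do (assumes 16-byte-aligned pointers and a count divisible by four):

#include <cuda_runtime.h>

// Copies count floats, moving 16 bytes per iteration the way the 16-byte
// Load/Store specializations access memory.
__global__ void copy_vectorized(float const *src, float *dst, int count) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int vec_count = count / 4;
  for (int i = idx; i < vec_count; i += gridDim.x * blockDim.x) {
    uint4 v = reinterpret_cast<uint4 const *>(src)[i];  // one 128-bit load
    reinterpret_cast<uint4 *>(dst)[i] = v;              // one 128-bit store
  }
}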

View File

@@ -1,48 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines properties of matrices used to denote layout and operands to GEMM kernels.
*/
#pragma once
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Describes layouts of matrices
struct MatrixLayout {
enum Kind { kRowMajor, kColumnMajor };
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Gemm operand - D = A * B + C
struct GemmOperand {
enum Kind { kA, kB, kC, kD };
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
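These two enums are the whole layout vocabulary of the kernels: an operand is row- or column-major, and its leading dimension follows directly from that choice. A small hypothetical helper making the convention concrete (not part of any CUTLASS header):

// Leading dimension of an m-by-n matrix: the distance in elements between
// consecutive columns (column-major) or consecutive rows (row-major).
inline int leading_dimension(cutlass::MatrixLayout::Kind layout, int m, int n) {
  return (layout == cutlass::MatrixLayout::kColumnMajor) ? m : n;
}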

View File

@@ -1,493 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines container classes and iterators for managing a statically sized vector
of boolean predicates.
*/
#pragma once
#include <stdint.h>
#include <cutlass/cutlass.h>
#include <cutlass/shape.h>
#include <cutlass/util/platform.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup predicate_vector_concept Predicate Vector Concept
@{
Implementations of \ref predicate_vector_concept contain an ordered set of boolean predicates which
may be used as conditionals in other device-side operations. Both random access and iterators
offering sequential access are provided.
@par Predicate Vector
A \ref predicate_vector_concept satisfies the following expressions
- <b>at(int idx)</b> - returns the value of the indexed predicate
- <b>set(int idx, bool value)</b> - sets the value of the indexed predicate
- <b>begin()</b> - returns a \ref predicate_iterator_concept pointing to the first predicate
@}
*/
////////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup predicate_iterator_concept Predicate Iterator Concept
@{
Implementations of \ref predicate_iterator_concept enable accessing and traversing elements of a
bit vector.
@par Const Predicate Iterator
A const \ref predicate_iterator_concept satisfies the following expressions
- <b>++it</b> increments the iterator to the next predicate
- <b>*it</b> returns the value of the currently pointed-to predicate
@par Mutable Predicate Iterator
A \ref predicate_iterator_concept that is non-const <b>also</b> satisfies the following expressions
- <b>it.set(bool value)</b> sets the value of the currently pointed-to predicate
@}
*/
////////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup predicate_tile_adapter Predicate Tile Adapter Concept
@{
Implementations of \ref predicate_tile_adapter provide a mapping between the elements of a \ref
tile_traits_concept and a \ref predicate_vector_concept.
@par Predicate Tile Adapter
A \ref predicate_tile_adapter satisfies the following expressions
- <b>at(int d, int h, int w, int c)</b> - returns the value of a predicate corresponding to the
access (d, h, w, c) within the tile.
@}
*/
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Statically sized array of bits implementing @concept{predicate_vector_concept}.
template <
/// Number of predicates contained in predicate vector
int kPredicates_,
/// Number of predicates contained in each byte of internal storage
int kPredicatesPerByte_ = 4,
/// Location of first predicate within byte of internal storage
int kPredicateStart_ = 0>
struct PredicateVector {
/// Number of bits stored by the PredicateVector
static int const kPredicates = kPredicates_;
/// Number of bits stored within each byte of the predicate bit vector
static int const kPredicatesPerByte = kPredicatesPerByte_;
/// First bit within each byte containing predicates
static int const kPredicateStart = kPredicateStart_;
// Make sure no one tries to put more than 8 bits in a byte :)
static_assert(kPredicatesPerByte <= 8, "kPredicatesPerByte must fit within an actual byte");
// Make sure the "offsetted" bits fit in one byte.
static_assert(kPredicateStart + kPredicatesPerByte <= 8,
"The offsetted predicates must fit within an actual byte.");
/// Storage type of individual elements
typedef uint32_t Storage;
/// Number of bytes needed
static int const kBytes = (kPredicates + kPredicatesPerByte - 1) / kPredicatesPerByte;
/// Number of storage elements needed
static int const kWordCount = (kBytes + sizeof(Storage) - 1) / sizeof(Storage);
private:
//
// Data members
//
/// Words of bit vector
Storage storageData[kWordCount];
//
// Methods
//
/// Computes the word and bit corresponding to a logical predicate index
CUTLASS_HOST_DEVICE void computeStorageOffset(int &word, int &bit, int idx) const {
CUTLASS_ASSERT(idx < kPredicates);
int byte = (idx / kPredicatesPerByte);
int bit_offset = (idx % kPredicatesPerByte);
word = byte / sizeof(Storage);
int byte_offset = (byte % sizeof(Storage));
bit = byte_offset * 8 + bit_offset + kPredicateStart;
}
/// Accesses a given word with optional assertions
CUTLASS_HOST_DEVICE Storage &storage(int word) {
CUTLASS_ASSERT(word < kWordCount);
return storageData[word];
}
/// Accesses a given word with optional assertions
CUTLASS_HOST_DEVICE Storage const &storage(int word) const {
CUTLASS_ASSERT(word < kWordCount);
return storageData[word];
}
public:
//
// Iterator
//
/**
* @brief A const iterator implementing \ref predicate_iterator_concept enabling sequential
* read-only access to predicates.
* @concept{predicate_iterator_concept}
*/
class ConstIterator {
/// Reference to PredicateVector instance
PredicateVector const &vec_;
/// Index into PredicateVector
int bit_;
public:
/// Copy constructor
CUTLASS_HOST_DEVICE
ConstIterator(ConstIterator const &it) : vec_(it.vec_), bit_(it.bit_) {}
/// Constructs a ConstIterator from a PredicateVector
CUTLASS_HOST_DEVICE
ConstIterator(PredicateVector const &_vec, int _start = 0) : vec_(_vec), bit_(_start) {}
/// Pre-increment
CUTLASS_HOST_DEVICE
ConstIterator &operator++() {
++bit_;
return *this;
}
/// Pre-decrement
CUTLASS_HOST_DEVICE
ConstIterator &operator--() {
--bit_;
return *this;
}
/// Post-increment
CUTLASS_HOST_DEVICE
ConstIterator operator++(int) {
ConstIterator ret(*this);
++bit_;
return ret;
}
/// Post-decrement
CUTLASS_HOST_DEVICE
ConstIterator operator--(int) {
ConstIterator ret(*this);
--bit_;
return ret;
}
/// Returns true if iterators point to the same bit
CUTLASS_HOST_DEVICE
bool operator==(ConstIterator const &it) const { return bit_ == it.bit_; }
/// Returns false if iterators point to the same bit
CUTLASS_HOST_DEVICE
bool operator!=(ConstIterator const &it) const { return bit_ != it.bit_; }
/// Dereferences iterator
CUTLASS_HOST_DEVICE
bool operator*() const { return vec_[bit_]; }
};
/**
* @brief An iterator implementing \ref predicate_iterator_concept enabling sequential
* read and write access to predicates.
* @concept{predicate_iterator_concept}
*/
class Iterator {
/// Reference to PredicateVector instance
PredicateVector &vec_;
/// Index into PredicateVector
int bit_;
public:
/// Copy constructor
CUTLASS_HOST_DEVICE
Iterator(Iterator const &it) : vec_(it.vec_), bit_(it.bit_) {}
/// Constructs an iterator from a PredicateVector
CUTLASS_HOST_DEVICE
Iterator(PredicateVector &_vec, int _start = 0) : vec_(_vec), bit_(_start) {}
/// Pre-increment
CUTLASS_HOST_DEVICE
Iterator &operator++() {
++bit_;
return *this;
}
/// Pre-decrement
CUTLASS_HOST_DEVICE
Iterator &operator--() {
--bit_;
return *this;
}
/// Post-increment
CUTLASS_HOST_DEVICE
Iterator operator++(int) {
Iterator ret(*this);
++bit_;
return ret;
}
/// Post-decrement
CUTLASS_HOST_DEVICE
Iterator operator--(int) {
Iterator ret(*this);
--bit_;
return ret;
}
/// Returns true if iterators point to the same bit
CUTLASS_HOST_DEVICE
bool operator==(Iterator const &it) const { return bit_ == it.bit_; }
/// Returns false if iterators point to the same bit
CUTLASS_HOST_DEVICE
bool operator!=(Iterator const &it) const { return bit_ != it.bit_; }
/// Gets the bit at the pointed to location
CUTLASS_HOST_DEVICE
bool get() { return vec_[bit_]; }
/// Dereferences iterator
CUTLASS_HOST_DEVICE
bool operator*() const { return vec_[bit_]; }
/// Sets the bit at the pointed to location
CUTLASS_HOST_DEVICE
void set(bool value = true) { vec_.set(bit_, value); }
};
/// Iterator that always returns true
struct TrivialIterator {
/// Constructor
CUTLASS_HOST_DEVICE
TrivialIterator() {}
/// Copy constructor
CUTLASS_HOST_DEVICE
TrivialIterator(Iterator const &it) {}
/// Constructs an iterator from a PredicateVector
CUTLASS_HOST_DEVICE
TrivialIterator(PredicateVector const &_vec) {}
/// Pre-increment
CUTLASS_HOST_DEVICE
TrivialIterator &operator++() { return *this; }
/// Post-increment
CUTLASS_HOST_DEVICE
TrivialIterator operator++(int) { return *this; }
/// Dereferences iterator
CUTLASS_HOST_DEVICE
bool operator*() const { return true; }
};
public:
//
// Methods
//
/// Initialize the predicate vector
CUTLASS_HOST_DEVICE PredicateVector(bool value = true) { fill(value); }
/// Fills all predicates with a given value
CUTLASS_HOST_DEVICE void fill(bool value = true) {
Storage item = (value ? ~Storage(0) : Storage(0));
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < kWordCount; ++i) {
storage(i) = item;
}
}
/// Accesses a bit within the predicate vector.
CUTLASS_HOST_DEVICE bool operator[](int idx) const { return at(idx); }
/// Accesses a bit within the predicate vector.
CUTLASS_HOST_DEVICE bool at(int idx) const {
int bit, word;
computeStorageOffset(word, bit, idx);
return ((storage(word) >> bit) & 1);
}
/// Set a bit within the predicate vector.
CUTLASS_HOST_DEVICE void set(int idx, bool value = true) {
int bit, word;
computeStorageOffset(word, bit, idx);
Storage disable_mask = (~(Storage(1) << bit));
Storage enable_mask = (Storage(value) << bit);
storage(word) = ((storage(word) & disable_mask) | enable_mask);
}
/// Computes the intersection of two identical predicate vectors.
CUTLASS_HOST_DEVICE PredicateVector &operator&=(PredicateVector const &predicates) {
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < kWordCount; ++i) {
storage(i) = (storage(i) & predicates.storage(i));
}
return *this;
}
/// Computes the union of two identical predicate vectors.
CUTLASS_HOST_DEVICE PredicateVector &operator|=(PredicateVector const &predicates) {
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < kWordCount; ++i) {
storage(i) = (storage(i) | predicates.storage(i));
}
return *this;
}
/// Returns true if entire predicate array is zero.
CUTLASS_HOST_DEVICE bool is_zero() const {
Storage mask(0);
for (int byte = 0; byte < sizeof(Storage); ++byte) {
Storage byte_mask = (((1 << kPredicatesPerByte) - 1) << kPredicateStart);
mask |= (byte_mask << (byte * 8));
}
uint32_t result = 0;
for (int word = 0; word < kWordCount; ++word) {
result |= (storage(word) & mask);
}
return result == 0;
}
/// Returns an iterator to the start of the bit vector
CUTLASS_DEVICE
Iterator begin() { return Iterator(*this); }
/// Returns an iterator
CUTLASS_DEVICE
Iterator end() { return Iterator(*this, kPredicates); }
/// Returns a ConstIterator
CUTLASS_DEVICE
ConstIterator const_begin() const { return ConstIterator(*this); }
/// Returns a ConstIterator
CUTLASS_DEVICE
ConstIterator const_end() const { return ConstIterator(*this, kPredicates); }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Always returns true predicate.
struct TrivialPredicateTileAdapter {
/// Ctor.
CUTLASS_HOST_DEVICE TrivialPredicateTileAdapter() {}
/// The value at location (d, h, w, c).
CUTLASS_HOST_DEVICE bool at(int, int, int, int) const { return true; }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to enable random access to predicates via logical coordinate within a tile.
template <typename PredicateVector_, typename Iterations_>
struct PredicateTileAdapter {
/// The vector of predicates.
typedef PredicateVector_ PredicateVector;
/// The iterations.
typedef Iterations_ Iterations;
private:
/// The predicates.
PredicateVector &predicates;
public:
/// Ctor.
CUTLASS_DEVICE PredicateTileAdapter(PredicateVector &predicates_) : predicates(predicates_) {}
/// Get the value at location (d, h, w, c).
CUTLASS_DEVICE bool at(int d, int h, int w, int c) const {
int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
return predicates.at(bit);
}
/// Set the value at location (d, h, w, c).
CUTLASS_DEVICE void set(int d, int h, int w, int c, bool value) {
int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
predicates.set(bit, value);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to enable random access to predicates via logical coordinate within a tile.
template <typename PredicateVector_, typename Iterations_>
struct ConstPredicateTileAdapter {
/// The vector of predicates.
typedef PredicateVector_ PredicateVector;
/// The iterations.
typedef Iterations_ Iterations;
private:
/// The predicates.
PredicateVector const &predicates;
public:
/// Ctor.
CUTLASS_DEVICE ConstPredicateTileAdapter(PredicateVector const &predicates_)
: predicates(predicates_) {}
/// Get the value at location (d, h, w, c).
CUTLASS_DEVICE bool at(int d, int h, int w, int c) const {
int const bit = ComputeOffsetFromShape<Iterations>::get(d, h, w, c);
return predicates.at(bit);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
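A host-side sketch of the container in isolation, exercising set/at and the default packing of four predicates into the low nibble of every byte (so indices 0-15 occupy word 0 and 16-23 occupy word 1). It assumes the CUTLASS 1.x headers are on the include path; everything it calls is declared above:

#include <cassert>
#include <cutlass/predicate_vector.h>

int main() {
  // 24 predicates, all initialized to true.
  cutlass::PredicateVector<24> predicates(true);

  predicates.set(9, false);   // lands in byte 2 of word 0, bit 17
  predicates.set(16, false);  // lands in byte 0 of word 1, bit 0

  assert(!predicates.at(9) && !predicates.at(16));
  assert(predicates.at(8) && predicates.at(10));  // neighbors untouched
  assert(!predicates.is_zero());                  // 22 predicates remain set
  return 0;
}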

View File

@@ -1,58 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines a type for restructuring a tile.
*/
#pragma once
#include <cutlass/shape.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
// The following functor reshapes a tile of data. The goal is to have at least kAccessSize elements in
// the inner-most dimension. If the user respects that constraint, there is nothing to be done. If
// that's not the case, this functor will correct that and "extract" the right number of elements
// from the next dimension.
template <typename Tile_, int kAccessSize_, bool = (Tile_::kC < kAccessSize_)>
struct ReshapeTile {
typedef Tile_ Tile;
};
template <typename Tile_, int kAccessSize_>
struct ReshapeTile<Tile_, kAccessSize_, true> {
// Make sure the W dimension of the tile is large enough.
static_assert(Tile_::kW >= kAccessSize_, "The W dimension is too small");
// Make sure the dimension can be divided by the number of scalars.
static_assert(Tile_::kW % kAccessSize_ == 0, "Not supported");
// Collapse the W dimension.
typedef Shape<Tile_::kD, Tile_::kH, Tile_::kW / kAccessSize_, kAccessSize_> Tile;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
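Concretely, with 4-wide accesses a tile whose inner dimension already holds four scalars passes through unchanged, while a 1-channel tile has four elements peeled off its W dimension. A compile-time sketch (the include path cutlass/reshape_tile.h is an assumption about where this header lives):

#include <cutlass/reshape_tile.h>
#include <cutlass/shape.h>

// kC = 4 already satisfies the access size of 4: the tile passes through.
typedef cutlass::ReshapeTile<cutlass::Shape<1, 8, 32, 4>, 4>::Tile Unchanged;
static_assert(Unchanged::kW == 32 && Unchanged::kC == 4, "passed through");

// kC = 1 < 4: W shrinks from 32 to 8 and the access size becomes the new C.
typedef cutlass::ReshapeTile<cutlass::Shape<1, 8, 32, 1>, 4>::Tile Reshaped;
static_assert(Reshaped::kW == 8 && Reshaped::kC == 4, "W folded into C");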

View File

@@ -1,301 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines Shape implementing the Layout concept for representing a 4D hypercube of objects.
*/
#pragma once
#include <cutlass/cutlass.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup layout_concept Layout Concept
* @{
* @par Implementations of \ref layout_concept are used to describe a cube with DxHxW elements and C
scalars per element.
A HxW slice of a cube is called an image and a cube consists of D images.
*
* @par Notations
* Let Layout be an implementation of the \ref layout_concept.
*
* @par Valid Expressions
* - <b>Layout::D</b> specifies the depth of a cube
* - <b>Layout::H</b> specifies the height of a cube
* - <b>Layout::W</b> specifies the width of a cube
* - <b>Layout::C</b> specifies the number of channels of each element in a cube
* - <b>Layout::W_c</b> specifies the number of scalars of each row in one image of a cube.
* - <b>Layout::H_w</b> specifies the number of elements in an image slice.
* - <b>Layout::H_w_c</b> specifies the number of scalars in an image slice.
* - <b>Layout::D_h_w</b> specifies the number of elements in a cube.
* - <b>Layout::D_h_w_c</b> specifies the number of scalars in a cube.
* - <b>Layout::Strides</b> is a \ref layout_concept specifying the strides.
* @}
*/
/**
* @brief A Shape implementing \ref layout_concept describing the dimensions of a cube.
* @concept{layout_concept}
*/
template <int kD_ = 1, int kH_ = 1, int kW_ = 1, int kC_ = 1>
struct Shape {
/// The depth of the cube.
static int const kD = kD_;
/// The height of the cube.
static int const kH = kH_;
/// The width of the cube.
static int const kW = kW_;
/// The number of scalars per element.
static int const kC = kC_;
};
/**
* @brief Compute derived counts of a \ref layout_concept based class
*/
template <typename Shape>
struct ShapeCount {
/// The number of elements per row.
static int const kWc = Shape::kW * Shape::kC;
/// The number of pixels per image.
static int const kHw = Shape::kH * Shape::kW;
/// The number of elements per image.
static int const kHwc = Shape::kH * kWc;
/// The number of pixels per cube.
static int const kDhw = Shape::kD * kHw;
/// The number of elements in the 4D space.
static int const kDhwc = Shape::kD * kHwc;
/// The number of elements in the 4D space.
static int const kCount = kDhwc;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
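// Worked example of the counts above: a 1 x 16 x 64 x 2 tile has
// 64 * 2 = 128 scalars per row, 16 * 64 = 1024 pixels per image, and
// 16 * 64 * 2 = 2048 scalars in the image and (with kD == 1) the whole cube.
static_assert(ShapeCount<Shape<1, 16, 64, 2> >::kWc == 128, "W * C");
static_assert(ShapeCount<Shape<1, 16, 64, 2> >::kHw == 1024, "H * W");
static_assert(ShapeCount<Shape<1, 16, 64, 2> >::kHwc == 2048, "H * W * C");
static_assert(ShapeCount<Shape<1, 16, 64, 2> >::kDhwc == 2048, "D * H * W * C");
////////////////////////////////////////////////////////////////////////////////////////////////////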
template <typename A_, int kScale_>
struct ShapeScale {
typedef Shape<A_::kD * kScale_, A_::kH * kScale_, A_::kW * kScale_, A_::kC * kScale_> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeAdd {
typedef Shape<A_::kD + B_::kD, A_::kH + B_::kH, A_::kW + B_::kW, A_::kC + B_::kC> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeSub {
typedef Shape<A_::kD - B_::kD, A_::kH - B_::kH, A_::kW - B_::kW, A_::kC - B_::kC> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeMul {
typedef Shape<A_::kD * B_::kD, A_::kH * B_::kH, A_::kW * B_::kW, A_::kC * B_::kC> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeDiv {
typedef Shape<A_::kD / B_::kD, A_::kH / B_::kH, A_::kW / B_::kW, A_::kC / B_::kC> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeMax {
typedef Shape<(A_::kD > B_::kD ? A_::kD : B_::kD),
(A_::kH > B_::kH ? A_::kH : B_::kH),
(A_::kW > B_::kW ? A_::kW : B_::kW),
(A_::kC > B_::kC ? A_::kC : B_::kC)>
Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename A_, typename B_>
struct ShapeMin {
typedef Shape<(A_::kD < B_::kD ? A_::kD : B_::kD),
(A_::kH < B_::kH ? A_::kH : B_::kH),
(A_::kW < B_::kW ? A_::kW : B_::kW),
(A_::kC < B_::kC ? A_::kC : B_::kC)>
Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Shape_>
struct ShapeStrides {
typedef Shape<Shape_::kH * Shape_::kW * Shape_::kC, Shape_::kW * Shape_::kC, Shape_::kC, 1> Shape;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
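// Worked example of packed strides: for a 2 x 16 x 64 x 2 cube, stepping d
// moves by H * W * C = 2048 scalars, h by W * C = 128, w by C = 2, and c by 1.
static_assert(ShapeStrides<Shape<2, 16, 64, 2> >::Shape::kD == 2048, "H * W * C");
static_assert(ShapeStrides<Shape<2, 16, 64, 2> >::Shape::kH == 128, "W * C");
static_assert(ShapeStrides<Shape<2, 16, 64, 2> >::Shape::kW == 2, "C");
static_assert(ShapeStrides<Shape<2, 16, 64, 2> >::Shape::kC == 1, "unit stride");
////////////////////////////////////////////////////////////////////////////////////////////////////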
/**
* @brief Compute the offset for the given coordinates in a cube
* @tparam Shape_ A \ref layout_concept describing the dimensions of the cube.
*/
template <typename Shape_>
struct ComputeOffsetFromShape {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
// clang-format off
return d * Shape_::kH * Shape_::kW * Shape_::kC +
h * Shape_::kW * Shape_::kC +
w * Shape_::kC +
c;
// clang-format on
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Compute the offset for the given coordinates in a cube with a depth of 1
* @tparam kSh Elements in the H dimension
* @tparam kSw Elements in the W dimension
* @tparam kSc Separation between two elements in "elements"
*/
template <int kSh_, int kSw_, int kSc_>
struct ComputeOffsetFromShape<Shape<1, kSh_, kSw_, kSc_> > {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
return h * kSw_ * kSc_ + w * kSc_ + c;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Compute the offset for the given coordinates in a cube with one channel and a depth of 1
* @tparam kSh Elements in the H dimension
* @tparam kSw Elements in the W dimension
*/
template <int kSh_, int kSw_>
struct ComputeOffsetFromShape<Shape<1, kSh_, kSw_, 1> > {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) { return h * kSw_ + w; }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Compute the offset for the given coordinates in a cube
* @tparam Strides_ A \ref layout_concept where each dimension of the cube specifies the corresponding stride.
*/
template <typename Strides_>
struct ComputeOffsetFromStrides {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
return d * Strides_::kD + h * Strides_::kH + w * Strides_::kW + c * Strides_::kC;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Compute the offset for the given coordinates in a cube with a depth of 1
* @tparam S_h Stride in the H dimension in scalars
* @tparam S_w Stride in the W dimension in scalars
* @tparam S_c Stride between two scalars.
*/
template <int S_h_, int S_w_, int S_c_>
struct ComputeOffsetFromStrides<Shape<1, S_h_, S_w_, S_c_> > {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) {
return h * S_h_ + w * S_w_ + c * S_c_;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Compute the offset for the given coordinates in a cube with one channel and a depth of 1
* @tparam S_h Stride in the H dimension in scalars
* @tparam S_w Stride in the W dimension in scalars
*/
template <int S_h_, int S_w_>
struct ComputeOffsetFromStrides<Shape<1, S_h_, S_w_, 1> > {
static CUTLASS_DEVICE int get(int d, int h, int w, int c) { return h * S_h_ + w * S_w_; }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Decompose threadIdx.x into the coordinates of a cube whose dimensions are specified by Threads_.
* Afterwards compute the offset of those coordinates using Strides_
* @tparam Threads_ The dimension of the cube the threadIdx.x value is mapped on
* @tparam Strides_ The strides to use when computing the offsets based on the coordinates of the cube.
*/
template <typename Threads_, typename Strides_>
struct ComputeThreadOffsetFromStrides {
static CUTLASS_DEVICE int get() {
// Decompose the thread index.
int c = threadIdx.x % Threads_::kC;
int w = threadIdx.x / Threads_::kC % Threads_::kW;
int h = threadIdx.x / Threads_::kC / Threads_::kW % Threads_::kH;
int d = threadIdx.x / Threads_::kC / Threads_::kW / Threads_::kH;
// Compute the offset.
return d * Strides_::kD + h * Strides_::kH + w * Strides_::kW + c * Strides_::kC;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Specialization for D=1
*/
template <int T_h_, int T_w_, int T_c_, int S_h_, int S_w_, int S_c_>
struct ComputeThreadOffsetFromStrides<Shape<1, T_h_, T_w_, T_c_>, Shape<1, S_h_, S_w_, S_c_> > {
static CUTLASS_DEVICE int get() {
// Decompose the thread index.
int c = threadIdx.x % T_c_;
int w = threadIdx.x / T_c_ % T_w_;
int h = threadIdx.x / T_c_ / T_w_ % T_h_;
// Compute the offset.
return h * S_h_ + w * S_w_ + c * S_c_;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief Specialization for D=1 and C=1
*/
template <int T_h_, int T_w_, int S_h_, int S_w_>
struct ComputeThreadOffsetFromStrides<Shape<1, T_h_, T_w_, 1>, Shape<1, S_h_, S_w_, 1> > {
static CUTLASS_DEVICE int get() {
// Decompose the thread index.
int w = threadIdx.x % T_w_;
int h = threadIdx.x / T_w_;
// Compute the offset.
return h * S_h_ + w * S_w_;
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
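To make the decomposition concrete, here is a standalone host sketch of the D=1, C=1 specialization's arithmetic, with the thread index passed in explicitly instead of read from threadIdx.x (plain C++, not CUTLASS API):

#include <cassert>

// Mirrors ComputeThreadOffsetFromStrides<Shape<1, T_h, T_w, 1>,
// Shape<1, S_h, S_w, 1> >::get().
int thread_offset(int thread_idx, int T_w, int S_h, int S_w) {
  int w = thread_idx % T_w;  // fastest-varying: position within a row
  int h = thread_idx / T_w;  // row of the thread arrangement
  return h * S_h + w * S_w;
}

int main() {
  // 4 x 32 threads over a tile with 128-scalar rows and unit column stride:
  // thread 70 is (h = 2, w = 6), so it starts 2 * 128 + 6 scalars in.
  assert(thread_offset(70, 32, 128, 1) == 262);
  assert(thread_offset(0, 32, 128, 1) == 0);
  return 0;
}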

View File

@@ -1,151 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines a structure containing strides, bounds, and a pointer to tensor data.
*/
#pragma once
#include <typeinfo>
#include <cutlass/coord.h>
#include <cutlass/cutlass.h>
#include <cutlass/vector.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Structure modeling a pointer and stride into a tensor
template <typename Storage_, int Rank_>
class TensorRef {
public:
/// Data type of individual access
typedef Storage_ Storage;
/// Rank of tensor
static int const Rank = Rank_;
private:
//
// Data members
//
/// Pointer to storage element
Storage* ptr_;
/// Stride information
Coord<Rank> stride_;
public:
//
// Methods
//
/// Default ctor
CUTLASS_HOST_DEVICE
TensorRef() : ptr_(nullptr) {}
/// Constructs from a pointer, size, and stride
CUTLASS_HOST_DEVICE
TensorRef(Storage* ptr, Coord<Rank> stride) : ptr_(ptr), stride_(stride) {}
/// Updates the pointer, stride, and location within a TensorRef
CUTLASS_HOST_DEVICE
void reset(Storage* ptr = nullptr, Coord<Rank> stride = Coord<Rank>(0)) {
ptr_ = ptr;
stride_ = stride;
}
/// Conversion function
template <typename T>
TensorRef<T, Rank> convert() {
Coord<Rank> converted_stride;
for (int i = 0; i < Rank - 1; ++i) {
converted_stride[i] = stride_[i] * Extent<Storage>::kValue / Extent<T>::kValue;
}
converted_stride[Rank - 1] = stride_[Rank - 1];
return TensorRef<T, Rank>(reinterpret_cast<T*>(ptr_), converted_stride);
}
/// Returns true if the TensorRef may be safely accessed
CUTLASS_HOST_DEVICE
bool good() const { return ptr_ != nullptr; }
/// Returns the pointer to referenced data
CUTLASS_HOST_DEVICE
Storage* data() const { return ptr_; }
/// Returns the stride of the tensor
CUTLASS_HOST_DEVICE
Coord<Rank> const& stride() const { return stride_; }
/// Returns the stride of the tensor in the given dimension
CUTLASS_HOST_DEVICE
int const& stride(int dim) const { return stride_.at(dim); }
/// Returns the maximum stride element as the 'leading dimension'
CUTLASS_HOST_DEVICE
int leading_dim() const { return __NV_STD_MAX(stride_[1], stride_[2]); }
/// Computes the offset of an index from the origin of the tensor
CUTLASS_HOST_DEVICE
long long offset(Coord<Rank> const& coord) const {
return stride_.template dot<long long>(coord);
}
/// Returns a reference to the element at a given Coord
CUTLASS_HOST_DEVICE
Storage& at(Coord<Rank> const& coord) const { return ptr_[offset(coord)]; }
/// Element-wise accessor
Storage& operator[](Coord<Rank> const& coord) const { return at(coord); }
/// Returns a reference to the element at a given Coord
CUTLASS_HOST_DEVICE
Storage& at(int idx) const { return ptr_[idx]; }
/// Element-wise accessor
Storage& operator[](int idx) const { return at(idx); }
/// Adds an offset to the pointer
CUTLASS_HOST_DEVICE
TensorRef& advance(Coord<Rank> const& b) {
ptr_ += offset(b);
return *this;
}
/// Returns a TensorRef offset by a given amount
CUTLASS_HOST_DEVICE
TensorRef operator+(Coord<Rank> const& b) const { return TensorRef(ptr_ + offset(b), stride_); }
/// Returns a TensorRef offset by a given amount
CUTLASS_HOST_DEVICE
TensorRef operator-(Coord<Rank> const& b) const { return TensorRef(ptr_ - offset(b), stride_); }
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
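Since a TensorRef carries only a pointer and strides, offset() is just the dot product of stride and coordinate. A host sketch for a packed rank-4 DHWC tensor (make_Coord is assumed to be the Coord<4>-building helper from cutlass/coord.h):

#include <cassert>
#include <cutlass/coord.h>
#include <cutlass/tensor_ref.h>

int main() {
  // A packed 2 x 3 x 4 x 2 DHWC tensor: each stride is the product of the
  // extents to its right (24, 8, 2, 1).
  float data[2 * 3 * 4 * 2] = {0};
  cutlass::TensorRef<float, 4> ref(data, cutlass::make_Coord(24, 8, 2, 1));

  // offset() is stride . coord: (1, 2, 3, 1) -> 24 + 16 + 6 + 1 = 47.
  cutlass::Coord<4> coord = cutlass::make_Coord(1, 2, 3, 1);
  assert(ref.offset(coord) == 47);

  ref.at(coord) = 1.0f;  // writes the same element through the reference
  assert(data[47] == 1.0f);
  return 0;
}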

View File

@@ -1,172 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines a structure containing strides, bounds, and a pointer to tensor data.
*/
#pragma once
#include <cmath>
#include <cutlass/cutlass.h>
#include <cutlass/tensor_ref.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Host-side reference implementation of tensor operations
template <typename T>
class TensorView : public TensorRef<T, 4> {
public:
/// Reference and stride
typedef TensorRef<T, 4> Base;
/// Reference and stride
typedef Base TensorRef_t;
/// Reference to constant type
typedef TensorRef<T const, 4> ConstTensorRef_t;
/// Rank of tensor
static int const Rank = TensorRef_t::Rank;
/// Type used to compute the offset of an element to the base of a tensor
typedef int Offset_t;
/// Coordinate into tensor
typedef Coord<Rank> Coord_t;
private:
//
// Data members
//
/// Dimensions of coordinate (independent of stride)
Coord_t size_;
public:
//
// Device and Host Methods
//
/// Default constructor
CUTLASS_HOST_DEVICE
TensorView() {}
/// Constructs a TensorView from a TensorRef and size
CUTLASS_HOST_DEVICE
TensorView(TensorRef_t const& _ref, Coord_t const& _size) : Base(_ref), size_(_size) {}
/// Returns true if the TensorView is bound to some memory
CUTLASS_HOST_DEVICE
bool good() const { return ref().good(); }
/// Returns a pointer to data
CUTLASS_HOST_DEVICE
T* data() const { return ref().data(); }
/// Updates the reference and size of a TensorView object
CUTLASS_HOST_DEVICE
void reset(TensorRef_t const& _ref = TensorRef_t(), Coord_t const& _size = Coord_t()) {
Base::operator=(_ref);
size_ = _size;
}
/// Accesses the tensor reference pointing to data
CUTLASS_HOST_DEVICE
TensorRef_t& ref() { return *this; }
/// Returns a constant TensorRef to the referenced data
CUTLASS_HOST_DEVICE
ConstTensorRef_t const_ref() { return ConstTensorRef_t(data(), stride()); }
/// Accesses the tensor reference pointing to data
CUTLASS_HOST_DEVICE
TensorRef_t const& ref() const { return *this; }
/// Accesses the size
CUTLASS_HOST_DEVICE
Coord_t const& size() const { return size_; }
/// Accesses the size
CUTLASS_HOST_DEVICE
int size(int dim) const { return size_.at(dim); }
/// Accesses the stride
CUTLASS_HOST_DEVICE
Coord_t const& stride() const { return ref().stride(); }
/// Accesses the stride
CUTLASS_HOST_DEVICE
int const& stride(int dim) const { return ref().stride(dim); }
/// Assigns the TensorView
CUTLASS_HOST_DEVICE
TensorView& operator=(TensorView const& _tensor) {
Base::operator=(_tensor.ref());
size_ = _tensor.size_;
return *this;
}
/// Returns the index of an element
CUTLASS_HOST_DEVICE
Offset_t offset(Coord_t const& coord) const { return ref().offset(coord); }
/// Determines whether a location is within a tensor
CUTLASS_HOST_DEVICE
bool contains(Coord_t const& coord) const {
for (int dim = 0; dim < Rank; ++dim) {
if (coord.at(dim) >= size_.at(dim)) {
return false;
}
}
return true;
}
/// Element-wise accessor
CUTLASS_HOST_DEVICE
T& at(Coord_t const& coord) const { return ref().at(coord); }
/// Element-wise accessor
T& operator[](Coord<Rank> const& coord) const { return at(coord); }
/// Element-wise accessor
CUTLASS_HOST_DEVICE
T& at(Offset_t idx) const { return ref().at(idx); }
/// Returns a Tensor_view given location and size quantities
CUTLASS_HOST_DEVICE
TensorView<T> subview(Coord_t const& location, Coord_t size) const {
return TensorView<T>(ref() + location, size.clamp(size_ - location));
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
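
Similarly, a hypothetical sketch of TensorView usage (illustrative, not from the original file): bind a size to a rank-4 reference and guard accesses with contains().

#include <cutlass/tensor_view.h>
// Hypothetical example: a packed 2x3x4x5 tensor with bounds checking.
void tensor_view_example() {
float data[2 * 3 * 4 * 5];
cutlass::Coord<4> stride = cutlass::make_Coord(60, 20, 5, 1); // packed strides
cutlass::Coord<4> size = cutlass::make_Coord(2, 3, 4, 5);
cutlass::TensorView<float> view(cutlass::TensorRef<float, 4>(data, stride), size);
cutlass::Coord<4> p = cutlass::make_Coord(1, 2, 3, 4);
if (view.contains(p)) { // every coordinate must be within size
view.at(p) = 2.0f; // offset = 1*60 + 2*20 + 3*5 + 4*1 = 119
}
}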


@@ -1,881 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines the Tile Traits concept and iterators for loading and storing to tiles
efficiently.
*/
#pragma once
#include <cutlass/fragment.h>
#include <cutlass/load_store.h>
#include <cutlass/predicate_vector.h>
#include <cutlass/vector.h>
namespace cutlass {
///////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup tile_traits_concept Tile Traits Concept
@{
\ref tile_traits_concept is a type defining the shape of a tile and the distribution of accesses
by individual entities, whether threads or other participating agents.
@par Tile Traits Concept
Types satisfying \ref tile_traits_concept define the following members
- <b>Tile</b> - a type satisfying \ref layout_concept describing the dimensions of the tile
- <b>Delta</b> - a type satisfying \ref layout_concept describing the increments between accesses
along each dimension
- <b>Iterations</b> - a type satisfying \ref layout_concept describing the number of accesses
along each dimension
- <b>Offset</b> - the type of a <i>functor</i> computing the offset of each participating entity
as a Coord<4>.
@}
*/
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Specifies dimension in which post-increment accesses advance
struct IteratorAdvance {
enum Kind { kD, kH, kW };
};
/// Specifies whether iterator storage fragment consists of Scalar values or WMMA matrix
struct IteratorFragment {
enum Kind { kScalar, kWmmaMatrix };
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief A template defining \ref tile_traits_concept
* @concept{tile_traits_concept}
*/
template <typename Tile_, typename Delta_, typename Iterations_, typename ThreadOffset_>
struct TileTraits {
/// Shape of the tile
typedef Tile_ Tile;
/// Number of steps between accesses along each dimension
typedef Delta_ Delta;
/// Number of accesses performed
typedef Iterations_ Iterations;
/// Functor that returns the logical coordinate of each entity's initial offset in the tile
typedef ThreadOffset_ ThreadOffset;
};
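// For concreteness, a hypothetical instantiation of the concept above (illustrative,
// not part of the original file): a 1x8x32x1 tile covered by 256 threads laid out as
// 8 rows of 32, each thread making a single access.
struct ExampleThreadOffset {
CUTLASS_HOST_DEVICE
cutlass::Coord<4> operator()() const {
return cutlass::make_Coord(0, threadIdx.x / 32, threadIdx.x % 32, 0);
}
};
typedef cutlass::TileTraits<cutlass::Shape<1, 8, 32, 1>, // Tile: D=1, H=8, W=32, C=1
cutlass::Shape<1, 1, 1, 1>, // Delta between accesses
cutlass::Shape<1, 1, 1, 1>, // Iterations: one access per thread
ExampleThreadOffset>
ExampleTileTraits;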
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Iterator for accessing a stripmined tile in memory
template <typename Traits_,
typename Scalar_,
IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
typename Index_ = int,
typename FragmentElement_ = Scalar_,
IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
typename Skew_ = Shape<0, 0, 0, 0> >
struct TileIteratorBase {
/// concept TileTraits
typedef Traits_ Traits;
/// Scalar element
typedef Scalar_ Scalar;
/// Fragment element
typedef FragmentElement_ FragmentElement;
/// Specifies dimension in which post-increment accesses advance.
static IteratorAdvance::Kind const kAdvance = Advance_;
/// Specifies iterator storage fragment type (Scalar or WmmaMatrix)
static IteratorFragment::Kind const kIteratorFragment = IteratorFragment_;
/// Source or destination memory space
static MemorySpace::Kind const kMemorySpace = MemorySpace;
/// Index type
typedef Index_ Index;
/// Skew quantity
typedef Skew_ Skew;
/// Tile shape
typedef typename Traits::Tile Tile;
/// Distance along each dimension
typedef typename Traits::Delta Delta;
/// The strides in each dimension between different loads/stores.
typedef typename Traits::ImmediateOffsetStrides ImmediateOffsetStrides;
/// Iterations
typedef typename Traits::Iterations Iterations;
/// Thread offset
typedef typename Traits::ThreadOffset ThreadOffset;
/// The number of scalars accessed per load/store.
static int const kAccessSize = Tile::kC;
/// The elements loaded/store by one instruction.
typedef typename Vectorize<FragmentElement, kAccessSize>::Type AccessType;
/// The size of storage needed per fragment
static int const kFragmentSize =
(kIteratorFragment == IteratorFragment::kWmmaMatrix ? 16 : sizeof(AccessType));
/// The storage.
typedef Fragment<Scalar, ShapeCount<Tile>::kCount, kFragmentSize> Storage;
/// The fragment.
typedef Fragment<FragmentElement, ShapeCount<Iterations>::kCount * kAccessSize> Fragment;
/// The fragment iterator.
typedef FragmentIterator<Fragment, Iterations, AccessType> FragmentIterator;
/// The fragment const iterator.
typedef FragmentConstIterator<Fragment, Iterations, AccessType> FragmentConstIterator;
/// The shape of the fragment.
typedef typename FragmentIterator::FragmentShape FragmentShape;
/// Default predicate mask type
typedef PredicateVector<ShapeCount<Iterations>::kCount> PredicateVector;
//
// Params struct
//
/// Parameters to the iterator
struct Params {
Index stride_d;
Index stride_h;
Index stride_w;
Index inc_d;
Index inc_h;
Index inc_w;
Index inc_advance;
/// Initializes params
CUTLASS_HOST_DEVICE
int initialize(Index _stride_d,
Index _stride_h,
Index _stride_w,
Index _inc_d,
Index _inc_h,
Index _inc_w,
Index _inc_advance) {
stride_d = _stride_d;
stride_h = _stride_h;
stride_w = _stride_w;
inc_d = _inc_d;
inc_h = _inc_h;
inc_w = _inc_w;
inc_advance = _inc_advance;
return 0;
}
CUTLASS_HOST_DEVICE
int initialize(Index _stride_d, Index _stride_h, Index _stride_w) {
stride_d = _stride_d;
stride_h = _stride_h;
stride_w = _stride_w;
inc_w = stride_w * Delta::kW;
inc_h = stride_h * Delta::kH - stride_w * Delta::kW * (Iterations::kW - 1);
if (kAdvance == IteratorAdvance::kH) {
// Advance in the H dimension.
inc_d = 0;
} else if (kAdvance == IteratorAdvance::kW) {
// Advance in the W dimension.
inc_d = stride_w * Tile::kW - stride_h * Tile::kH;
} else {
// Advance in the D dimension.
inc_d = stride_d;
}
inc_advance = 0;
return 0;
}
CUTLASS_HOST_DEVICE int initialize() {
stride_d = 0;
stride_h = 0;
stride_w = 1;
inc_d = inc_h = inc_w = inc_advance = 0;
return 0;
}
};
/// Is the iterator valid?
CUTLASS_DEVICE bool valid(int d, int h, int w, int c) const { return true; }
//
// Static function members
//
/// Initializes a predicate vector
template <typename PredicateIterator>
CUTLASS_DEVICE static void initialize_predicates(PredicateIterator predicate_it,
Coord<3> const &bounds,
Coord<3> const &offset = make_Coord(0, 0, 0)) {
for (int d = 0; d < Iterations::kD; ++d) {
bool enable_d = (d * Delta::kD + offset[0] < bounds[0]);
for (int h = 0; h < Iterations::kH; ++h) {
bool enable_h = (h * Delta::kH + offset[1] < bounds[1]);
for (int w = 0; w < Iterations::kW; ++w) {
bool enable_w = (w * Tile::kC * Delta::kW + offset[2] < bounds[2]);
predicate_it.set(d, h, w, 0, enable_d && enable_h && enable_w);
}
}
}
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup tile_load_iterator_concept Tile Load Iterator Concept
@{
\ref tile_load_iterator_concept enables loading a tile from addressable memory into a fragment
@par Tile Load Iterator Concept
Types satisfying \ref tile_load_iterator_concept define the following members
- <b>PredicateVector</b> - a \ref predicate_vector_concept with sufficient predicate storage for
each access implied by the tile traits
- <b>Fragment</b> - the destination fragment type satisfying \ref fragment_concept
- <b>initialize_predicates(pred_it, bounds, block_offset)</b> - function initializing a predicate
vector according to externally specified bounds
- <b>load_post_increment(fragment, pred_it)</b> - a method that loads a fragment and increments
the iterator to the next tile, guarded by a \ref predicate_iterator_concept
- <b>load_post_increment(fragment)</b> - a method that loads a fragment and increments the
iterator to the next tile
- <b>load(fragment, pred_it)</b> - a const method that loads a fragment, guarded by a \ref
predicate_iterator_concept
- <b>load(fragment)</b> - a method that loads a fragment
@}
*/
///////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief An iterator implementing \ref tile_load_iterator_concept for loading a tile from memory
* @concept{tile_load_iterator_concept}
*/
template <typename Traits_,
typename Scalar_,
IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
typename Index_ = int,
typename FragmentElement_ = Scalar_,
IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
typename Skew_ = Shape<0, 0, 0, 0> >
struct TileLoadIterator : public TileIteratorBase<Traits_,
Scalar_,
Advance_,
MemorySpace,
Index_,
FragmentElement_,
IteratorFragment_,
Skew_> {
/// Base class
typedef TileIteratorBase<Traits_,
Scalar_,
Advance_,
MemorySpace,
Index_,
FragmentElement_,
IteratorFragment_,
Skew_>
Base;
/// concept TileTraits
typedef typename Base::Traits Traits;
/// Scalar element
typedef typename Base::Scalar Scalar;
/// Fragment element
typedef typename Base::FragmentElement FragmentElement;
/// Specifies in which dimension post-increment accesses advance.
static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
/// Specifies type of iterator fragment storage (Scalar or WmmaMatrix)
static IteratorFragment::Kind const kIteratorFragment = Base::kIteratorFragment;
/// Source or destination memory space
static MemorySpace::Kind const kMemorySpace = Base::kMemorySpace;
/// Index type
typedef typename Base::Index Index;
/// Skew quantity
typedef typename Base::Skew Skew;
/// Tile shape
typedef typename Base::Tile Tile;
/// Delta
typedef typename Base::Delta Delta;
/// Iterations
typedef typename Base::Iterations Iterations;
/// ThreadOffset functor
typedef typename Base::ThreadOffset ThreadOffset;
/// Fragment type
typedef typename Base::FragmentShape FragmentShape;
/// Memory access type
typedef typename Base::AccessType AccessType;
/// Fragment definition
typedef typename Base::Fragment Fragment;
/// Fragment iterator definition
typedef typename Base::FragmentIterator FragmentIterator;
/// Fragment const iterator definition
typedef typename Base::FragmentConstIterator FragmentConstIterator;
/// Default predicate mask type
typedef typename Base::PredicateVector PredicateVector;
/// Storage object that may be loaded from
typedef typename Base::Storage SharedStorage;
/// IteratorBase parameters
typedef typename Base::Params BaseParams;
/// Do we require a fence?
enum { kRequiresLoadFence = Tile::kD == 1 };
/// The pointer type
typedef Scalar const *Pointer;
/// Parameters
struct Params : public BaseParams {
/// Pointer to memory
Scalar const *pointer;
/// Initialize params to access storage object
CUTLASS_HOST_DEVICE
int initialize(SharedStorage const &storage) {
pointer = &storage[0];
return 0;
}
/// Initializes params to access a raw pointer
CUTLASS_HOST_DEVICE
int initialize(Scalar const *ptr, Index stride_d, Index stride_h, Index stride_w) {
Base::Params::initialize(stride_d, stride_h, stride_w);
pointer = ptr;
return 0;
}
/// Initializes params
CUTLASS_HOST_DEVICE
int initialize(Scalar const *ptr,
Index _stride_d,
Index _stride_h,
Index _stride_w,
Index _inc_d,
Index _inc_h,
Index _inc_w,
Index _inc_advance) {
pointer = ptr;
Base::Params::initialize(
_stride_d, _stride_h, _stride_w, _inc_d, _inc_h, _inc_w, _inc_advance);
return 0;
}
/// Initializes params to default values
CUTLASS_HOST_DEVICE
int initialize() { return Base::Params::initialize(); }
};
//
// Data members
//
/// Parameters structure
Params params;
/// Offset of an individual lane from the start of the tile
Coord<4> thread_offset;
/// Stage argument enables wrapping after some number of tiles have been loaded.
int stage;
//
// Static member functions
//
/// Initializes a predicate vector
template <typename PredicateIterator>
CUTLASS_HOST_DEVICE void initialize_predicates(PredicateIterator predicate_it,
Coord<3> const &bounds,
Coord<3> const &block_offset = make_Coord(0,
0,
0)) {
Base::initialize_predicates(
predicate_it,
bounds,
block_offset + make_Coord(0, thread_offset[1], thread_offset[2] * Tile::kC));
}
//
// Methods
//
/// Default constructor
CUTLASS_HOST_DEVICE
TileLoadIterator() {}
/// Constructs a tile load iterator
CUTLASS_HOST_DEVICE
TileLoadIterator(Params const &_params,
Coord<3> const &block_offset = make_Coord(0, 0, 0),
ThreadOffset thread_offset_func = ThreadOffset())
: params(_params), stage(0) {
thread_offset = thread_offset_func();
Index block_offset_h = 0;
Index block_offset_w = 0;
if (kAdvance == IteratorAdvance::kH) {
block_offset_h = block_offset[1];
block_offset_w = block_offset[2];
} else {
block_offset_h = block_offset[2];
block_offset_w = block_offset[1];
}
params.pointer += block_offset[0] * params.stride_d +
(block_offset_h + thread_offset[1]) * params.stride_h +
(block_offset_w + thread_offset[2] * Tile::kC) / Tile::kC * params.stride_w;
}
/// Constructs a tile load iterator
CUTLASS_HOST_DEVICE
TileLoadIterator(Params const &,
SharedStorage &shared_storage,
Coord<3> const &block_offset = make_Coord(0, 0, 0),
ThreadOffset thread_offset_func = ThreadOffset())
: stage(0) {
int const offset = thread_offset_func()[2];
params.pointer = &shared_storage[offset];
}
/// Returns the current pointer
CUTLASS_HOST_DEVICE
Scalar const *data() const { return params.pointer; }
/// Increment in the D dimension
CUTLASS_HOST_DEVICE void inc_d() { params.pointer += params.inc_d; }
/// Increment in the H dimension
CUTLASS_HOST_DEVICE void inc_h() { params.pointer += params.inc_h; }
/// Increment in the W dimension
CUTLASS_HOST_DEVICE void inc_w() { params.pointer += params.inc_w; }
/// Increment in the next dimension
CUTLASS_HOST_DEVICE void inc_advance() { params.pointer += params.inc_advance; }
/// Increment the stage.
CUTLASS_DEVICE void inc_stage() {
if (Tile::kD > 1) {
int const kStageSize = Tile::kH * Tile::kW * Tile::kC;
if (stage == Tile::kD - 1) {
params.pointer -= (Tile::kD - 1) * kStageSize;
stage = 0;
} else {
params.pointer += kStageSize;
stage = stage + 1;
}
}
}
public:
/// Loads a fragment and advances the iterator to the next tile.
template <typename Fragment, typename PredicateIterator>
CUTLASS_HOST_DEVICE void load_post_increment(Fragment &fragment, PredicateIterator pred_it) {
FragmentIterator frag_iterator(fragment);
for (int d = 0; d < Iterations::kD; ++d) {
for (int h = 0; h < Iterations::kH; ++h) {
for (int w = 0; w < Iterations::kW; ++w, ++pred_it) {
if (*pred_it) {
Load<typename Fragment::Element, Tile::kC, kMemorySpace>::load(
reinterpret_cast<AccessType &>(frag_iterator.at(d, h, w, 0)), data(), 0);
}
if (w < Iterations::kW - 1) {
inc_w();
}
}
if (h < Iterations::kH - 1) {
inc_h();
}
}
if (d < Iterations::kD - 1) {
inc_d();
}
}
inc_advance();
}
/// Loads a fragment and advances the iterator to the next tile.
template <typename Fragment>
CUTLASS_HOST_DEVICE void load_post_increment(Fragment &fragment) {
typename PredicateVector::TrivialIterator pred_it;
load_post_increment(fragment, pred_it);
}
/// Loads a fragment without advancing the iterator.
template <typename Fragment, typename PredicateIterator>
CUTLASS_HOST_DEVICE void load(Fragment &fragment, PredicateIterator pred_it) const {
TileLoadIterator _load_it(*this);
_load_it.load_post_increment(fragment, pred_it);
}
/// Loads a fragment without advancing the iterator.
template <typename Fragment>
CUTLASS_HOST_DEVICE void load(Fragment &fragment) const {
typename PredicateVector::TrivialIterator pred_it;
load(fragment, pred_it);
}
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/*!@defgroup tile_store_iterator_concept Tile Store Iterator Concept
@{
\ref tile_store_iterator_concept enables storing a tile to addressable memory
@par Tile Store Iterator Concept
Types satisfying \ref tile_store_iterator_concept define the following members
- <b>PredicateVector</b> - a \ref predicate_vector_concept with sufficient predicate storage for
each access implied by the tile traits
- <b>Fragment</b> - the source fragment type satisfying \ref fragment_concept
- <b>initialize_predicates(pred_it, bounds, block_offset)</b> - function initializing a predicate
vector according to externally specified bounds
- <b>store_post_increment(fragment, pred_it)</b> - a method that stores a fragment and increments
the iterator to the next tile, guarded by a \ref predicate_iterator_concept
- <b>store_post_increment(fragment)</b> - a method that stores a fragment and increments the
iterator to the next tile
- <b>store(fragment, pred_it)</b> - a const method that stores a fragment, guarded by a \ref
predicate_iterator_concept
- <b>store(fragment)</b> - a method that stores a fragment
@}
*/
///////////////////////////////////////////////////////////////////////////////////////////////////
/**
* @brief An iterator implementing \ref tile_store_iterator_concept for storing a tile to memory
* @concept{tile_store_iterator_concept}
*/
template <typename Traits_,
typename Scalar_,
IteratorAdvance::Kind Advance_ = IteratorAdvance::kH,
MemorySpace::Kind MemorySpace = MemorySpace::kGeneric,
typename Index_ = int,
typename FragmentElement_ = Scalar_,
IteratorFragment::Kind IteratorFragment_ = IteratorFragment::kScalar,
typename Skew_ = Shape<0, 0, 0, 0> >
struct TileStoreIterator : public TileIteratorBase<Traits_,
Scalar_,
Advance_,
MemorySpace,
Index_,
FragmentElement_,
IteratorFragment_,
Skew_> {
/// Base class
typedef TileIteratorBase<Traits_,
Scalar_,
Advance_,
MemorySpace,
Index_,
FragmentElement_,
IteratorFragment_,
Skew_>
Base;
/// concept TileTraits
typedef typename Base::Traits Traits;
/// Scalar element
typedef typename Base::Scalar Scalar;
/// Fragment element
typedef typename Base::FragmentElement FragmentElement;
/// Specifies in which dimension post-increment accesses advance.
static IteratorAdvance::Kind const kAdvance = Base::kAdvance;
/// Specifies type of iterator fragment storage (Scalar or WmmaMatrix)
static IteratorFragment::Kind const kIteratorFragment = Base::kIteratorFragment;
/// Source or destination memory space
static MemorySpace::Kind const kMemorySpace = Base::kMemorySpace;
/// Index type
typedef typename Base::Index Index;
/// Skew quantity
typedef typename Base::Skew Skew;
/// Tile shape
typedef typename Base::Tile Tile;
/// Delta
typedef typename Base::Delta Delta;
/// Iterations
typedef typename Base::Iterations Iterations;
/// ThreadOffset functor
typedef typename Base::ThreadOffset ThreadOffset;
/// Fragment type
typedef typename Base::FragmentShape FragmentShape;
/// Memory access type
typedef typename Base::AccessType AccessType;
/// Fragment definition
typedef typename Base::Fragment Fragment;
/// Fragment iterator definition
typedef typename Base::FragmentIterator FragmentIterator;
/// Fragment const iterator definition
typedef typename Base::FragmentConstIterator FragmentConstIterator;
/// Default predicate mask type
typedef typename Base::PredicateVector PredicateVector;
/// Storage object which may be stored to
typedef typename Base::Storage SharedStorage;
/// IteratorBase parameters
typedef typename Base::Params BaseParams;
/// Parameters
struct Params : public BaseParams {
/// Pointer to memory
Scalar *pointer;
/// Initialize params to access storage object
CUTLASS_HOST_DEVICE
int initialize(SharedStorage &storage) {
pointer = &storage[0];
return 0;
}
/// Initializes params to access a raw pointer
CUTLASS_HOST_DEVICE
int initialize(Scalar *ptr, Index stride_d, Index stride_h, Index stride_w) {
Base::Params::initialize(stride_d, stride_h, stride_w);
pointer = ptr;
return 0;
}
/// Initializes params
CUTLASS_HOST_DEVICE
int initialize(Scalar *ptr,
Index _stride_d,
Index _stride_h,
Index _stride_w,
Index _inc_d,
Index _inc_h,
Index _inc_w,
Index _inc_advance) {
pointer = ptr;
Base::Params::initialize(
_stride_d, _stride_h, _stride_w, _inc_d, _inc_h, _inc_w, _inc_advance);
return 0;
}
/// Initializes params to default values
CUTLASS_HOST_DEVICE
int initialize() { return Base::Params::initialize(); }
};
//
// Data members
//
/// Parameters structure
Params params;
/// Offset of an individual lane from the start of the tile
Coord<4> thread_offset;
/// The stage.
int stage;
//
// Static member functions
//
/// Initializes a predicate vector
template <typename PredicateIterator>
CUTLASS_HOST_DEVICE void initialize_predicates(PredicateIterator predicate_it,
Coord<3> const &bounds,
Coord<3> const &block_offset = make_Coord(0,
0,
0)) {
Base::initialize_predicates(
predicate_it,
bounds,
block_offset + make_Coord(0, thread_offset[1], thread_offset[2] * Tile::kC));
}
//
// Methods
//
/// Default constructor
CUTLASS_HOST_DEVICE
TileStoreIterator() {}
/// Constructs a tile store iterator
CUTLASS_HOST_DEVICE
TileStoreIterator(Params const &_params,
Coord<3> const &block_offset = make_Coord(0, 0, 0),
ThreadOffset thread_offset_func = ThreadOffset())
: params(_params), stage(0) {
thread_offset = thread_offset_func();
params.pointer += block_offset[0] * params.stride_d +
(block_offset[1] + thread_offset[1]) * params.stride_h +
(block_offset[2] + thread_offset[2] * Tile::kC) / Tile::kC * params.stride_w;
}
/// Constructs a tile store iterator
CUTLASS_HOST_DEVICE
TileStoreIterator(Params const &,
SharedStorage &shared_storage,
Coord<3> const &block_offset = make_Coord(0, 0, 0),
ThreadOffset thread_offset_func = ThreadOffset())
: stage(0) {
int const offset = thread_offset_func()[2];
params.pointer = &shared_storage[offset];
}
/// Returns the current pointer
CUTLASS_HOST_DEVICE
Scalar *data() const { return params.pointer; }
/// Increment in the D dimension
CUTLASS_HOST_DEVICE void inc_d() { params.pointer += params.inc_d; }
/// Increment in the H dimension
CUTLASS_HOST_DEVICE void inc_h() { params.pointer += params.inc_h; }
/// Increment in the W dimension
CUTLASS_HOST_DEVICE void inc_w() { params.pointer += params.inc_w; }
/// Increment in the next dimension
CUTLASS_HOST_DEVICE void inc_advance() {}
/// Increment the stage.
CUTLASS_DEVICE void inc_stage() {
if (Tile::kD > 1) {
int const kStageSize = Tile::kH * Tile::kW * Tile::kC;
if (stage == Tile::kD - 1) {
params.pointer -= (Tile::kD - 1) * kStageSize;
stage = 0;
} else {
params.pointer += kStageSize;
stage = stage + 1;
}
}
}
public:
/// Stores a fragment and advances to the next tile.
template <typename Fragment, typename PredicateIterator>
CUTLASS_HOST_DEVICE void store_post_increment(Fragment &fragment, PredicateIterator pred_it) {
FragmentIterator frag_iterator(fragment);
for (int d = 0; d < Iterations::kD; ++d) {
for (int h = 0; h < Iterations::kH; ++h) {
for (int w = 0; w < Iterations::kW; ++w, ++pred_it) {
if (*pred_it) {
Store<typename Fragment::Element, Tile::kC, kMemorySpace>::store(
reinterpret_cast<AccessType &>(frag_iterator.at(d, h, w, 0)), data(), 0);
}
if (w < Iterations::kW - 1) {
inc_w();
}
}
if (h < Iterations::kH - 1) {
inc_h();
}
}
if (d < Iterations::kD - 1) {
inc_d();
}
}
inc_advance();
}
/// Stores a fragment and advances to the next tile.
template <typename Fragment>
CUTLASS_HOST_DEVICE void store_post_increment(Fragment &fragment) {
typename PredicateVector::TrivialIterator pred_it;
store_post_increment(fragment, pred_it);
}
/// Stores a fragment without advancing the iterator.
template <typename Fragment, typename PredicateIterator>
CUTLASS_HOST_DEVICE void store(Fragment &fragment, PredicateIterator pred_it) const {
TileStoreIterator _store_it(*this);
_store_it.store_post_increment(fragment, pred_it);
}
/// Stores a fragment without advancing the iterator.
template <typename Fragment>
CUTLASS_HOST_DEVICE void store(Fragment &fragment) const {
typename PredicateVector::TrivialIterator pred_it;
store(fragment, pred_it);
}
};
} // namespace cutlass
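
To see how the two iterators are meant to compose, here is a hypothetical device-side sketch (not part of the original file) that copies one tile through registers. It assumes a traits type that additionally provides ImmediateOffsetStrides, as the TileTraitsStandard defined in the next file does; all names in the sketch are illustrative.

// Hypothetical sketch: load a tile into a Fragment, then store it elsewhere.
template <typename Traits>
__device__ void copy_tile(float const *src, float *dst, int stride_h) {
typedef cutlass::TileLoadIterator<Traits, float> LoadIterator;
typedef cutlass::TileStoreIterator<Traits, float> StoreIterator;
typename LoadIterator::Params load_params;
load_params.initialize(src, /*stride_d=*/0, stride_h, /*stride_w=*/1);
typename StoreIterator::Params store_params;
store_params.initialize(dst, /*stride_d=*/0, stride_h, /*stride_w=*/1);
LoadIterator load_iterator(load_params);
StoreIterator store_iterator(store_params);
typename LoadIterator::Fragment fragment;
// The unguarded forms use a trivial (all-true) predicate iterator; the guarded
// overloads accept a predicate iterator set up via initialize_predicates().
load_iterator.load_post_increment(fragment);
store_iterator.store_post_increment(fragment);
}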


@@ -1,238 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines tile traits for several tile partitioning arrangements of threads expected to
achieve efficient streaming performance.
*/
#pragma once
#include <cutlass/tile_iterator.h>
namespace cutlass {
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Basic thread offset function computed from a thread shape
template <typename ThreadShape>
struct TiledThreadOffset {
/// Computes the logical coordinate from thread shape
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
Coord<4> thread_offset;
int index = threadIdx.x;
thread_offset[3] = (index % ThreadShape::kC);
index = (index / ThreadShape::kC);
thread_offset[2] = (index % ThreadShape::kW);
index = (index / ThreadShape::kW);
thread_offset[1] = (index % ThreadShape::kH);
index = (index / ThreadShape::kH);
thread_offset[0] = index;
return thread_offset;
}
};
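// Worked decomposition (illustrative): with ThreadShape = Shape<1, 4, 8, 1> and
// threadIdx.x == 27:
//   c = 27 % 1 = 0; index = 27 / 1 = 27
//   w = 27 % 8 = 3; index = 27 / 8 = 3
//   h = 3 % 4 = 3; index = 3 / 4 = 0
//   d = 0
// so thread 27 starts at logical coordinate (0, 3, 3, 0).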
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Tiling in which the number of threads is greater than the
/// contiguous dimension of the tile.
template <typename Tile_, int Threads>
struct TileTraitsStrideMajor {
/// Shape of tile
typedef Tile_ Tile;
/// Number of participating threads
static int const kThreads = Threads;
// Static assertions
static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
"Tiling undefined if elements not divisible by threads.");
static_assert(Tile::kW <= kThreads,
"This specialization assumes there are more threads than the contiguous dimension "
"of the tile.");
/// Shape of threads
typedef Shape<1, kThreads / Tile::kW, Tile::kW, 1> ThreadShape;
/// Delta along each dimension
typedef Shape<1, ThreadShape::kH, 1, 1> Delta;
/// Number of iterations
typedef Shape<1, Tile::kH / ThreadShape::kH, 1, 1> Iterations;
/// Computes the initial offset
typedef TiledThreadOffset<ThreadShape> ThreadOffset;
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Tiling in which the number of threads is fewer than the tile size
/// in the contiguous dimension.
template <typename Tile_, int Threads>
struct TileTraitsContiguousMajor {
/// Shape of tile
typedef Tile_ Tile;
/// Number of participating threads
static int const kThreads = Threads;
// Static assertions
static_assert(Tile::kW >= kThreads,
"This specialization assumes the contiguous dimension of the tile is at least as "
"large as the number of threads.");
static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
"Tiling undefined if elements not divisible by threads.");
static_assert(!(Tile::kW % kThreads),
"The contiguous size of the tile must be divisible by the number of threads.");
/// Thread shape
typedef Shape<1, 1, kThreads> ThreadShape;
/// Delta between each thread's access
typedef Shape<1, 1, kThreads> Delta;
/// Number of iterations
typedef Shape<1, Tile::kH, Tile::kW / kThreads> Iterations;
/// Computes the initial offset
typedef TiledThreadOffset<ThreadShape> ThreadOffset;
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Tiling in which warps rake across the contiguous dimension
template <typename Tile_, int Threads>
struct TileTraitsWarpRake {
/// Shape of tile
typedef Tile_ Tile;
/// Number of participating threads
static int const kThreads = Threads;
/// Hard-coded warp size
static int const kWarpSize = 32;
/// Number of participating warps
static int const kWarpCount = kThreads / kWarpSize;
// Static assertions
static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
"Tiling undefined if elements not divisible by threads.");
static_assert(!(kThreads % kWarpSize), "Number of threads must be divisible by the warp size.");
static_assert(!(Tile::kW % kWarpSize), "Contiguous dimension must be divisible by the warp size");
/// Warps strip-mined across strided dimension
static int const kWarpsStrided = __NV_STD_MIN(kWarpCount, Tile::kH);
/// Warps stripmined contiguous dimension
static int const kWarpsContiguous = kWarpCount / kWarpsStrided;
/// Arrangement of threads
typedef Shape<1, kWarpsStrided, kWarpsContiguous * kWarpSize> ThreadShape;
/// The same warp rakes along the contiguous dimension
typedef Shape<1, kWarpsStrided, kWarpSize> Delta;
/// Number of iterations
typedef Shape<1, Tile::kH / Delta::kH, Tile::kW / ThreadShape::kW> Iterations;
/// Computes the thread offset in (H, W) based on thread ID
struct ThreadOffset {
/// Basic thread offset function computed from a thread shape
CUTLASS_HOST_DEVICE
Coord<4> operator()() const {
int tid = threadIdx.x;
int warp = (tid / kWarpSize);
int lane = (tid % kWarpSize);
static int const kWarpSpanContiguous = kWarpSize * Iterations::kW;
int warp_w = (warp % kWarpsContiguous);
int warp_h = (warp / kWarpsContiguous);
return make_Coord(0, warp_h, lane + kWarpSpanContiguous * warp_w, 0);
}
};
};
///////////////////////////////////////////////////////////////////////////////////////////////////
/// Chooses 'best' shape to enable warp raking along contiguous dimension if possible.
template <typename Tile_, int Threads>
struct TileTraitsStandard {
/// Shape of tile
typedef Tile_ Tile;
/// Number of participating threads
static int const kThreads = Threads;
/// Hard-coded warp size
static int const kWarpSize = 32;
/// Number of participating warps
static int const kWarpCount = kThreads / kWarpSize;
// Static assertions
static_assert(!(ShapeCount<Tile>::kDhw % kThreads),
"Tiling undefined if elements not divisible by threads.");
/// Choose the stride-major contiguous tiling if the contiguous dimension is
/// smaller than the warp size. Otherwise, if it is divisible by the warp size,
/// choose the warp rake arrangement.
typedef typename platform::conditional<
(Tile::kW < kWarpSize),
TileTraitsStrideMajor<Tile, Threads>,
typename platform::conditional<!(Tile::kW % kWarpSize),
TileTraitsWarpRake<Tile, Threads>,
TileTraitsContiguousMajor<Tile, Threads> >::type>::type Traits;
/// Delta between accesses
typedef typename Traits::Delta Delta;
/// Delta between each thread's access
/// TODO MTA this is wrong for sure, but Delta is used for stride computation at the moment
typedef Delta ImmediateOffsetStrides;
/// Number of accesses
typedef typename Traits::Iterations Iterations;
/// Thread offset functor
typedef typename Traits::ThreadOffset ThreadOffset;
};
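// Usage note (illustrative, not from the original file): for a 1x16x128x1 tile and
// 256 threads, Tile::kW (128) is not smaller than the warp size and is divisible by
// it, so the conditional above resolves Traits to TileTraitsWarpRake.
typedef cutlass::TileTraitsStandard<cutlass::Shape<1, 16, 128, 1>, 256> StandardTraits;
// With 8 warps strip-mined across H: Iterations = <1, 2, 4>, i.e. 8 accesses/thread.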
///////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass


@@ -1,131 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
/**
* \file
* \brief Math utilities
*/
#include <cutlass/util/platform.h>
namespace cutlass {
/******************************************************************************
* Static math utilities
******************************************************************************/
/**
* Statically determine if N is a power-of-two
*/
template <int N>
struct is_pow2 : platform::integral_constant<bool, (N & (N - 1)) == 0> {};
/**
* Statically determine log2(N), rounded down
*/
template <int N, int CurrentVal = N, int Count = 0>
struct log2_down {
/// Static logarithm value
enum { value = log2_down<N, (CurrentVal >> 1), Count + 1>::value };
};
// Base case
template <int N, int Count>
struct log2_down<N, 1, Count> {
enum { value = Count };
};
/**
* Statically determine log2(N), rounded up
*/
template <int N, int CurrentVal = N, int Count = 0>
struct log2_up {
/// Static logarithm value
enum { value = log2_up<N, (CurrentVal >> 1), Count + 1>::value };
};
// Base case
template <int N, int Count>
struct log2_up<N, 1, Count> {
enum { value = ((1 << Count) < N) ? Count + 1 : Count };
};
/**
* Statically estimate sqrt(N) to the nearest power-of-two
*/
template <int N>
struct sqrt_est {
enum { value = 1 << (log2_up<N>::value / 2) };
};
/**
* For performing a constant-division with a compile-time assertion that the
* Divisor evenly-divides the Dividend.
*/
template <int Dividend, int Divisor>
struct divide_assert {
enum { value = Dividend / Divisor };
static_assert((Dividend % Divisor == 0), "Not an even multiple");
};
/******************************************************************************
* Rounding
******************************************************************************/
/**
* Round dividend up to the nearest multiple of divisor
*/
template <typename dividend_t, typename divisor_t>
CUTLASS_HOST_DEVICE dividend_t round_nearest(dividend_t dividend, divisor_t divisor) {
return ((dividend + divisor - 1) / divisor) * divisor;
}
/**
* Greatest common divisor
*/
template <typename value_t>
CUTLASS_HOST_DEVICE value_t gcd(value_t a, value_t b) {
for (;;) {
if (a == 0) return b;
b %= a;
if (b == 0) return a;
a %= b;
}
}
/**
* Least common multiple
*/
template <typename value_t>
CUTLASS_HOST_DEVICE value_t lcm(value_t a, value_t b) {
value_t temp = gcd(a, b);
return temp ? (a / temp * b) : 0;
}
} // namespace cutlass
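
A few illustrative compile-time and runtime checks of these utilities (not part of the original file; they assume only the declarations above are visible):

// Compile-time checks of the static math utilities.
static_assert(cutlass::is_pow2<64>::value, "64 is a power of two");
static_assert(cutlass::log2_down<48>::value == 5, "floor(log2(48)) == 5");
static_assert(cutlass::log2_up<48>::value == 6, "ceil(log2(48)) == 6");
static_assert(cutlass::divide_assert<96, 32>::value == 3, "96 / 32 == 3");
// Runtime rounding and divisor helpers.
void math_examples() {
int aligned = cutlass::round_nearest(100, 32); // rounds up to 128
int g = cutlass::gcd(24, 36); // 12
int l = cutlass::lcm(4, 6); // 12
(void)aligned; (void)g; (void)l;
}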


@@ -1,122 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
/**
* \file
* \brief Debugging and logging functionality
*/
#include <stdio.h>
#include <cuda_runtime.h>
namespace cutlass {
/******************************************************************************
* Debug and logging macros
******************************************************************************/
/**
* Formats and prints the given message to stdout
*/
#if !defined(CUDA_LOG)
#if !defined(__CUDA_ARCH__)
#define CUDA_LOG(format, ...) printf(format, __VA_ARGS__)
#else
#define CUDA_LOG(format, ...) \
printf("[block (%d,%d,%d), thread (%d,%d,%d)]: " format, \
blockIdx.x, \
blockIdx.y, \
blockIdx.z, \
threadIdx.x, \
threadIdx.y, \
threadIdx.z, \
__VA_ARGS__);
#endif
#endif
/**
* Formats and prints the given message to stdout only if DEBUG is defined
*/
#if !defined(CUDA_LOG_DEBUG)
#ifdef DEBUG
#define CUDA_LOG_DEBUG(format, ...) CUDA_LOG(format, __VA_ARGS__)
#else
#define CUDA_LOG_DEBUG(format, ...)
#endif
#endif
/**
* \brief The corresponding error message is printed to \p stderr (or \p stdout in device code)
* along with the supplied source context.
*
* \return The CUDA error.
*/
__host__ CUTLASS_DEVICE cudaError_t cuda_perror_impl(cudaError_t error,
const char* filename,
int line) {
(void)filename;
(void)line;
if (error) {
#if !defined(__CUDA_ARCH__)
fprintf(
stderr, "CUDA error %d [%s, %d]: %s\n", error, filename, line, cudaGetErrorString(error));
fflush(stderr);
#else
printf("CUDA error %d [%s, %d]\n", error, filename, line);
#endif
}
return error;
}
/**
* \brief Perror macro
*/
#ifndef CUDA_PERROR
#define CUDA_PERROR(e) cuda_perror_impl((cudaError_t)(e), __FILE__, __LINE__)
#endif
/**
* \brief Perror macro with exit
*/
#ifndef CUDA_PERROR_EXIT
#define CUDA_PERROR_EXIT(e) \
if (cuda_perror_impl((cudaError_t)(e), __FILE__, __LINE__)) { \
exit(1); \
}
#endif
/**
* \brief Perror macro only if DEBUG is defined
*/
#ifndef CUDA_PERROR_DEBUG
#ifdef DEBUG
#define CUDA_PERROR_DEBUG(e) CUDA_PERROR(e)
#else
#define CUDA_PERROR_DEBUG(e) (e)
#endif
#endif
} // namespace cutlass
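
An illustrative host-side use of these macros (the function name is hypothetical):

#include <cuda_runtime.h>
// Hypothetical example: check CUDA calls, exiting with file/line context on failure.
int debug_example() {
void *device_ptr = 0;
CUDA_PERROR_EXIT(cudaMalloc(&device_ptr, 1024)); // prints context and exits on error
CUDA_LOG("allocated %d bytes\n", 1024);
CUDA_PERROR_EXIT(cudaFree(device_ptr));
return 0;
}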


@@ -1,801 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
#pragma once
/**
* \file
* \brief C++ features that may be otherwise unimplemented for CUDA device functions.
*
* This file has three components:
*
* (1) Macros:
* - Empty macro defines for C++ keywords not supported by the current
* version of C++. These simply allow compilation to proceed (but do
* not provide the added semantics).
* - \p noexcept
* - \p constexpr
* - \p nullptr
* - \p static_assert
*
* - Macro functions that we need in constant expressions because the
* C++ equivalents require constexpr compiler support. These are
* prefixed with \p __NV_STD_*
* - \p __NV_STD_MAX
* - \p __NV_STD_MIN
*
* (2) Re-implementations of STL functions and types:
* - C++ features that need the \p __device__ annotation. These are
* placed into the \p platform namespace.
* - \p plus
* - \p less
* - \p greater
* - \p min
* - \p max
* - \p methods on std::pair (==, !=, <, <=, >, >=, and make_pair())
*
* (3) Stop-gap implementations of unsupported STL functions and types:
* - STL functions and types defined by C++ 11/14/17/etc. that are not
* provided by the current version of C++. These are placed into the
* \p platform namespace
* - \p integral_constant
* - \p nullptr_t
* - \p true_type
* - \p false_type
* - \p bool_constant
* - \p enable_if
* - \p conditional
* - \p is_same
* - \p is_base_of
* - \p remove_const
* - \p remove_volatile
* - \p remove_cv
* - \p is_volatile
* - \p is_pointer
* - \p is_void
* - \p is_integral
* - \p is_floating_point
* - \p is_arithmetic
* - \p is_fundamental
* - \p is_trivially_copyable
* - \p alignment_of
* - \p aligned_storage
*
* (4) Functions and types that are STL-like (but aren't in the STL):
* - \p TODO: min and max functors?
*
* The idea is that, as we drop support for older compilers, we can simply #define
* the \p __NV_STD_XYZ macros and \p platform namespace to alias their C++
* counterparts (or trivially find-and-replace their occurrences in code text).
*/
//-----------------------------------------------------------------------------
// Dependencies
//-----------------------------------------------------------------------------
#include <stdint.h>
#if !defined(__CUDACC_RTC__)
//-----------------------------------------------------------------------------
// Include STL files that platform provides functionality for
//-----------------------------------------------------------------------------
#include <algorithm> // Minimum/maximum operations
#include <cstddef> // nullptr_t
#include <functional> // Arithmetic operations
#include <utility> // For methods on std::pair
#if (!defined(_MSC_VER) && (__cplusplus >= 201103L)) || (defined(_MSC_VER) && (_MSC_VER >= 1500))
#include <type_traits> // For integral constants, conditional metaprogramming, and type traits
#endif
#include <cutlass/cutlass.h>
#endif
/******************************************************************************
* Macros
******************************************************************************/
//-----------------------------------------------------------------------------
// Keywords
//-----------------------------------------------------------------------------
/// noexcept, constexpr
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1900))
#ifndef noexcept
#define noexcept
#endif
#ifndef constexpr
#define constexpr
#endif
#endif
/// nullptr
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1310))
#ifndef nullptr
#define nullptr 0
#endif
#endif
/// static_assert
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1600))
#ifndef static_assert
#define __platform_cat_(a, b) a##b
#define __platform_cat(a, b) __platform_cat_(a, b)
#define static_assert(__e, __m) typedef int __platform_cat(AsSeRt, __LINE__)[(__e) ? 1 : -1]
#endif
#endif
//-----------------------------------------------------------------------------
// Functions
//-----------------------------------------------------------------------------
/// Select maximum(a, b)
#ifndef __NV_STD_MAX
#define __NV_STD_MAX(a, b) (((b) > (a)) ? (b) : (a))
#endif
/// Select minimum(a, b)
#ifndef __NV_STD_MIN
#define __NV_STD_MIN(a, b) (((b) < (a)) ? (b) : (a))
#endif
/******************************************************************************
* Re-implementations
******************************************************************************/
namespace cutlass {
namespace platform {
//-----------------------------------------------------------------------------
// Arithmetic operations, comparisons <functional>
//-----------------------------------------------------------------------------
/// platform::plus
template <typename T>
struct plus {
CUTLASS_HOST_DEVICE constexpr T operator()(const T& lhs, const T& rhs) const { return lhs + rhs; }
};
/// std::less
template <typename T>
struct less {
CUTLASS_HOST_DEVICE constexpr bool operator()(const T& lhs, const T& rhs) const {
return lhs < rhs;
}
};
/// std::greater
template <typename T>
struct greater {
CUTLASS_HOST_DEVICE constexpr bool operator()(const T& lhs, const T& rhs) const {
return lhs > rhs;
}
};
//-----------------------------------------------------------------------------
// Minimum/maximum operations <algorithm>
//-----------------------------------------------------------------------------
/// std::min
template <typename T>
CUTLASS_HOST_DEVICE constexpr const T& min(const T& a, const T& b) {
return (b < a) ? b : a;
}
/// std::max
template <typename T>
CUTLASS_HOST_DEVICE constexpr const T& max(const T& a, const T& b) {
return (a < b) ? b : a;
}
#if !defined(__CUDACC_RTC__)
//-----------------------------------------------------------------------------
// Methods on std::pair
//-----------------------------------------------------------------------------
using std::pair;
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator==(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return (lhs.first == rhs.first) && (lhs.second == rhs.second);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator!=(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return (lhs.first != rhs.first) || (lhs.second != rhs.second);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator<(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return (lhs.first < rhs.first) ? true : (rhs.first < lhs.first) ? false
: (lhs.second < rhs.second);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator<=(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return !(rhs < lhs);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator>(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return (rhs < lhs);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE constexpr bool operator>=(const pair<T1, T2>& lhs, const pair<T1, T2>& rhs) {
return !(lhs < rhs);
}
template <class T1, class T2>
CUTLASS_HOST_DEVICE std::pair<T1, T2> make_pair(T1 t, T2 u) {
std::pair<T1, T2> retval;
retval.first = t;
retval.second = u;
return retval;
}
#endif
} // namespace platform
/******************************************************************************
* Implementations of C++ 11/14/17/... STL features
******************************************************************************/
namespace platform {
//-----------------------------------------------------------------------------
// Integral constant helper types <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::integral_constant
template <typename value_t, value_t V>
struct integral_constant;
/// std::integral_constant
template <typename value_t, value_t V>
struct integral_constant {
static const value_t value = V;
typedef value_t value_type;
typedef integral_constant<value_t, V> type;
CUTLASS_HOST_DEVICE operator value_type() const { return value; }
CUTLASS_HOST_DEVICE const value_type operator()() const { return value; }
};
#else
using std::integral_constant;
using std::pair;
#endif
/// The type used as a compile-time boolean with true value.
typedef integral_constant<bool, true> true_type;
/// The type used as a compile-time boolean with false value.
typedef integral_constant<bool, false> false_type;
#if (!defined(_MSC_VER) && (__cplusplus < 201402L)) || (defined(_MSC_VER) && (_MSC_VER < 1900))
/// std::bool_constant
template <bool V>
struct bool_constant : platform::integral_constant<bool, V> {};
#else
using std::bool_constant;
#endif
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1700))
/// std::nullptr_t
struct nullptr_t {};
#else
using std::nullptr_t;
#endif
//-----------------------------------------------------------------------------
// Conditional metaprogramming <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1600))
/// std::enable_if (true specialization)
template <bool C, typename T = void>
struct enable_if {
typedef T type;
};
/// std::enable_if (false specialization)
template <typename T>
struct enable_if<false, T> {};
/// std::conditional (true specialization)
template <bool B, class T, class F>
struct conditional {
typedef T type;
};
/// std::conditional (false specialization)
template <class T, class F>
struct conditional<false, T, F> {
typedef F type;
};
#else
using std::enable_if;
using std::conditional;
#endif
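// Example (editor's sketch): enable_if removes an overload from the
// candidate set when its condition is false; conditional selects a type at
// compile time. The names below are hypothetical illustrations.
template <typename T>
typename enable_if<(sizeof(T) <= 4), T>::type
small_only(T x) { return x; }  // participates only when sizeof(T) <= 4
typedef conditional<true, int, float>::type SelectsInt;  // SelectsInt is int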
//-----------------------------------------------------------------------------
// Const/volatility specifiers <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::remove_const (non-const specialization)
template <typename T>
struct remove_const {
typedef T type;
};
/// std::remove_const (const specialization)
template <typename T>
struct remove_const<const T> {
typedef T type;
};
/// std::remove_volatile (non-volatile specialization)
template <typename T>
struct remove_volatile {
typedef T type;
};
/// std::remove_volatile (volatile specialization)
template <typename T>
struct remove_volatile<volatile T> {
typedef T type;
};
/// std::remove_cv
template <typename T>
struct remove_cv {
typedef typename remove_volatile<typename remove_const<T>::type>::type type;
};
#else
using std::remove_const;
using std::remove_volatile;
using std::remove_cv;
#endif
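// Example (editor's sketch): remove_cv strips both qualifiers in one pass;
// the hypothetical typedef below names the unqualified type.
typedef remove_cv<const volatile int>::type PlainInt;  // PlainInt is int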
//-----------------------------------------------------------------------------
// Type relationships <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::is_same (false specialization)
template <typename A, typename B>
struct is_same : false_type {};
/// std::is_same (true specialization)
template <typename A>
struct is_same<A, A> : true_type {};
/// Helper for std::is_base_of
template <typename BaseT, typename DerivedT>
struct is_base_of_helper {
typedef char (&yes)[1];
typedef char (&no)[2];
template <typename B, typename D>
struct dummy {
CUTLASS_HOST_DEVICE operator B*() const;
CUTLASS_HOST_DEVICE operator D*();
};
template <typename T>
CUTLASS_HOST_DEVICE static yes check(DerivedT*, T);
CUTLASS_HOST_DEVICE static no check(BaseT*, int);
static const bool value = sizeof(check(dummy<BaseT, DerivedT>(), int())) == sizeof(yes);
};
/// std::is_base_of
template <typename BaseT, typename DerivedT>
struct is_base_of
: integral_constant<bool,
(is_base_of_helper<typename remove_cv<BaseT>::type,
typename remove_cv<DerivedT>::type>::value) ||
(is_same<typename remove_cv<BaseT>::type,
typename remove_cv<DerivedT>::type>::value)> {};
#else
using std::is_same;
using std::is_base_of;
#endif
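// Example (editor's sketch, assuming C++11 static_assert): in the fallback,
// the templated check(DerivedT*, T) wins overload resolution only when
// dummy's D* conversion applies, so value reflects inheritance. The types
// below are hypothetical.
struct ExampleBase {};
struct ExampleDerived : ExampleBase {};
static_assert(is_base_of<ExampleBase, ExampleDerived>::value, "derived case");
static_assert(!is_base_of<ExampleDerived, ExampleBase>::value, "reversed case");
static_assert(is_same<int, int>::value && !is_same<int, long>::value, "is_same");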
//-----------------------------------------------------------------------------
// Type properties <type_traits>
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::is_volatile
template <typename T>
struct is_volatile : false_type {};
template <typename T>
struct is_volatile<volatile T> : true_type {};
/// Helper for std::is_pointer (false specialization)
template <typename T>
struct is_pointer_helper : false_type {};
/// Helper for std::is_pointer (true specialization)
template <typename T>
struct is_pointer_helper<T*> : true_type {};
/// std::is_pointer
template <typename T>
struct is_pointer : is_pointer_helper<typename remove_cv<T>::type> {};
/// std::is_void
template <typename T>
struct is_void : is_same<void, typename remove_cv<T>::type> {};
/// std::is_integral
template <typename T>
struct is_integral : false_type {};
template <>
struct is_integral<char> : true_type {};
template <>
struct is_integral<signed char> : true_type {};
template <>
struct is_integral<unsigned char> : true_type {};
template <>
struct is_integral<short> : true_type {};
template <>
struct is_integral<unsigned short> : true_type {};
template <>
struct is_integral<int> : true_type {};
template <>
struct is_integral<unsigned int> : true_type {};
template <>
struct is_integral<long> : true_type {};
template <>
struct is_integral<unsigned long> : true_type {};
template <>
struct is_integral<long long> : true_type {};
template <>
struct is_integral<unsigned long long> : true_type {};
template <typename T>
struct is_integral<volatile T> : is_integral<T> {};
template <typename T>
struct is_integral<const T> : is_integral<T> {};
template <typename T>
struct is_integral<const volatile T> : is_integral<T> {};
/// std::is_floating_point
template <typename T>
struct is_floating_point
: integral_constant<bool,
(is_same<float, typename remove_cv<T>::type>::value ||
is_same<double, typename remove_cv<T>::type>::value)> {};
/// std::is_arithmetic
template <typename T>
struct is_arithmetic
: integral_constant<bool, (is_integral<T>::value || is_floating_point<T>::value)> {};
/// std::is_fundamental
template <typename T>
struct is_fundamental
: integral_constant<bool,
(is_arithmetic<T>::value || is_void<T>::value ||
is_same<nullptr_t, typename remove_cv<T>::type>::value)> {};
#else
using std::is_volatile;
using std::is_pointer;
using std::is_void;
using std::is_integral;
using std::is_floating_point;
using std::is_arithmetic;
using std::is_fundamental;
#endif
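// Example (editor's sketch, assuming C++11 static_assert): the
// classification traits compose and ignore cv-qualification, matching the
// std behavior.
static_assert(is_integral<const volatile long long>::value, "cv ignored");
static_assert(is_floating_point<double>::value && !is_floating_point<int>::value, "fp");
static_assert(is_arithmetic<unsigned short>::value, "integral implies arithmetic");
static_assert(is_pointer<float*>::value && !is_fundamental<float*>::value, "pointers");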
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1800)) || \
(defined(__GNUG__) && (__GNUC__ < 5))
/**
 * std::is_trivially_copyable
 *
 * This implementation evaluates to true only if T is fundamental or a pointer.
 *
 * Without partial template specializations provided by the user for a
 * specific class or struct, this trait will never report that a class or
 * struct is trivially copyable; this is always safe, if possibly sub-optimal.
 */
template <typename T>
struct is_trivially_copyable
: integral_constant<bool, (is_fundamental<T>::value || is_pointer<T>::value)> {};
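// Example (editor's sketch): as the comment above notes, a user can opt a
// known-trivial type into this fallback by specializing the trait. MyPod is
// a hypothetical illustration.
struct MyPod { int x; float y; };
template <>
struct is_trivially_copyable<MyPod> : true_type {};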
#else
using std::is_trivially_copyable;
#endif
//-----------------------------------------------------------------------------
// Alignment and layout utilities
//-----------------------------------------------------------------------------
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1500))
/// std::alignment_of
template <typename value_t>
struct alignment_of {
struct pad {
value_t val;
char byte;
};
enum { value = sizeof(pad) - sizeof(value_t) };
};
#else
template <typename value_t>
struct alignment_of : std::alignment_of<value_t> {};
#endif
/* 16B specializations where 32-bit Win32 host compiler disagrees with device compiler */
template <>
struct alignment_of<int4> {
enum { value = 16 };
};
template <>
struct alignment_of<uint4> {
enum { value = 16 };
};
template <>
struct alignment_of<float4> {
enum { value = 16 };
};
template <>
struct alignment_of<long4> {
enum { value = 16 };
};
template <>
struct alignment_of<ulong4> {
enum { value = 16 };
};
template <>
struct alignment_of<longlong2> {
enum { value = 16 };
};
template <>
struct alignment_of<ulonglong2> {
enum { value = 16 };
};
template <>
struct alignment_of<double2> {
enum { value = 16 };
};
template <>
struct alignment_of<longlong4> {
enum { value = 16 };
};
template <>
struct alignment_of<ulonglong4> {
enum { value = 16 };
};
template <>
struct alignment_of<double4> {
enum { value = 16 };
};
// Specializations for volatile/const qualified types
template <typename value_t>
struct alignment_of<volatile value_t> : alignment_of<value_t> {};
template <typename value_t>
struct alignment_of<const value_t> : alignment_of<value_t> {};
template <typename value_t>
struct alignment_of<const volatile value_t> : alignment_of<value_t> {};
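// Example (editor's sketch, assuming C++11 static_assert): in the fallback,
// pad appends one char to value_t, so sizeof(pad) - sizeof(value_t) is the
// tail padding, which equals the alignment for scalar and vector types.
static_assert(alignment_of<double2>::value == 16, "explicit specialization");
static_assert(alignment_of<const volatile double2>::value == 16, "cv forwards");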
#if (!defined(_MSC_VER) && (__cplusplus < 201103L)) || (defined(_MSC_VER) && (_MSC_VER < 1800))
template <size_t Align>
struct aligned_chunk;
template <>
struct __align__(1) aligned_chunk<1> {
uint8_t buff;
};
template <>
struct __align__(2) aligned_chunk<2> {
uint16_t buff;
};
template <>
struct __align__(4) aligned_chunk<4> {
uint32_t buff;
};
template <>
struct __align__(8) aligned_chunk<8> {
uint32_t buff[2];
};
template <>
struct __align__(16) aligned_chunk<16> {
uint32_t buff[4];
};
template <>
struct __align__(32) aligned_chunk<32> {
uint32_t buff[8];
};
template <>
struct __align__(64) aligned_chunk<64> {
uint32_t buff[16];
};
template <>
struct __align__(128) aligned_chunk<128> {
uint32_t buff[32];
};
template <>
struct __align__(256) aligned_chunk<256> {
uint32_t buff[64];
};
template <>
struct __align__(512) aligned_chunk<512> {
uint32_t buff[128];
};
template <>
struct __align__(1024) aligned_chunk<1024> {
uint32_t buff[256];
};
template <>
struct __align__(2048) aligned_chunk<2048> {
uint32_t buff[512];
};
template <>
struct __align__(4096) aligned_chunk<4096> {
uint32_t buff[1024];
};
/// std::aligned_storage
template <size_t Len, size_t Align>
struct aligned_storage {
typedef aligned_chunk<Align> type[Len / sizeof(aligned_chunk<Align>)];
};
#else
using std::aligned_storage;
#endif
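// Example (editor's sketch): a 256-byte backing store aligned to 16 bytes;
// in the fallback, Len should be a multiple of sizeof(aligned_chunk<Align>).
// The typedef name is hypothetical.
typedef aligned_storage<256, 16>::type Backing16;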
#if !defined(__CUDACC_RTC__)
/// Default deleter
template <typename T>
struct default_delete {
void operator()(T* ptr) const { delete ptr; }
};
/// Partial specialization for deleting array types
template <typename T>
struct default_delete<T[]> {
void operator()(T* ptr) const { delete[] ptr; }
};
/// std::unique_ptr
template <class T, class Deleter = default_delete<T> >
class unique_ptr {
public:
typedef T* pointer;
typedef T element_type;
typedef Deleter deleter_type;
private:
/// Pointer to memory
pointer _ptr;
/// Deleter
deleter_type _deleter;
public:
unique_ptr() : _ptr(nullptr) {}
unique_ptr(pointer p) : _ptr(p) {}
~unique_ptr() {
if (_ptr) {
_deleter(_ptr);
}
}
/// Returns a pointer to the managed object or nullptr if no object is owned.
pointer get() const noexcept { return _ptr; }
/// Releases ownership of the managed object, if any
pointer release() noexcept {
pointer p(_ptr);
_ptr = nullptr;
return p;
}
/// Replaces the managed object, deleting the old object.
void reset(pointer p = pointer()) noexcept {
pointer old_ptr = _ptr;
_ptr = p;
if (old_ptr != nullptr) {
get_deleter()(old_ptr);
}
}
/// Swaps the managed objects of *this and another unique_ptr
void swap(unique_ptr& other) noexcept { std::swap(_ptr, other._ptr); }
/// Returns the deleter object
Deleter& get_deleter() noexcept { return _deleter; }
/// Returns the deleter object
Deleter const& get_deleter() const noexcept { return _deleter; }
/// Checks whether an object is owned
operator bool() const noexcept { return _ptr != nullptr; }
/// Dereferences the unique_ptr
T& operator*() const { return *_ptr; }
/// Returns a pointer to the managed object
pointer operator->() const noexcept { return _ptr; }
/// Array access to managed object
T& operator[](size_t i) const { return _ptr[i]; }
};
/// Specializes the swap algorithm
template <typename T, typename Deleter>
void swap(unique_ptr<T, Deleter>& lhs, unique_ptr<T, Deleter>& rhs) noexcept {
lhs.swap(rhs);
}
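// Example (editor's sketch): host-side usage of the subset implemented
// above. This minimal unique_ptr has no move support, so ownership is
// transferred explicitly via release()/reset(). The function name is
// hypothetical.
inline void unique_ptr_demo() {
  unique_ptr<int> p(new int(42));
  unique_ptr<int> q;
  q.reset(p.release());  // transfer ownership; p is now empty
  bool ok = static_cast<bool>(q) && (*q == 42);
  (void)ok;
}  // q's default_delete frees the int here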
#endif
}; // namespace platform
}; // namespace cutlass

View File

@ -1,229 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Defines a 1D vector of elements held in the registers of each thread.
*/
#pragma once
#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
#include <cuda_fp16.h>
#endif
#include <cutlass/util/platform.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
template <size_t kAlignment_>
struct AlignedStruct {};
template <>
struct __align__(1) AlignedStruct<1>{};
template <>
struct __align__(2) AlignedStruct<2>{};
template <>
struct __align__(4) AlignedStruct<4>{};
template <>
struct __align__(8) AlignedStruct<8>{};
template <>
struct __align__(16) AlignedStruct<16>{};
template <>
struct __align__(32) AlignedStruct<32>{};
template <>
struct __align__(64) AlignedStruct<64>{};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int kLanes_>
union Vector {
/// The scalar type.
typedef Scalar_ Scalar;
/// The number of elements in the vector.
enum { kLanes = kLanes_ };
/// The size of the vector.
enum { kVectorSize = kLanes * (int)sizeof(Scalar) };
/// The number of registers needed to store the vector.
enum { kRegisters = kVectorSize < 4 ? 1 : kVectorSize / 4 };
// Make sure that the vector type makes sense.
static_assert(kVectorSize <= 16, "Vector type is too large");
/// The aligned storage to make sure we have good alignment.
AlignedStruct<kVectorSize> aligned_;
/// The associated array of scalars.
Scalar scalars[kLanes];
/// The data in registers.
uint32_t registers[kRegisters];
/// Accessor to the ith lane.
CUTLASS_DEVICE Scalar const& operator[](uint32_t i) const { return scalars[i]; }
/// Accessor to the ith lane.
CUTLASS_DEVICE Scalar& operator[](uint32_t i) { return scalars[i]; }
};
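// Example (editor's sketch, assuming C++11 static_assert): a 4-lane float
// vector occupies 16 bytes across four 32-bit registers; a 2-lane char
// vector still occupies one whole register.
static_assert(Vector<float, 4>::kVectorSize == 16, "4 lanes x 4 bytes");
static_assert(Vector<float, 4>::kRegisters == 4, "kVectorSize / 4");
static_assert(Vector<char, 2>::kRegisters == 1, "kVectorSize < 4 rounds up to 1");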
////////////////////////////////////////////////////////////////////////////////////////////////////
#if !defined(__CUDACC_RTC__) || defined(CUTLASS_NVRTC_HAS_FP16)
template <int kLanes_>
union Vector<half, kLanes_> {
/// The scalar type.
typedef half Scalar;
/// The number of elements in the vector.
enum { kLanes = kLanes_ };
/// The size of the vector.
enum { kVectorSize = kLanes * (int)sizeof(Scalar) };
/// The number of registers needed to store the vector.
enum { kRegisters = kVectorSize < 4 ? 1 : kVectorSize / 4 };
// Make sure that the vector type makes sense.
static_assert(kVectorSize <= 16, "Vector type is too large");
/// The aligned storage to make sure we have good alignment.
AlignedStruct<kVectorSize> aligned_;
/// The associated array of scalars.
uint16_t scalars[kLanes];
/// The data in registers.
uint32_t registers[kRegisters];
/// Accessor to the ith lane.
CUTLASS_DEVICE Scalar const& operator[](uint32_t i) const {
return reinterpret_cast<Scalar const&>(scalars[i]);
}
/// Accessor to the ith lane.
CUTLASS_DEVICE Scalar& operator[](uint32_t i) { return reinterpret_cast<Scalar&>(scalars[i]); }
};
#endif
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_>
CUTLASS_DEVICE void make_zero(Scalar_& x) {
x = Scalar_(0);
}
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Element_, int kLanes_ = 1>
struct Vectorize {
typedef Vector<Element_, kLanes_> Type;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Element_>
struct Vectorize<Element_, 1> {
typedef Element_ Type;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
template <typename Scalar_, int kLanes_>
CUTLASS_DEVICE void make_zero(Vector<Scalar_, kLanes_>& vec) {
for (int i = 0; i < Vector<Scalar_, kLanes_>::kRegisters; ++i) {
vec.registers[i] = 0;
}
}
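// Example (editor's device-side sketch): make_zero dispatches to the
// register loop for Vector and to scalar assignment otherwise; Vectorize
// picks the matching type. The function name is hypothetical.
CUTLASS_DEVICE void make_zero_demo() {
  Vectorize<float, 4>::Type v;  // Vector<float, 4>
  make_zero(v);                 // clears all kRegisters 32-bit registers
  Vectorize<float, 1>::Type s;  // plain float
  make_zero(s);                 // scalar overload: s = float(0)
}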
////////////////////////////////////////////////////////////////////////////////////////////////////
//
// cutlass::Extent similar to std::extent but applicable to CUTLASS types
//
/// Returns the extent of a scalar or vector
template <typename T>
struct Extent {
static size_t const kValue = 1;
};
/// Returns the number of lanes of a vector
template <typename T, int Lanes>
struct Extent<Vector<T, Lanes> > {
static size_t const kValue = Lanes;
};
/// Returns the number of lanes of a const-qualified vector
template <typename T, int Lanes>
struct Extent<Vector<T, Lanes> const> {
static size_t const kValue = Lanes;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Traits describing properties of vectors and scalar-as-vectors
template <typename T>
struct VectorTraits {
/// Scalar type
typedef T Scalar;
/// Number of lanes of vector
static int const kLanes = 1;
/// True if the type is actually a cutlass::Vector, otherwise false
static bool const IsVector = false;
/// Type that is always a vector
typedef Vector<T, 1> Vector;
};
/// Partial specialization for actual cutlass::Vector
template <typename T, int Lanes>
struct VectorTraits<Vector<T, Lanes> > {
/// Scalar type
typedef T Scalar;
/// Number of lanes of vector
static int const kLanes = Lanes;
/// Type is actually a cutlass::Vector
static bool const IsVector = true;
/// Type that is always a Vector
typedef Vector<T, Lanes> Vector;
};
/// Partial specialization for const-qualified cutlass::Vector
template <typename T, int Lanes>
struct VectorTraits<Vector<T, Lanes> const> {
/// Scalar type
typedef T Scalar;
/// Number of lanes of vector
static int const kLanes = Lanes;
/// Type is actually a cutlass::Vector
static bool const IsVector = true;
/// Type that is always a Vector
typedef Vector<T, Lanes> Vector;
};
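// Example (editor's sketch, assuming C++11 static_assert): Extent and
// VectorTraits let generic code treat scalars and vectors uniformly.
static_assert(Extent<float>::kValue == 1, "scalars have extent 1");
static_assert(Extent<Vector<float, 2> >::kValue == 2, "vectors report lanes");
static_assert(!VectorTraits<float>::IsVector, "scalar case");
static_assert(VectorTraits<Vector<float, 2> >::kLanes == 2, "vector case");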
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass

View File

@ -1,193 +0,0 @@
/***************************************************************************************************
* Copyright (c) 2017-2018, NVIDIA CORPORATION. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without modification, are permitted
* provided that the following conditions are met:
* * Redistributions of source code must retain the above copyright notice, this list of
* conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright notice, this list of
* conditions and the following disclaimer in the documentation and/or other materials
* provided with the distribution.
* * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
* to endorse or promote products derived from this software without specific prior written
* permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
* BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
* STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
**************************************************************************************************/
/*! \file
\brief Abstractions for loading and storing matrices using the CUDA WMMA API.
*/
#pragma once
#if defined(__CUDACC__) && (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 700)
// Dependent header files should use the following macro to guard all code using
// nvcuda::wmma:: to enable compilation for CUDA Compute Capabilities < sm_70.
// Earlier architectures do not support Tensor Cores.
#define CUTLASS_USE_WMMA_API
#include "stdio.h"
#include <crt/mma.h>
#include <cutlass/fragment.h>
#include <cutlass/load_store.h>
#include <cutlass/matrix_traits.h>
#include <cutlass/shape.h>
#include <cutlass/vector.h>
namespace cutlass {
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Statically maps cutlass::MatrixLayout => nvcuda::wmma layout tags
template <MatrixLayout::Kind kLayout_>
struct WmmaLayout {
typedef nvcuda::wmma::col_major Layout;
};
/// Statically maps cutlass::MatrixLayout => nvcuda::wmma layout tags
template <>
struct WmmaLayout<MatrixLayout::kRowMajor> {
typedef nvcuda::wmma::row_major Layout;
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to nvcuda::wmma fragment load and store operations
template <GemmOperand::Kind kOperand_,
MatrixLayout::Kind kLayout_,
typename Scalar_,
typename WmmaShape_>
struct WmmaMatrix {};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to nvcuda::wmma fragment accessors for A operand
template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
struct WmmaMatrix<GemmOperand::kA, kLayout_, Scalar_, WmmaShape_>
: public nvcuda::wmma::fragment<
/// The nvcuda::wmma operand name.
nvcuda::wmma::matrix_a,
/// The dimensions.
WmmaShape_::kW,
WmmaShape_::kH,
WmmaShape_::kD,
/// The scalar.
Scalar_,
/// The layout.
typename WmmaLayout<kLayout_>::Layout> {
/// This type.
typedef WmmaMatrix<GemmOperand::kA, kLayout_, Scalar_, WmmaShape_> This_;
/// Fill-in the element.
CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
nvcuda::wmma::fill_fragment(*this, x);
return *this;
}
/// Load from memory.
CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
nvcuda::wmma::load_matrix_sync(*this, pointer, stride);
}
/// Store to memory.
CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
nvcuda::wmma::store_matrix_sync(pointer, *this, stride);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to nvcuda::wmma fragment accessors for B operand
template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
struct WmmaMatrix<GemmOperand::kB, kLayout_, Scalar_, WmmaShape_>
: public nvcuda::wmma::fragment<
/// The nvcuda::wmma operand name.
nvcuda::wmma::matrix_b,
/// The dimensions.
WmmaShape_::kW,
WmmaShape_::kH,
WmmaShape_::kD,
/// The scalar.
Scalar_,
/// The layout.
typename WmmaLayout<kLayout_>::Layout> {
/// This type.
typedef WmmaMatrix<GemmOperand::kB, kLayout_, Scalar_, WmmaShape_> This_;
/// Fill-in the element.
CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
nvcuda::wmma::fill_fragment(*this, x);
return *this;
}
/// Load from memory.
CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
nvcuda::wmma::load_matrix_sync(*this, pointer, stride);
}
/// Store to memory.
CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
nvcuda::wmma::store_matrix_sync(pointer, *this, stride);
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////
/// Adapter to nvcuda::wmma fragment accessors for C operand
template <MatrixLayout::Kind kLayout_, typename Scalar_, typename WmmaShape_>
struct WmmaMatrix<GemmOperand::kC, kLayout_, Scalar_, WmmaShape_>
: public nvcuda::wmma::fragment<
/// The nvcuda::wmma operand name.
nvcuda::wmma::accumulator,
/// The dimensions.
WmmaShape_::kW,
WmmaShape_::kH,
WmmaShape_::kD,
/// The scalar.
Scalar_> {
/// This type.
typedef WmmaMatrix<GemmOperand::kC, kLayout_, Scalar_, WmmaShape_> This_;
/// The layout.
static MatrixLayout::Kind const kLayout = kLayout_;
/// Fill-in the element.
CUTLASS_DEVICE This_& operator=(Scalar_ const& x) {
nvcuda::wmma::fill_fragment(*this, x);
return *this;
}
/// Load from memory.
CUTLASS_DEVICE void load(Scalar_ const* pointer, int const stride) {
bool const kIsRowMajor = kLayout == MatrixLayout::kRowMajor;
nvcuda::wmma::load_matrix_sync(
*this,
pointer,
stride,
kIsRowMajor ? nvcuda::wmma::mem_row_major : nvcuda::wmma::mem_col_major);
}
/// Store to memory.
CUTLASS_DEVICE void store(Scalar_* pointer, int const stride) const {
bool const kIsRowMajor = kLayout == MatrixLayout::kRowMajor;
nvcuda::wmma::store_matrix_sync(
pointer,
*this,
stride,
kIsRowMajor ? nvcuda::wmma::mem_row_major : nvcuda::wmma::mem_col_major);
}
};
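// Example (editor's device-side sketch): one warp computing a 16x16x16 HMMA
// tile through the adapters above. WmmaTile is a hypothetical shape type
// exposing the kW/kH/kD extents the adapters read; pointers and leading
// dimensions are assumed to satisfy WMMA's alignment requirements.
struct WmmaTile {
  enum { kW = 16, kH = 16, kD = 16 };
};
CUTLASS_DEVICE void wmma_demo(half const* A, half const* B, float* C,
                              int lda, int ldb, int ldc) {
  WmmaMatrix<GemmOperand::kA, MatrixLayout::kColumnMajor, half, WmmaTile> a;
  WmmaMatrix<GemmOperand::kB, MatrixLayout::kColumnMajor, half, WmmaTile> b;
  WmmaMatrix<GemmOperand::kC, MatrixLayout::kColumnMajor, float, WmmaTile> c;
  c = 0.0f;                            // fill_fragment via operator=
  a.load(A, lda);                      // load_matrix_sync
  b.load(B, ldb);
  nvcuda::wmma::mma_sync(c, a, b, c);  // adapters are nvcuda::wmma fragments
  c.store(C, ldc);                     // store_matrix_sync with layout flag
}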
////////////////////////////////////////////////////////////////////////////////////////////////////
} // namespace cutlass
#endif // defined CUTLASS_USE_WMMA_API

1
docs/_config.yml Normal file
View File

@ -0,0 +1 @@
theme: jekyll-theme-minimal

View File

@ -0,0 +1,145 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: aligned_buffer.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">aligned_buffer.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="cutlass_8h_source.html">cutlass/cutlass.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for aligned_buffer.h:</div>
<div class="dyncontent">
<div class="center"><img src="aligned__buffer_8h__incl.png" border="0" usemap="#aligned__buffer_8h" alt=""/></div>
<map name="aligned__buffer_8h" id="aligned__buffer_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="aligned__buffer_8h__dep__incl.png" border="0" usemap="#aligned__buffer_8hdep" alt=""/></div>
<map name="aligned__buffer_8hdep" id="aligned__buffer_8hdep">
</map>
</div>
</div>
<p><a href="aligned__buffer_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1AlignedBuffer.html">cutlass::AlignedBuffer&lt; T, N, Align &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment. <a href="structcutlass_1_1AlignedBuffer.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>

View File

@ -0,0 +1 @@
6cbc6b81ede44b5f08afd4f4519d56d1

View File

@ -0,0 +1 @@
b26c62930ff7668b89f2ee6624e0be3a

File diff suppressed because one or more lines are too long

867
docs/annotated.html Normal file
View File

@ -0,0 +1,867 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: Class List</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li class="current"><a href="annotated.html"><span>Classes</span></a></li>
<li><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li class="current"><a href="annotated.html"><span>Class&#160;List</span></a></li>
<li><a href="classes.html"><span>Class&#160;Index</span></a></li>
<li><a href="inherits.html"><span>Class&#160;Hierarchy</span></a></li>
<li><a href="functions.html"><span>Class&#160;Members</span></a></li>
</ul>
</div>
</div><!-- top -->
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div class="header">
<div class="headertitle">
<div class="title">Class List</div> </div>
</div><!--header-->
<div class="contents">
<div class="textblock">Here are the classes, structs, unions and interfaces with brief descriptions:</div><div class="directory">
<div class="levels">[detail level <span onclick="javascript:toggleLevel(1);">1</span><span onclick="javascript:toggleLevel(2);">2</span><span onclick="javascript:toggleLevel(3);">3</span><span onclick="javascript:toggleLevel(4);">4</span><span onclick="javascript:toggleLevel(5);">5</span><span onclick="javascript:toggleLevel(6);">6</span>]</div><table class="directory">
<tr id="row_0_" class="even"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_0_" class="arrow" onclick="toggleFolder('0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass.html" target="_self">cutlass</a></td><td class="desc"></td></tr>
<tr id="row_0_0_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_0_" class="arrow" onclick="toggleFolder('0_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1arch.html" target="_self">arch</a></td><td class="desc"></td></tr>
<tr id="row_0_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation </td></tr>
<tr id="row_0_0_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_012_01_4_00_011_00_01int16__t8c4bac365710598317a69c489f7239db.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 2 &gt;, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_014_01_4_00_011_00_01int8__t_a1ef6624fc8c10126f17f4ee88283d72.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 1, 4 &gt;, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_012_00_011_01_4_00_011_00_01half__t_f3dc2e59f857ada163d1e0781ea8f391.html" target="_self">Mma&lt; gemm::GemmShape&lt; 1, 2, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_0116_00_014_01_4_00_0132_00_01half_0bcc4d05f9811035f08cc1b7f0154a4d.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 16, 4 &gt;, 32, half_t, LayoutA, half_t, LayoutB, ElementC, LayoutC, Operator &gt;</a></td><td class="desc">Matrix multiply-add operation specialized for the entire warp </td></tr>
<tr id="row_0_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__02a3f19a78995f97d793a668e0e4d4f0.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_0116_00_018_00_018_01_4_00_0132_00_01half__96363097c47b056f0ca1911afd7f8b7a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 16, 8, 8 &gt;, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation - F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_011_00_011_01_4_00_011_00_01half__t_8cf78649807b93684f3d431bfa34ee28.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_ccde11d1bbbdab3702772ce44eb9729a.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_012_00_012_00_011_01_4_00_011_00_01half__t_c07cc6439298fa5486a719e577be2538.html" target="_self">Mma&lt; gemm::GemmShape&lt; 2, 2, 1 &gt;, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_01128_01_4_00_0132_00_01uint15918972b95027764b3a849b03075ed2b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 128 &gt;, 32, uint1b_t, layout::RowMajor, uint1b_t, layout::ColumnMajor, int, layout::RowMajor, OpXorPopc &gt;</a></td><td class="desc">Matrix multiply-add operation </td></tr>
<tr id="row_0_0_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__927179f46017ea5f58f859f1196c4829.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
<tr id="row_0_0_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__8ebae0cbdf333fddfe5c24d35ebe8e02.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * S8 + S32 </td></tr>
<tr id="row_0_0_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__5299c9c90c8f2f521be0c8cec1c3eb08.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01int8__f083347e265b1e9eea5572d86ddb6bf9.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_a62aa63a212985df306fb27e8a50aeae.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
<tr id="row_0_0_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_ab741d81fdc991345cb9e43c29fca573.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U8 * S8 + S32 </td></tr>
<tr id="row_0_0_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_5221708cec5828d35db1d1c47cb4964e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0116_01_4_00_0132_00_01uint8_bef0c048bc0f8ba2d875cb7ab26d363b.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 16 &gt;, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S8 * U8 + S32 </td></tr>
<tr id="row_0_0_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_6e513ccbc44ae7909a60d93b9b5435b3.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
<tr id="row_0_0_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_0ee08a4520882d24ba9026879265e892.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * S4 + S32 </td></tr>
<tr id="row_0_0_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_4746fc55e614df0016c518d3fda2677e.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
<tr id="row_0_0_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01int4b_546e9ec6de6a5970b326da6f6280f1d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = S4 * U4 + S32 </td></tr>
<tr id="row_0_0_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b03e3b50dbcb30d0d1ac062f3a9d5abef.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
<tr id="row_0_0_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b6d968039dde5c9f062ab15f90a8049fe.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * S4 + S32 </td></tr>
<tr id="row_0_0_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4bc4b6ba004e25c44bfd9266c61f937dfb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
<tr id="row_0_0_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_0132_01_4_00_0132_00_01uint4b451d5cf5d7e8cbbe476afe3dab5c09b2.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 32 &gt;, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate &gt;</a></td><td class="desc">Matrix multiply-add operation: S32 = U4 * U4 + S32 </td></tr>
<tr id="row_0_0_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_b0242d7a01097510effbc4718040d3e5.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_c7f88bfd32a544fba8111d2dcadeab11.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_44a3b2a8df88a2b067f1284515cb5371.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_4b7308177b308a272c1889fbe9670275.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_5a9888862cebd333ecaf11f7262f77d4.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_31defda8ea2b7d855642ffd77da1a411.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_839a7c8bb938d1661f4611e68f85d8cb.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F32 = F16 * F16 + F32 </td></tr>
<tr id="row_0_0_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_018_00_018_00_014_01_4_00_018_00_01half__t_73d9802d6b944a5299bc255887db6bbc.html" target="_self">Mma&lt; gemm::GemmShape&lt; 8, 8, 4 &gt;, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd &gt;</a></td><td class="desc">Matrix multiply-add operation: F16 = F16 * F16 + F16 </td></tr>
<tr id="row_0_0_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmma.html" target="_self">PtxWmma</a></td><td class="desc">WMMA Matrix multiply-add operation </td></tr>
<tr id="row_0_0_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadA.html" target="_self">PtxWmmaLoadA</a></td><td class="desc">WMMA PTX string load for A, B, and C matrices </td></tr>
<tr id="row_0_0_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadB.html" target="_self">PtxWmmaLoadB</a></td><td class="desc"></td></tr>
<tr id="row_0_0_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaLoadC.html" target="_self">PtxWmmaLoadC</a></td><td class="desc"></td></tr>
<tr id="row_0_0_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1PtxWmmaStoreD.html" target="_self">PtxWmmaStoreD</a></td><td class="desc">WMMA store for matrix D </td></tr>
<tr id="row_0_0_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm50.html" target="_self">Sm50</a></td><td class="desc"></td></tr>
<tr id="row_0_0_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm60.html" target="_self">Sm60</a></td><td class="desc"></td></tr>
<tr id="row_0_0_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm61.html" target="_self">Sm61</a></td><td class="desc"></td></tr>
<tr id="row_0_0_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm70.html" target="_self">Sm70</a></td><td class="desc"></td></tr>
<tr id="row_0_0_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm72.html" target="_self">Sm72</a></td><td class="desc"></td></tr>
<tr id="row_0_0_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Sm75.html" target="_self">Sm75</a></td><td class="desc"></td></tr>
<tr id="row_0_0_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1half__t_00_01LayoutA___00_01cutlass_1_84e30c8cc93eeb7ca02f651bd16d4c38.html" target="_self">Wmma&lt; Shape_, cutlass::half_t, LayoutA_, cutlass::half_t, LayoutB_, ElementC_, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1int4b__t_00_01LayoutA___00_01cutlass_16fd808a90b3cf9d7cfc99f30888ca3fe.html" target="_self">Wmma&lt; Shape_, cutlass::int4b_t, LayoutA_, cutlass::int4b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01cutlass_1_1uint1b__t_00_01LayoutA___00_01cutlass_c80a7ea4d219cd9b13b560b493338028.html" target="_self">Wmma&lt; Shape_, cutlass::uint1b_t, LayoutA_, cutlass::uint1b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpXorPopc &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01int8__t_00_01LayoutA___00_01int8__t_00_01LayoutB_505c57bb6818a941dc16f00cf35a9ec0.html" target="_self">Wmma&lt; Shape_, int8_t, LayoutA_, int8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_0_61_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1arch_1_1Wmma_3_01Shape___00_01uint8__t_00_01LayoutA___00_01uint8__t_00_01Layout219a464a1248ebfc37aa29bcb10cb1b0.html" target="_self">Wmma&lt; Shape_, uint8_t, LayoutA_, uint8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_1_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_1_" class="arrow" onclick="toggleFolder('0_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1device__memory.html" target="_self">device_memory</a></td><td class="desc"></td></tr>
<tr id="row_0_1_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_1_0_" class="arrow" onclick="toggleFolder('0_1_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation.html" target="_self">allocation</a></td><td class="desc">Device allocation abstraction that tracks size and capacity </td></tr>
<tr id="row_0_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1device__memory_1_1allocation_1_1deleter.html" target="_self">deleter</a></td><td class="desc">Delete functor for CUDA device memory </td></tr>
<tr id="row_0_2_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_2_" class="arrow" onclick="toggleFolder('0_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue.html" target="_self">epilogue</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_0_" class="arrow" onclick="toggleFolder('0_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_0_" class="arrow" onclick="toggleFolder('0_2_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1Convert.html" target="_self">Convert</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1Convert_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_1_" class="arrow" onclick="toggleFolder('0_2_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombination.html" target="_self">LinearCombination</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_1_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombination_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_2_" class="arrow" onclick="toggleFolder('0_2_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp.html" target="_self">LinearCombinationClamp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationClamp_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_3_" class="arrow" onclick="toggleFolder('0_2_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu.html" target="_self">LinearCombinationRelu</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_4_" class="arrow" onclick="toggleFolder('0_2_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_01int_00_01float_00_01Round_01_4.html" target="_self">LinearCombinationRelu&lt; ElementOutput_, Count, int, float, Round &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1LinearCombinationRelu_3_01ElementOutput___00_01Count_00_00274a94522c46cd041d0b10d484e2ef3.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_0_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_0_5_" class="arrow" onclick="toggleFolder('0_2_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus.html" target="_self">ReductionOpPlus</a></td><td class="desc"></td></tr>
<tr id="row_0_2_0_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1thread_1_1ReductionOpPlus_1_1Params.html" target="_self">Params</a></td><td class="desc">Host-constructable parameters structure </td></tr>
<tr id="row_0_2_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_1_" class="arrow" onclick="toggleFolder('0_2_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_" class="arrow" onclick="toggleFolder('0_2_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1threadblock_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" target="_self">RowArrangement</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> determines how one or more warps cover a region of consecutive rows </td></tr>
<tr id="row_0_2_1_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini91159e6f7e123d881e3ec45101fa4f81.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, false &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 1D tiled arrangement </td></tr>
<tr id="row_0_2_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_2_1_0_2_" class="arrow" onclick="toggleFolder('0_2_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemaini6d8790249bf12cac580da73bb37eb791.html" target="_self">RowArrangement&lt; Shape, WarpsRemaining, ElementsPerAccess, ElementSize, true &gt;</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement.html" title="RowArrangement determines how one or more warps cover a region of consecutive rows. ">RowArrangement</a> in which each warp's access is a 2D tiled arrangement </td></tr>
<tr id="row_0_2_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1detail_1_1RowArrangement_3_01Shape_00_01WarpsRemainief28e98b3f284469f271d28aba73de2e.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueComplexTensorOp.html" target="_self">DefaultEpilogueComplexTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueSimt.html" target="_self">DefaultEpilogueSimt</a></td><td class="desc">Defines sensible defaults for epilogues for SimtOps </td></tr>
<tr id="row_0_2_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueTensorOp.html" target="_self">DefaultEpilogueTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueVoltaTensorOp.html" target="_self">DefaultEpilogueVoltaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for TensorOps </td></tr>
<tr id="row_0_2_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultEpilogueWmmaTensorOp.html" target="_self">DefaultEpilogueWmmaTensorOp</a></td><td class="desc">Defines sensible defaults for epilogues for WMMA TensorOps </td></tr>
<tr id="row_0_2_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedEpilogueTensorOp.html" target="_self">DefaultInterleavedEpilogueTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_7_" class="arrow" onclick="toggleFolder('0_2_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp.html" target="_self">DefaultInterleavedThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultInterleavedThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_8_" class="arrow" onclick="toggleFolder('0_2_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt.html" target="_self">DefaultThreadMapSimt</a></td><td class="desc">Defines the optimal thread map for SIMT accumulator layouts </td></tr>
<tr id="row_0_2_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapSimt_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_9_" class="arrow" onclick="toggleFolder('0_2_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp.html" target="_self">DefaultThreadMapTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp.html" target="_self">DefaultThreadMapVoltaTensorOp</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_11_" class="arrow" onclick="toggleFolder('0_2_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__95db04b7b72e34283958bd7fbf851d16.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, float &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__52116c60c62f0fd520071558e42b814f.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_12_" class="arrow" onclick="toggleFolder('0_2_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__d58c94abc36b7c5c109b55202c6992e7.html" target="_self">DefaultThreadMapVoltaTensorOp&lt; ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, half_t &gt;</a></td><td class="desc">Defines the optimal thread map for TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapVoltaTensorOp_3_01ThreadblockShape__4433cc988100e98097a748d2670fb0fc.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_13_" class="arrow" onclick="toggleFolder('0_2_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp.html" target="_self">DefaultThreadMapWmmaTensorOp</a></td><td class="desc">Defines the optimal thread map for Wmma TensorOp accumulator layouts </td></tr>
<tr id="row_0_2_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DefaultThreadMapWmmaTensorOp_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_14_" class="arrow" onclick="toggleFolder('0_2_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp.html" target="_self">DirectEpilogueTensorOp</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator </td></tr>
<tr id="row_0_2_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure for host-constructible state </td></tr>
<tr id="row_0_2_1_14_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1DirectEpilogueTensorOp_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" target="_self">Epilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
<tr id="row_0_2_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_16_" class="arrow" onclick="toggleFolder('0_2_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase.html" target="_self">EpilogueBase</a></td><td class="desc">Base class for epilogues defining warp-level </td></tr>
<tr id="row_0_2_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1EpilogueBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_17_" class="arrow" onclick="toggleFolder('0_2_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue.html" target="_self">InterleavedEpilogue</a></td><td class="desc"><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1Epilogue.html" title="Epilogue operator without splitk. ">Epilogue</a> operator without splitk </td></tr>
<tr id="row_0_2_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedEpilogue_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_2_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_18_" class="arrow" onclick="toggleFolder('0_2_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap.html" target="_self">InterleavedOutputTileThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedOutputTileThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_19_" class="arrow" onclick="toggleFolder('0_2_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator.html" target="_self">InterleavedPredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
<tr id="row_0_2_1_19_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1InterleavedPredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_20_" class="arrow" onclick="toggleFolder('0_2_1_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap.html" target="_self">OutputTileOptimalThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1CompactedThreadMap.html" target="_self">CompactedThreadMap</a></td><td class="desc">Compacted thread map in which the 4D region is contiguous </td></tr>
<tr id="row_0_2_1_20_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileOptimalThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileShape.html" target="_self">OutputTileShape</a></td><td class="desc">Tuple defining point in output tile </td></tr>
<tr id="row_0_2_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1OutputTileThreadMap.html" target="_self">OutputTileThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_1_23_" class="arrow" onclick="toggleFolder('0_2_1_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" target="_self">Mask</a></td><td class="desc"><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Mask.html" title="Mask object. ">Mask</a> object </td></tr>
<tr id="row_0_2_1_23_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1threadblock_1_1PredicatedTileIterator_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_2_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1threadblock_1_1SharedLoadIterator.html" target="_self">SharedLoadIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_2_" class="arrow" onclick="toggleFolder('0_2_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1epilogue_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp.html" target="_self">FragmentIteratorComplexTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorComplexTensorOp_3_01WarpShape___00_01Operato8cf03c624cf3210c71b7cbd580b080f8.html" target="_self">FragmentIteratorComplexTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt.html" target="_self">FragmentIteratorSimt</a></td><td class="desc">Fragment iterator for SIMT accumulator arrangements </td></tr>
<tr id="row_0_2_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorSimt_3_01WarpShape___00_01Operator___00_01la3f2abc523201c1b0228df99119ab88e1.html" target="_self">FragmentIteratorSimt&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp.html" target="_self">FragmentIteratorTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_e459aab140a2ce78336e584f95886726.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Dedicated to interleaved layout </td></tr>
<tr id="row_0_2_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorTensorOp_3_01WarpShape___00_01OperatorShape_5e78dabe303f20d76b00c600aab61eda.html" target="_self">FragmentIteratorTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp.html" target="_self">FragmentIteratorVoltaTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gdb805a2dc5571ac3b66e0fe6ffdcede2.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1G16e08718cffa0989cce3fe8dbc4b075b.html" target="_self">FragmentIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp.html" target="_self">FragmentIteratorWmmaTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1FragmentIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShfdb1f120c6797383663f9fd11d0fc599.html" target="_self">FragmentIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major shared memory </td></tr>
<tr id="row_0_2_2_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy.html" target="_self">SimtPolicy</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1SimtPolicy_3_01WarpShape___00_01Operator___00_01layout_1_1Rcef1c60e23e997017ae176c92931151d.html" target="_self">SimtPolicy&lt; WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy.html" target="_self">TensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
<tr id="row_0_2_2_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout69549d10c3610d943987eb90e827bc05.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::ColumnMajorInterleaved&lt; InterleavedK &gt; &gt;</a></td><td class="desc">Partial specialization for column-major-interleaved </td></tr>
<tr id="row_0_2_2_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TensorOpPolicy_3_01WarpShape_00_01OperatorShape_00_01layout_1_1RowMajor_01_4.html" target="_self">TensorOpPolicy&lt; WarpShape, OperatorShape, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt.html" target="_self">TileIteratorSimt</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorSimt_3_01WarpShape___00_01Operator___00_01Elemenf2bd262ed3e202b25d5802d83965bf3b.html" target="_self">TileIteratorSimt&lt; WarpShape_, Operator_, Element_, layout::RowMajor, MmaSimtPolicy_ &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp.html" target="_self">TileIteratorTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_20_" class="arrow" onclick="toggleFolder('0_2_2_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___003cbb32beb84b4984cb7853662096d289.html" target="_self">TileIteratorTensorOp&lt; WarpShape_, OperatorShape_, Element_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorTensorOp_3_01WarpShape___00_01OperatorShape___05f11e023c9e6ee5f7a888fa4c5bbf6d1.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp.html" target="_self">TileIteratorVoltaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_22_" class="arrow" onclick="toggleFolder('0_2_2_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmS2fe0c60b727c738c622c18fc3dd76644.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemm770cbca45441d295d5d7433e8222a700.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_2_2_23_" class="arrow" onclick="toggleFolder('0_2_2_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1GemmSa0ceeeddc22575876eb977da7f5416a8.html" target="_self">TileIteratorVoltaTensorOp&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1TileIteratorVoltaTensorOp_3_01WarpShape___00_01gemm_1_1Gemmffcab2297c8de8d0013602a39c525b78.html" target="_self">Detail</a></td><td class="desc"></td></tr>
<tr id="row_0_2_2_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp.html" target="_self">TileIteratorWmmaTensorOp</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1warp_1_1TileIteratorWmmaTensorOp_3_01WarpShape___00_01OperatorShape_fd6a91cd8bbd07ecd1344326b830e3a4.html" target="_self">TileIteratorWmmaTensorOp&lt; WarpShape_, OperatorShape_, OperatorFragment_, layout::RowMajor &gt;</a></td><td class="desc">Template for reading and writing tiles of accumulators to shared memory </td></tr>
<tr id="row_0_2_2_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy.html" target="_self">VoltaTensorOpPolicy</a></td><td class="desc">Policy details related to the epilogue </td></tr>
<tr id="row_0_2_2_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_136ce744d4c1c6e8707f5a9785196194.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, float, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_2_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1warp_1_1VoltaTensorOpPolicy_3_01WarpShape___00_01gemm_1_1GemmShape_1d48185f49e4d066f8e9327bf0856b7f.html" target="_self">VoltaTensorOpPolicy&lt; WarpShape_, gemm::GemmShape&lt; 32, 32, 4 &gt;, half_t, layout::RowMajor &gt;</a></td><td class="desc">Partial specialization for row-major </td></tr>
<tr id="row_0_2_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_2_3_" class="arrow" onclick="toggleFolder('0_2_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1epilogue_1_1EpilogueWorkspace.html" target="_self">EpilogueWorkspace</a></td><td class="desc"></td></tr>
<tr id="row_0_2_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_2_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1epilogue_1_1EpilogueWorkspace_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage allocation needed by the epilogue </td></tr>
<tr id="row_0_3_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_3_" class="arrow" onclick="toggleFolder('0_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm.html" target="_self">gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_0_" class="arrow" onclick="toggleFolder('0_3_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration.html" target="_self">DefaultGemmConfiguration</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag286687c5e6abe22d241f789fe344a465.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassSimt_00_01ArchTag3026e48abb8c905d1cc6d13d669700e4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassSimt, ArchTag, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc567cad318a31d04b70ea615d6321decd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm70, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcde61af9be1337dac1fdb210e7e7a6e01.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc485a4f0b5a7d2d4ab2c1a24da6328048.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8e2604a56dff3a7595da9ee0604ae55e.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc4fada4957d463c80a2831e47f28157c4.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arc8ab5fd2693c6a6ec43e447acb07f784c.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, int8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcffcf31256aed23d4d8d0eab627bc0cad.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, int4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb2e258b7bd321c633dd65d3ebcf6414a.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint4b_t, uint4b_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcb27bf218007928652d5b803193eab473.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, int8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassTensorOp_00_01arcfea0f3503156e8e3fba6456f0cedafdd.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassTensorOp, arch::Sm75, uint8_t, uint8_t, ElementC, int32_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1DefaultGemmConfiguration_3_01arch_1_1OpClassWmmaTensorOp_00_0884059ecad03bea3e86c4cf722226097.html" target="_self">DefaultGemmConfiguration&lt; arch::OpClassWmmaTensorOp, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_14_" class="arrow" onclick="toggleFolder('0_3_0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_15_" class="arrow" onclick="toggleFolder('0_3_0_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html" target="_self">Gemm&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layou1b211cc9c97c022d8fe10f2dd32c8709.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_16_" class="arrow" onclick="toggleFolder('0_3_0_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_17_" class="arrow" onclick="toggleFolder('0_3_0_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html" target="_self">GemmBatched&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_213d78696663f4231cd52c6a277c60e5.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_18_" class="arrow" onclick="toggleFolder('0_3_0_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex.html" target="_self">GemmComplex</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_19_" class="arrow" onclick="toggleFolder('0_3_0_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html" target="_self">GemmComplex&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial &gt;</a></td><td class="desc">Partial specialization for column-major output exchanges problem size and operand </td></tr>
<tr id="row_0_3_0_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_a3923967cafb5cb9774c320dc24baa77.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_20_" class="arrow" onclick="toggleFolder('0_3_0_20_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_0_20_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_1_1Arguments.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_0_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_0_21_" class="arrow" onclick="toggleFolder('0_3_0_21_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01ElementBbe7c1f7154ad5b5bf9d4d28301e2b457.html" target="_self">GemmSplitKParallel&lt; ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ConvertScaledOp_, ReductionOp_, ThreadblockSwizzle_, Stages, kAlignmentA, kAlignmentB, Operator_ &gt;</a></td><td class="desc">Partial specialization for column-major output </td></tr>
<tr id="row_0_3_0_21_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1device_1_1GemmSplitKParallel_3_01ElementA___00_01LayoutA___00_01Elementafcb1aeaf2035a7ac769d7acc233423b.html" target="_self">Arguments</a></td><td class="desc">Argument structure </td></tr>
<tr id="row_0_3_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_1_" class="arrow" onclick="toggleFolder('0_3_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_0_" class="arrow" onclick="toggleFolder('0_3_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1detail_1_1GemvBatchedStridedEpilogueScaling.html" target="_self">GemvBatchedStridedEpilogueScaling</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm.html" target="_self">DefaultGemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01layout_1_1ColumnMajorInterleave661fe54d13cc2c9153dcdf31e4beaa30.html" target="_self">DefaultGemm&lt; ElementA, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, kAlignmentA, ElementB, layout::RowMajorInterleaved&lt; InterleavedK &gt;, kAlignmentB, ElementC, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, int32_t, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, IsBetaZero &gt;</a></td><td class="desc">Partial specialization for Turing Integer Matrix Multiply Interleaved layout </td></tr>
<tr id="row_0_3_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01Edd80343e6570718ed237122e4ebf7fb5.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 1 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for SIMT </td></tr>
<tr id="row_0_3_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E044b039b2fe402f29b04a9f5feee5342.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm70, ThreadblockShape, WarpShape, GemmShape&lt; 8, 8, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Volta architecture </td></tr>
<tr id="row_0_3_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00_01E5d78d37a9ae2ec08d7d477d571df036e.html" target="_self">DefaultGemm&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator &gt;</a></td><td class="desc">Partial specialization for Turing Architecture </td></tr>
<tr id="row_0_3_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemm_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_01inf48440732c1c5f42ddbfaba179861815.html" target="_self">DefaultGemm&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementC, LayoutC, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, false &gt;</a></td><td class="desc">Partial specialization for SIMT DP4A </td></tr>
<tr id="row_0_3_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemmSplitKParallel.html" target="_self">DefaultGemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1DefaultGemv.html" target="_self">DefaultGemv</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_9_" class="arrow" onclick="toggleFolder('0_3_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1Gemm_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_9_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1Gemm_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_1_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_10_" class="arrow" onclick="toggleFolder('0_3_1_10_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched.html" target="_self">GemmBatched</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_10_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_10_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmBatched_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_1_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_1_11_" class="arrow" onclick="toggleFolder('0_3_1_11_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel.html" target="_self">GemmSplitKParallel</a></td><td class="desc"></td></tr>
<tr id="row_0_3_1_11_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_3_1_11_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="unioncutlass_1_1gemm_1_1kernel_1_1GemmSplitKParallel_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared memory storage structure </td></tr>
<tr id="row_0_3_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_2_" class="arrow" onclick="toggleFolder('0_3_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_2_0_" class="arrow" onclick="toggleFolder('0_3_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1thread_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1EnableMma__Crow__SM60.html" target="_self">EnableMma_Crow_SM60</a></td><td class="desc">Determines whether to enable thread::Gemm&lt;&gt; specializations compatible with SM50 </td></tr>
<tr id="row_0_3_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2.html" target="_self">Mma_HFMA2</a></td><td class="desc">Structure to compute the matrix product for HFMA </td></tr>
<tr id="row_0_3_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_72621f7ab9ae4a4ba4fe9725cf8e89c1.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_94c813e3bbfb6f9857c155166f772687.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_17070298bc4cced0a1b98aee2bb6b455.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1ColumnMajor_00_bf6d29bb09a025e7b96942809743e28a.html" target="_self">Mma_HFMA2&lt; Shape, layout::ColumnMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l26a133b13650c1d058273e3649f60f04.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01lbba3a796be96a0276693ef6b259ecc4a.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::ColumnMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l2aa4d2fd2e940e0d0cf7c47bc8f6017c.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::ColumnMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01layout_1_1RowMajor_00_01l086c058a15d6c79558e4f3d9ff1dc148.html" target="_self">Mma_HFMA2&lt; Shape, layout::RowMajor, layout::RowMajor, layout::RowMajor, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_0e1104c65871c539155bd3a0c7631928b.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::ColumnMajor, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1detail_1_1Mma__HFMA2_3_01Shape_00_01LayoutA_00_01LayoutB_00_07ac147cb320ee0d28ff8e78eb4cd330e.html" target="_self">Mma_HFMA2&lt; Shape, LayoutA, LayoutB, layout::RowMajor, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma.html" target="_self">Mma</a></td><td class="desc">Structure to compute the matrix product </td></tr>
<tr id="row_0_3_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01ElementA___00_01LayoutA___00_01ElementB_e41c1cd6078b6d1347fac239b0639d56.html" target="_self">Mma&lt; Shape_, ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for FFMA and DFMA GEMM </td></tr>
<tr id="row_0_3_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA_00_01half__t_00_01L066c9d2371712cdf0cac099ca9bcc578.html" target="_self">Mma&lt; Shape_, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Structure to compute the matrix product </td></tr>
<tr id="row_0_3_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01half__t_00_01LayoutA___00_01half__t_00_088f0e99e501b6012297eb30b4e89bcea.html" target="_self">Mma&lt; Shape_, half_t, LayoutA_, half_t, LayoutB_, half_t, layout::RowMajor, arch::OpMultiplyAdd, typename platform::enable_if&lt; detail::EnableMma_Crow_SM60&lt; LayoutA_, LayoutB_ &gt;::value &gt;::type &gt;</a></td><td class="desc">Computes matrix product when C is row-major </td></tr>
<tr id="row_0_3_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1ColumnMajor_00_013f3785e722edc6e9aab6f866309b8623.html" target="_self">Mma&lt; Shape_, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, int8_t &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for IDP4A </td></tr>
<tr id="row_0_3_2_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1Mma_3_01Shape___00_01int8__t_00_01layout_1_1RowMajor_00_01int89c659e7faf47264972bdba6cd80f42b.html" target="_self">Mma&lt; Shape_, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, bool &gt;</a></td><td class="desc">Gemplate that handles conventional layouts for IDP4A </td></tr>
<tr id="row_0_3_2_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1thread_1_1MmaGeneric.html" target="_self">MmaGeneric</a></td><td class="desc">Gemplate that handles all packed matrix layouts </td></tr>
<tr id="row_0_3_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_3_" class="arrow" onclick="toggleFolder('0_3_3_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultGemvCore.html" target="_self">DefaultGemvCore</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma.html" target="_self">DefaultMma</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_0010764e1fd5a3251a57eddafbd83eab8e.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, true &gt;</a></td><td class="desc">Specialization for column-major-interleaved output </td></tr>
<tr id="row_0_3_3_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00c67c16f9881e4f2fda76d8ed83ebabd6.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass Simt) </td></tr>
<tr id="row_0_3_3_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01ElementA_00_01LayoutA_00_01kAlignmentA_00ce36642cae579bce6605ff8edde3c6ab.html" target="_self">DefaultMma&lt; ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false &gt;</a></td><td class="desc">Specialization for row-major output (OperatorClass Simt) </td></tr>
<tr id="row_0_3_3_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMma_3_01int8__t_00_01LayoutA_00_01kAlignmentA_00_07e7230d4011ada5e22cfcb29103b696.html" target="_self">DefaultMma&lt; int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape&lt; 1, 1, 4 &gt;, 2, Operator, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore.html" target="_self">DefaultMmaCore</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab94a11a77dd0565102710907089acee0.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShafafd5c61db86cbfe90863578ddd11092.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha46446d1e3871e31d2e728f710d78c8c1.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_, &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha8da7a0cfbbe859b701fdd9f2b8566aa7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha84e9f8afb6a4ca9f5dcd219b182d16e7.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 1 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha2c0d0b7cdb5c4bcb11e83c058eb65345.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha34a52cc7b2942e8c290f0032b6779b52.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_14_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaaf312aafe9da92ea9d417bcc12a8e7dc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_15_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha863d4139ccaa713bc4bde32c425f4067.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 1, 1, 4 &gt;, int8_t, layout::RowMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ &gt;</a></td><td class="desc">Partial specialization: </td></tr>
<tr id="row_0_3_3_16_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShaf03a122202ad10acdc96f280106d678b.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha69bef08ea63dd930f99d9788105873dd.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmSha3adf608332a8c9ee7014fced0da8a9ca.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01GemmShab7edfba3cdf43a07e3c4d719d87565a4.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, GemmShape&lt; 8, 8, 4 &gt;, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc803d38bc1e4618c07c47f54c87ae2678.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instrucf60fe02fcdd80d28b7fd419133465dcc.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc2bf00737f4ad0a9da9a8be6d3e66c152.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, ElementB_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_, AccumulatorsInRowMajor &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc24092ddc01fc83dabb7db4c14880fe60.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1DefaultMmaCore_3_01Shape___00_01WarpShape___00_01Instruc4fee9f2965b8468bfb42b94a74527d22.html" target="_self">DefaultMmaCore&lt; Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmBatchedIdentityThreadblockSwizzle.html" target="_self">GemmBatchedIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMMs </td></tr>
<tr id="row_0_3_3_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmHorizontalThreadblockSwizzle.html" target="_self">GemmHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
<tr id="row_0_3_3_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmIdentityThreadblockSwizzle.html" target="_self">GemmIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for GEMMs </td></tr>
<tr id="row_0_3_3_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKHorizontalThreadblockSwizzle.html" target="_self">GemmSplitKHorizontalThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
<tr id="row_0_3_3_29_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemmSplitKIdentityThreadblockSwizzle.html" target="_self">GemmSplitKIdentityThreadblockSwizzle</a></td><td class="desc">Threadblock swizzling function for split-K GEMMs </td></tr>
<tr id="row_0_3_3_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1Gemv.html" target="_self">Gemv</a></td><td class="desc">Structure to compute the matrix-vector product using SIMT math instructions </td></tr>
<tr id="row_0_3_3_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1GemvBatchedStridedThreadblockDefaultSwizzle.html" target="_self">GemvBatchedStridedThreadblockDefaultSwizzle</a></td><td class="desc">Threadblock swizzling function for batched GEMVs </td></tr>
<tr id="row_0_3_3_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_3_32_" class="arrow" onclick="toggleFolder('0_3_3_32_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase.html" target="_self">MmaBase</a></td><td class="desc"></td></tr>
<tr id="row_0_3_3_32_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaBase_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc">Shared storage object needed by threadblock-scoped GEMM </td></tr>
<tr id="row_0_3_3_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaPipelined.html" target="_self">MmaPipelined</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_3_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1threadblock_1_1MmaPolicy.html" target="_self">MmaPolicy</a></td><td class="desc">Policy object describing MmaTensorOp </td></tr>
<tr id="row_0_3_3_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1threadblock_1_1MmaSingleStage.html" target="_self">MmaSingleStage</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_3_4_" class="arrow" onclick="toggleFolder('0_3_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1gemm_1_1warp.html" target="_self">warp</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1DefaultMmaTensorOp.html" target="_self">DefaultMmaTensorOp</a></td><td class="desc">Partial specialization for m-by-n-by-kgroup </td></tr>
<tr id="row_0_3_4_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp.html" target="_self">MmaComplexTensorOp</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaComplexTensorOp_3_01Shape___00_01complex_3_01RealElementA_01_146441010dad1f40eb51b6dae3ded216.html" target="_self">MmaComplexTensorOp&lt; Shape_, complex&lt; RealElementA &gt;, LayoutA_, complex&lt; RealElementB &gt;, LayoutB_, complex&lt; RealElementC &gt;, LayoutC_, Policy_, TransformA, TransformB, Enable &gt;</a></td><td class="desc">Partial specialization for complex*complex+complex =&gt; complex using real-valued TensorOps </td></tr>
<tr id="row_0_3_4_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimt.html" target="_self">MmaSimt</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaSimtPolicy.html" target="_self">MmaSimtPolicy</a></td><td class="desc">Describes the arrangement and configuration of per-lane operations in warp-level matrix multiply </td></tr>
<tr id="row_0_3_4_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator.html" target="_self">MmaSimtTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_67ca7e11a38e38f2c51b84767654a90f.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kA_00_01Element_f0ce904a9294556f15e1cc9cf7c99a93.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kA, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ea0a4e7ce3cd5d25cabf79383efdf4d9.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajor, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kB_00_01Element_ada156b62fcbdce47009c5bf1321c92c.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kB, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, Policy_, PartitionsK, PartitionGroupSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_4ccafbc821b3a55cd532602442a74031.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::ColumnMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaSimtTileIterator_3_01Shape___00_01Operand_1_1kC_00_01Element_8f92ea79e85febb67169c4b2d94b1b20.html" target="_self">MmaSimtTileIterator&lt; Shape_, Operand::kC, Element_, layout::RowMajor, Policy_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_12_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOp.html" target="_self">MmaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_13_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator.html" target="_self">MmaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_14_" class="arrow" onclick="toggleFolder('0_3_4_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___008f607b871a2b3d854eb4def64712c042.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___0d35fa5dc4e4b4f72784c943fd857fc1d.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_15_" class="arrow" onclick="toggleFolder('0_3_4_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___00027dabdc144edd6276f664ca74088510.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::ColumnMajorInterleaved&lt; InterleavedN &gt;, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___03822d9be37f3725022005a5434441f22.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_16_" class="arrow" onclick="toggleFolder('0_3_4_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___006c39f57875e0aa9d0ad82c8043ed8b98.html" target="_self">MmaTensorOpAccumulatorTileIterator&lt; Shape_, Element_, cutlass::layout::RowMajor, InstructionShape_, OpDelta_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpAccumulatorTileIterator_3_01Shape___00_01Element___093b5d2838ac5a742704ef62b5c8688f0.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_17_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator.html" target="_self">MmaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_18_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0b84f53cd44b339eccc12067c9f86e11c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_19_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0e52ad425e1ee3e68544873f66733237b.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___039819fb3ccd43786d556c2c9669508ef.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0352e0dcab42bc8360606874e00173556.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_22_" class="arrow" onclick="toggleFolder('0_3_4_22_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0ed7daaeba1c095e77f68533d4d2c475c.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, 64 &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_22_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___07638f8b7761f6e2e2e6918e2c05e739.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_23_" class="arrow" onclick="toggleFolder('0_3_4_23_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0c7d419c589d601ce4eb603be566fea21.html" target="_self">MmaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, InstructionShape_, OpDelta_, 32, PartitionsK_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_23_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand___0784c74bd670999ec23ad8ef9dc55777.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaTensorOpPolicy.html" target="_self">MmaTensorOpPolicy</a></td><td class="desc">Policy </td></tr>
<tr id="row_0_3_4_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOp.html" target="_self">MmaVoltaTensorOp</a></td><td class="desc">Structure to compute the matrix product targeting CUDA cores and SIMT math instructions </td></tr>
<tr id="row_0_3_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_26_" class="arrow" onclick="toggleFolder('0_3_4_26_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator.html" target="_self">MmaVoltaTensorOpAccumulatorTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_26_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpAccumulatorTileIterator_1_1Policy.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_28_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan0d3248553e52cd61ed8a2b3b12a20343.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_29_" class="arrow" onclick="toggleFolder('0_3_4_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan34be8e21a40af3ebd2dc3dff460dca72.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kA, Element_, cutlass::layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera33cdf53848564e894d4407637dc86caf.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operand734577b7e54a074d143aba59828c2f2.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_31_" class="arrow" onclick="toggleFolder('0_3_4_31_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan16c56cdc2dda5eeb996af8ec0242d501.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand::kB, Element_, cutlass::layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_31_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Opera6fa6d2d3725bb3ec613d5c527ea3ffe7.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operan5a221944f4a0e16ccab77ba684856942.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operandcc9821c435540895138bc9af495f321.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_3_4_34_" class="arrow" onclick="toggleFolder('0_3_4_34_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operana2f40b28f0d2286b84d86f7238d67b52.html" target="_self">MmaVoltaTensorOpMultiplicandTileIterator&lt; Shape_, Operand_, Element_, cutlass::layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, KBlock &gt;, InstructionShape_, OpDelta_, 32 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_3_4_34_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1MmaVoltaTensorOpMultiplicandTileIterator_3_01Shape___00_01Operafa294175b280756dd8388f9ffe7b72c4.html" target="_self">Policy</a></td><td class="desc">Internal structure of iterator - made public to enable introspection </td></tr>
<tr id="row_0_3_4_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1warp_1_1WarpSize.html" target="_self">WarpSize</a></td><td class="desc">Query the number of threads per warp </td></tr>
<tr id="row_0_3_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1BatchedGemmCoord.html" target="_self">BatchedGemmCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_3_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmCoord.html" target="_self">GemmCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_3_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1gemm_1_1GemmShape.html" target="_self">GemmShape</a></td><td class="desc">Shape of a matrix multiply-add operation </td></tr>
<tr id="row_0_4_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_4_" class="arrow" onclick="toggleFolder('0_4_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1layout.html" target="_self">layout</a></td><td class="desc"></td></tr>
<tr id="row_0_4_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1ColumnMajor.html" target="_self">ColumnMajor</a></td><td class="desc">Mapping function for column-major matrices </td></tr>
<tr id="row_0_4_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorBlockLinear.html" target="_self">ColumnMajorBlockLinear</a></td><td class="desc"></td></tr>
<tr id="row_0_4_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorInterleaved.html" target="_self">ColumnMajorInterleaved</a></td><td class="desc"></td></tr>
<tr id="row_0_4_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a column-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ColumnMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">ColumnMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1ContiguousMatrix.html" target="_self">ContiguousMatrix</a></td><td class="desc"></td></tr>
<tr id="row_0_4_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1GeneralMatrix.html" target="_self">GeneralMatrix</a></td><td class="desc"></td></tr>
<tr id="row_0_4_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose.html" target="_self">LayoutTranspose</a></td><td class="desc">Defines transposes of matrix layouts </td></tr>
<tr id="row_0_4_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1ColumnMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::ColumnMajor &gt;</a></td><td class="desc">Transpose of column-major is row-major </td></tr>
<tr id="row_0_4_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1LayoutTranspose_3_01layout_1_1RowMajor_01_4.html" target="_self">LayoutTranspose&lt; layout::RowMajor &gt;</a></td><td class="desc">Transpose of row-major is column-major </td></tr>
<tr id="row_0_4_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PackedVectorLayout.html" target="_self">PackedVectorLayout</a></td><td class="desc">Tensor layout for densely packed vectors </td></tr>
<tr id="row_0_4_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1PitchLinear.html" target="_self">PitchLinear</a></td><td class="desc">Mapping function for pitch-linear memory </td></tr>
<tr id="row_0_4_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearCoord.html" target="_self">PitchLinearCoord</a></td><td class="desc">Coordinate in pitch-linear space </td></tr>
<tr id="row_0_4_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1PitchLinearShape.html" target="_self">PitchLinearShape</a></td><td class="desc">Template defining a shape used by pitch-linear operators </td></tr>
<tr id="row_0_4_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1RowMajor.html" target="_self">RowMajor</a></td><td class="desc">Mapping function for row-major matrices </td></tr>
<tr id="row_0_4_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorBlockLinear.html" target="_self">RowMajorBlockLinear</a></td><td class="desc"></td></tr>
<tr id="row_0_4_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorInterleaved.html" target="_self">RowMajorInterleaved</a></td><td class="desc"></td></tr>
<tr id="row_0_4_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCongruous.html" target="_self">RowMajorTensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorTensorOpMultiplicandCrosswise.html" target="_self">RowMajorTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandBCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCongruous.html" target="_self">RowMajorVoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template mapping a row-major view of pitch-linear memory to <a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" title="Template based on element size (in bits) - defined in terms of pitch-linear memory. ">VoltaTensorOpMultiplicandCongruous</a> </td></tr>
<tr id="row_0_4_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1RowMajorVoltaTensorOpMultiplicandCrosswise.html" target="_self">RowMajorVoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorCxRSKx.html" target="_self">TensorCxRSKx</a></td><td class="desc">Mapping function for 4-D CxRSKx tensors </td></tr>
<tr id="row_0_4_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCHW.html" target="_self">TensorNCHW</a></td><td class="desc">Mapping function for 4-D NCHW tensors </td></tr>
<tr id="row_0_4_27_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNCxHWx.html" target="_self">TensorNCxHWx</a></td><td class="desc">Mapping function for 4-D NC/xHWx tensors </td></tr>
<tr id="row_0_4_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1layout_1_1TensorNHWC.html" target="_self">TensorNHWC</a></td><td class="desc">Mapping function for 4-D NHWC tensors </td></tr>
<tr id="row_0_4_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicand.html" target="_self">TensorOpMultiplicand</a></td><td class="desc"></td></tr>
<tr id="row_0_4_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandColumnMajorInterleaved.html" target="_self">TensorOpMultiplicandColumnMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous.html" target="_self">TensorOpMultiplicandCongruous</a></td><td class="desc"></td></tr>
<tr id="row_0_4_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCongruous_3_0132_00_01Crosswise_01_4.html" target="_self">TensorOpMultiplicandCongruous&lt; 32, Crosswise &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_4_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandCrosswise.html" target="_self">TensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_4_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1TensorOpMultiplicandRowMajorInterleaved.html" target="_self">TensorOpMultiplicandRowMajorInterleaved</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandBCongruous.html" target="_self">VoltaTensorOpMultiplicandBCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCongruous.html" target="_self">VoltaTensorOpMultiplicandCongruous</a></td><td class="desc">Template based on element size (in bits) - defined in terms of pitch-linear memory </td></tr>
<tr id="row_0_4_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1layout_1_1VoltaTensorOpMultiplicandCrosswise.html" target="_self">VoltaTensorOpMultiplicandCrosswise</a></td><td class="desc"></td></tr>
<tr id="row_0_5_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_5_" class="arrow" onclick="toggleFolder('0_5_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1library.html" target="_self">library</a></td><td class="desc"></td></tr>
<tr id="row_0_5_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArguments.html" target="_self">GemmArguments</a></td><td class="desc">Arguments for GEMM </td></tr>
<tr id="row_0_5_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayArguments.html" target="_self">GemmArrayArguments</a></td><td class="desc">Arguments for GEMM - used by all the GEMM operations </td></tr>
<tr id="row_0_5_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmArrayConfiguration.html" target="_self">GemmArrayConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
<tr id="row_0_5_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmBatchedConfiguration.html" target="_self">GemmBatchedConfiguration</a></td><td class="desc">Configuration for batched GEMM in which multiple matrix products are computed </td></tr>
<tr id="row_0_5_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmConfiguration.html" target="_self">GemmConfiguration</a></td><td class="desc">Configuration for basic GEMM operations </td></tr>
<tr id="row_0_5_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmDescription.html" target="_self">GemmDescription</a></td><td class="desc">Description of all GEMM computations </td></tr>
<tr id="row_0_5_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexBatchedConfiguration.html" target="_self">GemmPlanarComplexBatchedConfiguration</a></td><td class="desc">Batched complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
<tr id="row_0_5_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1GemmPlanarComplexConfiguration.html" target="_self">GemmPlanarComplexConfiguration</a></td><td class="desc">Complex valued GEMM in which real and imaginary parts are separated by a stride </td></tr>
<tr id="row_0_5_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Manifest.html" target="_self">Manifest</a></td><td class="desc"><a class="el" href="classcutlass_1_1library_1_1Manifest.html" title="Manifest of CUTLASS Library. ">Manifest</a> of CUTLASS Library </td></tr>
<tr id="row_0_5_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1MathInstructionDescription.html" target="_self">MathInstructionDescription</a></td><td class="desc"></td></tr>
<tr id="row_0_5_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1library_1_1Operation.html" target="_self">Operation</a></td><td class="desc">Base class for all device-wide operations </td></tr>
<tr id="row_0_5_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1OperationDescription.html" target="_self">OperationDescription</a></td><td class="desc">High-level description of an operation </td></tr>
<tr id="row_0_5_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TensorDescription.html" target="_self">TensorDescription</a></td><td class="desc">Structure describing the properties of a tensor </td></tr>
<tr id="row_0_5_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1library_1_1TileDescription.html" target="_self">TileDescription</a></td><td class="desc">Structure describing the tiled structure of a GEMM-like computation </td></tr>
<tr id="row_0_6_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_6_" class="arrow" onclick="toggleFolder('0_6_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1platform.html" target="_self">platform</a></td><td class="desc"></td></tr>
<tr id="row_0_6_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__chunk.html" target="_self">aligned_chunk</a></td><td class="desc"></td></tr>
<tr id="row_0_6_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1aligned__storage.html" target="_self">aligned_storage</a></td><td class="desc">Std::aligned_storage </td></tr>
<tr id="row_0_6_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_2_" class="arrow" onclick="toggleFolder('0_6_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of.html" target="_self">alignment_of</a></td><td class="desc">Std::alignment_of </td></tr>
<tr id="row_0_6_2_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_1_1pad.html" target="_self">pad</a></td><td class="desc"></td></tr>
<tr id="row_0_6_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01value__t_01_4.html" target="_self">alignment_of&lt; const value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01const_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; const volatile value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double2_01_4.html" target="_self">alignment_of&lt; double2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01double4_01_4.html" target="_self">alignment_of&lt; double4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01float4_01_4.html" target="_self">alignment_of&lt; float4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01int4_01_4.html" target="_self">alignment_of&lt; int4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01long4_01_4.html" target="_self">alignment_of&lt; long4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong2_01_4.html" target="_self">alignment_of&lt; longlong2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01longlong4_01_4.html" target="_self">alignment_of&lt; longlong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01uint4_01_4.html" target="_self">alignment_of&lt; uint4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulong4_01_4.html" target="_self">alignment_of&lt; ulong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong2_01_4.html" target="_self">alignment_of&lt; ulonglong2 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01ulonglong4_01_4.html" target="_self">alignment_of&lt; ulonglong4 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1alignment__of_3_01volatile_01value__t_01_4.html" target="_self">alignment_of&lt; volatile value_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1bool__constant.html" target="_self">bool_constant</a></td><td class="desc">Std::bool_constant </td></tr>
<tr id="row_0_6_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional.html" target="_self">conditional</a></td><td class="desc">Std::conditional (true specialization) </td></tr>
<tr id="row_0_6_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1conditional_3_01false_00_01T_00_01F_01_4.html" target="_self">conditional&lt; false, T, F &gt;</a></td><td class="desc">Std::conditional (false specialization) </td></tr>
<tr id="row_0_6_20_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete.html" target="_self">default_delete</a></td><td class="desc">Default deleter </td></tr>
<tr id="row_0_6_21_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1default__delete_3_01T[]_4.html" target="_self">default_delete&lt; T[]&gt;</a></td><td class="desc">Partial specialization for deleting array types </td></tr>
<tr id="row_0_6_22_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if.html" target="_self">enable_if</a></td><td class="desc">Std::enable_if (true specialization) </td></tr>
<tr id="row_0_6_23_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1enable__if_3_01false_00_01T_01_4.html" target="_self">enable_if&lt; false, T &gt;</a></td><td class="desc">Std::enable_if (false specialization) </td></tr>
<tr id="row_0_6_24_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1integral__constant.html" target="_self">integral_constant</a></td><td class="desc">Std::integral_constant </td></tr>
<tr id="row_0_6_25_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__arithmetic.html" target="_self">is_arithmetic</a></td><td class="desc">Std::is_arithmetic </td></tr>
<tr id="row_0_6_26_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of.html" target="_self">is_base_of</a></td><td class="desc">Std::is_base_of </td></tr>
<tr id="row_0_6_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_6_27_" class="arrow" onclick="toggleFolder('0_6_27_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper.html" target="_self">is_base_of_helper</a></td><td class="desc">Helper for std::is_base_of </td></tr>
<tr id="row_0_6_27_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__base__of__helper_1_1dummy.html" target="_self">dummy</a></td><td class="desc"></td></tr>
<tr id="row_0_6_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__floating__point.html" target="_self">is_floating_point</a></td><td class="desc">Std::is_floating_point </td></tr>
<tr id="row_0_6_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__fundamental.html" target="_self">is_fundamental</a></td><td class="desc">Std::is_fundamental </td></tr>
<tr id="row_0_6_30_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral.html" target="_self">is_integral</a></td><td class="desc">Std::is_integral </td></tr>
<tr id="row_0_6_31_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01char_01_4.html" target="_self">is_integral&lt; char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_32_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01T_01_4.html" target="_self">is_integral&lt; const T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_33_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01const_01volatile_01T_01_4.html" target="_self">is_integral&lt; const volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_34_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01int_01_4.html" target="_self">is_integral&lt; int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_35_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01_4.html" target="_self">is_integral&lt; long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_36_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01long_01long_01_4.html" target="_self">is_integral&lt; long long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_37_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01short_01_4.html" target="_self">is_integral&lt; short &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_38_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01signed_01char_01_4.html" target="_self">is_integral&lt; signed char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_39_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01char_01_4.html" target="_self">is_integral&lt; unsigned char &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_40_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01int_01_4.html" target="_self">is_integral&lt; unsigned int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_41_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01_4.html" target="_self">is_integral&lt; unsigned long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_42_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01long_01long_01_4.html" target="_self">is_integral&lt; unsigned long long &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_43_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01unsigned_01short_01_4.html" target="_self">is_integral&lt; unsigned short &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_44_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__integral_3_01volatile_01T_01_4.html" target="_self">is_integral&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_45_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer.html" target="_self">is_pointer</a></td><td class="desc">Std::is_pointer </td></tr>
<tr id="row_0_6_46_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper.html" target="_self">is_pointer_helper</a></td><td class="desc">Helper for std::is_pointer (false specialization) </td></tr>
<tr id="row_0_6_47_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__pointer__helper_3_01T_01_5_01_4.html" target="_self">is_pointer_helper&lt; T * &gt;</a></td><td class="desc">Helper for std::is_pointer (true specialization) </td></tr>
<tr id="row_0_6_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same.html" target="_self">is_same</a></td><td class="desc">Std::is_same (false specialization) </td></tr>
<tr id="row_0_6_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__same_3_01A_00_01A_01_4.html" target="_self">is_same&lt; A, A &gt;</a></td><td class="desc">Std::is_same (true specialization) </td></tr>
<tr id="row_0_6_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__trivially__copyable.html" target="_self">is_trivially_copyable</a></td><td class="desc"></td></tr>
<tr id="row_0_6_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__void.html" target="_self">is_void</a></td><td class="desc">Std::is_void </td></tr>
<tr id="row_0_6_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile.html" target="_self">is_volatile</a></td><td class="desc">Std::is_volatile </td></tr>
<tr id="row_0_6_53_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1is__volatile_3_01volatile_01T_01_4.html" target="_self">is_volatile&lt; volatile T &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_6_54_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1nullptr__t.html" target="_self">nullptr_t</a></td><td class="desc">Std::nullptr_t </td></tr>
<tr id="row_0_6_55_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const.html" target="_self">remove_const</a></td><td class="desc">Std::remove_const (non-const specialization) </td></tr>
<tr id="row_0_6_56_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__const_3_01const_01T_01_4.html" target="_self">remove_const&lt; const T &gt;</a></td><td class="desc">Std::remove_const (const specialization) </td></tr>
<tr id="row_0_6_57_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__cv.html" target="_self">remove_cv</a></td><td class="desc">Std::remove_cv </td></tr>
<tr id="row_0_6_58_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile.html" target="_self">remove_volatile</a></td><td class="desc">Std::remove_volatile (non-volatile specialization) </td></tr>
<tr id="row_0_6_59_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1platform_1_1remove__volatile_3_01volatile_01T_01_4.html" target="_self">remove_volatile&lt; volatile T &gt;</a></td><td class="desc">Std::remove_volatile (volatile specialization) </td></tr>
<tr id="row_0_6_60_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1platform_1_1unique__ptr.html" target="_self">unique_ptr</a></td><td class="desc">Std::unique_ptr </td></tr>
<tr id="row_0_7_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_7_" class="arrow" onclick="toggleFolder('0_7_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction.html" target="_self">reduction</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_0_" class="arrow" onclick="toggleFolder('0_7_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_0_0_" class="arrow" onclick="toggleFolder('0_7_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK.html" target="_self">ReduceSplitK</a></td><td class="desc"></td></tr>
<tr id="row_0_7_0_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" target="_self">Params</a></td><td class="desc"><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1Params.html" title="Params structure. ">Params</a> structure </td></tr>
<tr id="row_0_7_0_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1kernel_1_1ReduceSplitK_1_1SharedStorage.html" target="_self">SharedStorage</a></td><td class="desc"></td></tr>
<tr id="row_0_7_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_1_" class="arrow" onclick="toggleFolder('0_7_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reduction_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_7_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" target="_self">Reduce</a></td><td class="desc">Structure to compute the thread level reduction </td></tr>
<tr id="row_0_7_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01AlignedArray_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, AlignedArray&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specializations of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for AlignedArray&lt;half_t, N&gt; </td></tr>
<tr id="row_0_7_1_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01half__t_01_4_00_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; half_t &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Partial specializations of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;half_t, N&gt; </td></tr>
<tr id="row_0_7_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Partial specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for Array&lt;T, N&gt; </td></tr>
<tr id="row_0_7_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce_3_01plus_3_01T_01_4_00_01T_01_4.html" target="_self">Reduce&lt; plus&lt; T &gt;, T &gt;</a></td><td class="desc">Partial Specialization of <a class="el" href="structcutlass_1_1reduction_1_1thread_1_1Reduce.html" title="Structure to compute the thread level reduction. ">Reduce</a> for "plus" (a functional operator) </td></tr>
<tr id="row_0_7_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_7_1_5_" class="arrow" onclick="toggleFolder('0_7_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd.html" target="_self">ReduceAdd</a></td><td class="desc">Mixed-precision reduction </td></tr>
<tr id="row_0_7_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1thread_1_1ReduceAdd_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_7_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReduction.html" target="_self">BatchedReduction</a></td><td class="desc"></td></tr>
<tr id="row_0_7_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_7_3_" class="arrow" onclick="toggleFolder('0_7_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits.html" target="_self">BatchedReductionTraits</a></td><td class="desc"></td></tr>
<tr id="row_0_7_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1BatchedReductionTraits_1_1Params.html" target="_self">Params</a></td><td class="desc"></td></tr>
<tr id="row_0_7_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reduction_1_1DefaultBlockSwizzle.html" target="_self">DefaultBlockSwizzle</a></td><td class="desc"></td></tr>
<tr id="row_0_8_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_8_" class="arrow" onclick="toggleFolder('0_8_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference.html" target="_self">reference</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_0_" class="arrow" onclick="toggleFolder('0_8_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast.html" target="_self">Cast</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01int8__t_01_4.html" target="_self">Cast&lt; float, int8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_8_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1detail_1_1Cast_3_01float_00_01uint8__t_01_4.html" target="_self">Cast&lt; float, uint8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_1_" class="arrow" onclick="toggleFolder('0_8_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device.html" target="_self">device</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_" class="arrow" onclick="toggleFolder('0_8_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1detail.html" target="_self">detail</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_0_" class="arrow" onclick="toggleFolder('0_8_1_0_0_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_0_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_1_" class="arrow" onclick="toggleFolder('0_8_1_0_1_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_1_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1RandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_2_" class="arrow" onclick="toggleFolder('0_8_1_0_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc.html" target="_self">TensorCopyDiagonalInFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_2_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalInFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_3_" class="arrow" onclick="toggleFolder('0_8_1_0_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc.html" target="_self">TensorCopyDiagonalOutFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_3_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorCopyDiagonalOutFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_4_" class="arrow" onclick="toggleFolder('0_8_1_0_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_4_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_5_" class="arrow" onclick="toggleFolder('0_8_1_0_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_5_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillLinearFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_6_" class="arrow" onclick="toggleFolder('0_8_1_0_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc.html" target="_self">TensorFillRandomGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_6_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomGaussianFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_7_" class="arrow" onclick="toggleFolder('0_8_1_0_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_7_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorFillRandomUniformFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_8_" class="arrow" onclick="toggleFolder('0_8_1_0_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc.html" target="_self">TensorUpdateDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_8_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_0_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_0_9_" class="arrow" onclick="toggleFolder('0_8_1_0_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_1_0_9_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1detail_1_1TensorUpdateOffDiagonalFunc_1_1Params.html" target="_self">Params</a></td><td class="desc">Parameters structure </td></tr>
<tr id="row_0_8_1_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_" class="arrow" onclick="toggleFolder('0_8_1_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel.html" target="_self">kernel</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span id="arr_0_8_1_1_0_" class="arrow" onclick="toggleFolder('0_8_1_1_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1kernel_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
<tr id="row_0_8_1_1_0_0_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_1_1_0_1_" style="display:none;"><td class="entry"><span style="width:96px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1kernel_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_1_2_" class="arrow" onclick="toggleFolder('0_8_1_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1device_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1thread_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc">Thread-level blocked general matrix product </td></tr>
<tr id="row_0_8_1_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_8_1_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout4e016ab7cfc644acd7cb4ae770339773.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
<tr id="row_0_8_1_6_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout30b72addd464a2ca4a26785cbfd77a8e.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
<tr id="row_0_8_1_7_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01Layout660562b232f408218828ca5915b7e73a.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc &gt;</a></td><td class="desc">Partial specialization for XOR-popc </td></tr>
<tr id="row_0_8_1_8_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorDiagonalForEach.html" target="_self">TensorDiagonalForEach</a></td><td class="desc">Launches a kernel calling a functor for each element along a tensor's diagonal </td></tr>
<tr id="row_0_8_1_9_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1device_1_1TensorForEach.html" target="_self">TensorForEach</a></td><td class="desc">Launches a kernel calling a functor for each element in a tensor's index space </td></tr>
<tr id="row_0_8_2_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_8_2_" class="arrow" onclick="toggleFolder('0_8_2_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host.html" target="_self">host</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_8_2_0_" class="arrow" onclick="toggleFolder('0_8_2_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1reference_1_1host_1_1detail.html" target="_self">detail</a></td><td class="desc">Defines several helpers </td></tr>
<tr id="row_0_8_2_0_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc.html" target="_self">RandomGaussianFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomGaussianFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomGaussianFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
<tr id="row_0_8_2_0_2_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc.html" target="_self">RandomUniformFunc</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_0_3_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1RandomUniformFunc_3_01complex_3_01Element_01_4_01_4.html" target="_self">RandomUniformFunc&lt; complex&lt; Element &gt; &gt;</a></td><td class="desc">Partial specialization for initializing a complex value </td></tr>
<tr id="row_0_8_2_0_4_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorContainsFunc.html" target="_self">TensorContainsFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_5_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorCopyIf.html" target="_self">TensorCopyIf</a></td><td class="desc">Helper to conditionally copy between tensor views </td></tr>
<tr id="row_0_8_2_0_6_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorEqualsFunc.html" target="_self">TensorEqualsFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_7_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillDiagonalFunc.html" target="_self">TensorFillDiagonalFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_8_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillFunc.html" target="_self">TensorFillFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_9_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillGaussianFunc.html" target="_self">TensorFillGaussianFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_2_0_10_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillLinearFunc.html" target="_self">TensorFillLinearFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_11_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFillRandomUniformFunc.html" target="_self">TensorFillRandomUniformFunc</a></td><td class="desc">Computes a random Gaussian distribution </td></tr>
<tr id="row_0_8_2_0_12_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper.html" target="_self">TensorForEachHelper</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_2_0_13_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorForEachHelper_3_01Func_00_01Rank_00_010_01_4.html" target="_self">TensorForEachHelper&lt; Func, Rank, 0 &gt;</a></td><td class="desc">Helper to perform for-each operation </td></tr>
<tr id="row_0_8_2_0_14_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorFuncBinaryOp.html" target="_self">TensorFuncBinaryOp</a></td><td class="desc">Helper to apply a binary operator in place </td></tr>
<tr id="row_0_8_2_0_15_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TensorUpdateOffDiagonalFunc.html" target="_self">TensorUpdateOffDiagonalFunc</a></td><td class="desc">&lt; Layout function </td></tr>
<tr id="row_0_8_2_0_16_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1detail_1_1TrivialConvert.html" target="_self">TrivialConvert</a></td><td class="desc">Helper to convert between types </td></tr>
<tr id="row_0_8_2_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1BlockForEach.html" target="_self">BlockForEach</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_2_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm.html" target="_self">Gemm</a></td><td class="desc"></td></tr>
<tr id="row_0_8_2_3_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_193dd3a37f00deff1e5dcd7c310afb1f.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd &gt;</a></td><td class="desc">Partial specialization for multiply-add </td></tr>
<tr id="row_0_8_2_4_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_55729eac7dbd6bf311ea36f680e83e93.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate &gt;</a></td><td class="desc">Partial specialization for multiply-add-saturate </td></tr>
<tr id="row_0_8_2_5_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1reference_1_1host_1_1Gemm_3_01ElementA_00_01LayoutA_00_01ElementB_00_01LayoutB_4f3f32c4b336238abfd741e87bfced46.html" target="_self">Gemm&lt; ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc &gt;</a></td><td class="desc">Partial specialization for XOR-popc </td></tr>
<tr id="row_0_9_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_9_" class="arrow" onclick="toggleFolder('0_9_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_9_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1thread_1_1Matrix.html" target="_self">Matrix</a></td><td class="desc">Per-thread matrix object storing a packed matrix </td></tr>
<tr id="row_0_10_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_10_" class="arrow" onclick="toggleFolder('0_10_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform.html" target="_self">transform</a></td><td class="desc"></td></tr>
<tr id="row_0_10_0_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_0_" class="arrow" onclick="toggleFolder('0_10_0_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1thread.html" target="_self">thread</a></td><td class="desc"></td></tr>
<tr id="row_0_10_0_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1thread_1_1Transpose.html" target="_self">Transpose</a></td><td class="desc">Transforms a fragment by doing a transpose </td></tr>
<tr id="row_0_10_0_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1thread_1_1Transpose_3_01ElementCount___00_01layout_1_1PitchLinearS99f8e05faf0bb5ed48a0154afe740d81.html" target="_self">Transpose&lt; ElementCount_, layout::PitchLinearShape&lt; 4, 4 &gt;, int8_t &gt;</a></td><td class="desc">Specialization for int8_t 4x4 transpose </td></tr>
<tr id="row_0_10_1_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_1_" class="arrow" onclick="toggleFolder('0_10_1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><a class="el" href="namespacecutlass_1_1transform_1_1threadblock.html" target="_self">threadblock</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator.html" target="_self">PredicatedTileAccessIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_1_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile.html" target="_self">PredicatedTileAccessIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_2_" class="arrow" onclick="toggleFolder('0_10_1_2_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__da632779aba661c0f4cfaaa78126b771.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_2_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__18e9cf25bb3b8edfaad595241a6dc2d7.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_3_" class="arrow" onclick="toggleFolder('0_10_1_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__1790abaa54a01f277d75766d5882fec8.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_3_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__8ccc62d47a092afc8bee32ffe9d1e4ba.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_4_" class="arrow" onclick="toggleFolder('0_10_1_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__7327fa15996bcb8502cdfcc192350fe1.html" target="_self">PredicatedTileAccessIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_4_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator2dThreadTile_3_01Shape__a56cbccec33ee916292ad9d068474609.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_5_" class="arrow" onclick="toggleFolder('0_10_1_5_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen89c687c583745a73cb485041911a4c4e.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_5_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenc07b5ec72f83e782121ac629288d61fe.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_6_" class="arrow" onclick="toggleFolder('0_10_1_6_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemenab63a1e105bf37f6371516cb9e2c5a7a.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_6_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemena9b06926a275b569ee9f7f142604b997.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_7_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_7_" class="arrow" onclick="toggleFolder('0_10_1_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen784a0e9da3f55064c47e5613791f51f7.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_7_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen41e459f664d17473570cf22fb616845f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_8_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_8_" class="arrow" onclick="toggleFolder('0_10_1_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen9838736ad62fae54213fbaf722a989ab.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_8_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen44ce348364e78f5a56fa0c2cef6af930.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_9_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_9_" class="arrow" onclick="toggleFolder('0_10_1_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen809793e785fb4211888c6b4e5dcfcb39.html" target="_self">PredicatedTileAccessIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessType_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_9_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileAccessIterator_3_01Shape___00_01Elemen058417e2cdd86f3cd6ad5458581571c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_10_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator.html" target="_self">PredicatedTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_11_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile.html" target="_self">PredicatedTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_12_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_12_" class="arrow" onclick="toggleFolder('0_10_1_12_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0165b39a630d10785a3558406f9adb99b9.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_12_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_01e11ed7192af5d7ad1bce5641fa13112e.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_13_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_13_" class="arrow" onclick="toggleFolder('0_10_1_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_017a517f3c73efd795ab05059cc9b111e1.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_13_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0b878062cc0cd214bf7e17d74ff17e246.html" target="_self">AccessType</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_13_1_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0145ef045e8f7d57dc718098adcb00cf3d.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_14_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_14_" class="arrow" onclick="toggleFolder('0_10_1_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_013671177d6219bfeb0e1b4dc4c1b5bf11.html" target="_self">PredicatedTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Transpose_ &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_14_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator2dThreadTile_3_01Shape___00_0102e766863c6ac9ec2063a02c4803eecb.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_15_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_15_" class="arrow" onclick="toggleFolder('0_10_1_15_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0068b3e874b5d93d11f0fa902c7f1d11d9.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_15_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00a6b756b1bcfbb35fe4a3e68ff074e380.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_16_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_16_" class="arrow" onclick="toggleFolder('0_10_1_16_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00f6b3a9dfab5e7c72d5233f7e5e6e3b9b.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_16_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00ebd1a63351e1085d0b718582ec7b06c8.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_17_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_17_" class="arrow" onclick="toggleFolder('0_10_1_17_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00e7c2c404e7aedfe60ad56bb5571306a1.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_17_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___006a5f2f7a8271031e6cdc5daa5441f2af.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_18_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_18_" class="arrow" onclick="toggleFolder('0_10_1_18_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___0041ea81994f8af0d4d071fdb9e66b5ff0.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_18_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___004d0f9b5e19c29acc17bcdc360dafebbd.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_19_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_19_" class="arrow" onclick="toggleFolder('0_10_1_19_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___00d670f969180a8d182dffb356ebcc957e.html" target="_self">PredicatedTileIterator&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; InterleavedK &gt;, AdvanceRank, ThreadMap_, AccessSize &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_19_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1PredicatedTileIterator_3_01Shape___00_01Element___009fd89f6dad84238fd7d63df0a0c0364f.html" target="_self">Params</a></td><td class="desc">Parameters object is precomputed state and is host-constructible </td></tr>
<tr id="row_0_10_1_20_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator.html" target="_self">RegularTileAccessIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_21_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__eb7d20f8b9d69e0ae5e7ef51dc480867.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_22_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__2c1476eaf582bfe972793e17babfe985.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_23_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__a3c11cf1f00ef7a1efb8389ac6e4c6e0.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_24_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0855e9d9ab619202d2397180c1e4c4a5.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_25_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__f04332958a49a47d6fb2b25201764630.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_26_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__6baada077236f1a368c61c5e11b45b72.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_27_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__0184b7188941788a96624510a4b2f876.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_28_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_28_" class="arrow" onclick="toggleFolder('0_10_1_28_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__ebf4714349612673e8b6609b763eeb6f.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_28_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_0a9491607d11be8e1780e79ad711aa42.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_29_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_29_" class="arrow" onclick="toggleFolder('0_10_1_29_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element__e9a9e0f4286f652f55eb9b863b21effe.html" target="_self">RegularTileAccessIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_29_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileAccessIterator_3_01Shape___00_01Element_3be8b96d170d886f39b6b30acab65e7a.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_30_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator.html" target="_self">RegularTileIterator</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_31_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile.html" target="_self">RegularTileIterator2dThreadTile</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_32_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Eleb60d066756d1c18f05fceee6a27bdb8a.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::ColumnMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_33_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele76ed82829532ae1c17f4c78158f036c7.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_34_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator2dThreadTile_3_01Shape___00_01Ele654c8f6161ae5340f040397a4e2e045c.html" target="_self">RegularTileIterator2dThreadTile&lt; Shape_, Element_, layout::RowMajorInterleaved&lt; 4 &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping </td></tr>
<tr id="row_0_10_1_35_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_011d3637dbd8bc58bcb020b51bf57fbfc0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_36_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_017982f81d4ef592e19c8427de2ea933a3.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_37_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_010889a732373c350de9b9a9f6c13cd761.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_38_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01187f8574e1fe9d7d5e8fbf09bd834bf0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_39_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01793f74bfd8f116a827948ab01a37349a.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_40_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01bd31b3810c1fedf2e7e5959ff92b5d3d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kRow &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_41_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0184a89653916f5d51ab59d1b386989a17.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_42_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0149454d361ea5885cf5166a920b5145df.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc">Regular tile iterator specialized for pitch-linear </td></tr>
<tr id="row_0_10_1_43_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01c20d35180520077a5a09b1e33543c1a5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_44_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a31b454d9c930525c1e9ca406a514f40.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_45_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0104ad31bd559a88cc418ae1cab7492ed5.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_46_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f6f6511b5033cad31083644ac69c54d8.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_47_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01b3fa5720e807697de61b9f937b269cd0.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kColumn &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_48_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_48_" class="arrow" onclick="toggleFolder('0_10_1_48_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01efd5013a2503d6567e2bf6b40c97360c.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value, int(128/sizeof(Element_))&gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_48_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_052caec9d5bceeb59b9a13cb3338ce64d.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_49_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_49_" class="arrow" onclick="toggleFolder('0_10_1_49_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0197fef2242a3454a7d1cebe61aee28b43.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::TensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Crosswise &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_49_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_039093927f4b1ee61538c569bf1ae4efd.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_50_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_50_" class="arrow" onclick="toggleFolder('0_10_1_50_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01a75d2cd74e722d6ad6a3b41aabfd432d.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandBCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_50_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_02d305cfb0b55c6fb236a52cf2240651e.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_51_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_51_" class="arrow" onclick="toggleFolder('0_10_1_51_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01f96bbeb63e6d4ce4a2551279de3a9f0e.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCongruous&lt; sizeof_bits&lt; Element_ &gt;::value &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_51_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_032f88d1be8b209e44a4815c707ba35bb.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_1_52_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span id="arr_0_10_1_52_" class="arrow" onclick="toggleFolder('0_10_1_52_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_01dbd6b8468d5bd787308d2f615a24d123.html" target="_self">RegularTileIterator&lt; Shape_, Element_, layout::VoltaTensorOpMultiplicandCrosswise&lt; sizeof_bits&lt; Element_ &gt;::value, Shape_::kContiguous &gt;, AdvanceRank, ThreadMap_, Alignment &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_1_52_0_" style="display:none;"><td class="entry"><span style="width:80px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1threadblock_1_1RegularTileIterator_3_01Shape___00_01Element___00_0390833403016f5d817416e20828845df.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection </td></tr>
<tr id="row_0_10_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_3_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_3_" class="arrow" onclick="toggleFolder('0_10_3_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread0082c3467229b12cc9dd996283ee7160.html" target="_self">PitchLinear2DThreadTileStripminedThreadMap&lt; Shape_, Threads, cutlass::layout::PitchLinearShape&lt; 4, 4 &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_10_3_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinear2DThreadTileStripminedThreadMap_3_01Shape___00_01Thread896c01a3c466da1bf392e0cdfced4d53.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
<tr id="row_0_10_4_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_4_" class="arrow" onclick="toggleFolder('0_10_4_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap.html" target="_self">PitchLinearStripminedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_4_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearStripminedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal implementation details </td></tr>
<tr id="row_0_10_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadContiguous.html" target="_self">PitchLinearTilePolicyStripminedThreadContiguous</a></td><td class="desc"></td></tr>
<tr id="row_0_10_6_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearTilePolicyStripminedThreadStrided.html" target="_self">PitchLinearTilePolicyStripminedThreadStrided</a></td><td class="desc"></td></tr>
<tr id="row_0_10_7_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_7_" class="arrow" onclick="toggleFolder('0_10_7_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap.html" target="_self">PitchLinearWarpRakedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_7_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpRakedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_8_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_8_" class="arrow" onclick="toggleFolder('0_10_8_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap.html" target="_self">PitchLinearWarpStripedThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_8_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1PitchLinearWarpStripedThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_9_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span id="arr_0_10_9_" class="arrow" onclick="toggleFolder('0_10_9_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap.html" target="_self">TransposePitchLinearThreadMap</a></td><td class="desc"></td></tr>
<tr id="row_0_10_9_0_" style="display:none;"><td class="entry"><span style="width:64px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap_1_1Detail.html" target="_self">Detail</a></td><td class="desc">Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) </td></tr>
<tr id="row_0_10_10_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMap2DThreadTile.html" target="_self">TransposePitchLinearThreadMap2DThreadTile</a></td><td class="desc">Thread Mapping a 2D threadtiled mapping as a transposed Pitchlinear2DThreadTile mapping </td></tr>
<tr id="row_0_10_11_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1transform_1_1TransposePitchLinearThreadMapSimt.html" target="_self">TransposePitchLinearThreadMapSimt</a></td><td class="desc"></td></tr>
<tr id="row_0_11_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1AlignedArray.html" target="_self">AlignedArray</a></td><td class="desc">Aligned array type </td></tr>
<tr id="row_0_12_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1AlignedBuffer.html" target="_self">AlignedBuffer</a></td><td class="desc">Modifies semantics of cutlass::Array&lt;&gt; to provide guaranteed alignment </td></tr>
<tr id="row_0_13_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_13_" class="arrow" onclick="toggleFolder('0_13_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4.html" target="_self">Array&lt; T, N, false &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_13_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_13_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reference.html" target="_self">const_reference</a></td><td class="desc">Reference object extracts sub-byte items </td></tr>
<tr id="row_0_13_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_13_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_13_4_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reference.html" target="_self">reference</a></td><td class="desc">Reference object inserts or extracts sub-byte items </td></tr>
<tr id="row_0_13_5_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01false_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_14_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_14_" class="arrow" onclick="toggleFolder('0_14_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4.html" target="_self">Array&lt; T, N, true &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_14_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__iterator.html" target="_self">const_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_14_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1const__reverse__iterator.html" target="_self">const_reverse_iterator</a></td><td class="desc">Bidirectional constant iterator over elements </td></tr>
<tr id="row_0_14_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1iterator.html" target="_self">iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_14_3_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Array_3_01T_00_01N_00_01true_01_4_1_1reverse__iterator.html" target="_self">reverse_iterator</a></td><td class="desc">Bidirectional iterator over elements </td></tr>
<tr id="row_0_15_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1CommandLine.html" target="_self">CommandLine</a></td><td class="desc"></td></tr>
<tr id="row_0_16_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1complex.html" target="_self">complex</a></td><td class="desc"></td></tr>
<tr id="row_0_17_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1ConstSubbyteReference.html" target="_self">ConstSubbyteReference</a></td><td class="desc"></td></tr>
<tr id="row_0_18_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Coord.html" target="_self">Coord</a></td><td class="desc">Statically-sized array specifying Coords within a tensor </td></tr>
<tr id="row_0_19_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1cuda__exception.html" target="_self">cuda_exception</a></td><td class="desc">C++ exception wrapper for CUDA <code>cudaError_t</code> </td></tr>
<tr id="row_0_20_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Distribution.html" target="_self">Distribution</a></td><td class="desc"><a class="el" href="structcutlass_1_1Distribution.html" title="Distribution type. ">Distribution</a> type </td></tr>
<tr id="row_0_21_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divide__assert.html" target="_self">divide_assert</a></td><td class="desc"></td></tr>
<tr id="row_0_22_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides.html" target="_self">divides</a></td><td class="desc"></td></tr>
<tr id="row_0_23_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_24_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1divides_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">divides&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_25_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType.html" target="_self">FloatType</a></td><td class="desc">Defines a floating-point type based on the number of exponent and mantissa bits </td></tr>
<tr id="row_0_26_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_0111_00_0152_01_4.html" target="_self">FloatType&lt; 11, 52 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_27_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_015_00_0110_01_4.html" target="_self">FloatType&lt; 5, 10 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_28_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1FloatType_3_018_00_0123_01_4.html" target="_self">FloatType&lt; 8, 23 &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_29_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1half__t.html" target="_self">half_t</a></td><td class="desc">IEEE half-precision floating-point type </td></tr>
<tr id="row_0_30_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1HostTensor.html" target="_self">HostTensor</a></td><td class="desc">Host tensor </td></tr>
<tr id="row_0_31_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1IdentityTensorLayout.html" target="_self">IdentityTensorLayout</a></td><td class="desc"></td></tr>
<tr id="row_0_32_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1integer__subbyte.html" target="_self">integer_subbyte</a></td><td class="desc">4-bit signed integer type </td></tr>
<tr id="row_0_33_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType.html" target="_self">IntegerType</a></td><td class="desc">Defines integers based on size and whether they are signed </td></tr>
<tr id="row_0_34_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01false_01_4.html" target="_self">IntegerType&lt; 1, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_35_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_011_00_01true_01_4.html" target="_self">IntegerType&lt; 1, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_36_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01false_01_4.html" target="_self">IntegerType&lt; 16, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_37_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0116_00_01true_01_4.html" target="_self">IntegerType&lt; 16, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_38_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01false_01_4.html" target="_self">IntegerType&lt; 32, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_39_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0132_00_01true_01_4.html" target="_self">IntegerType&lt; 32, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_40_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01false_01_4.html" target="_self">IntegerType&lt; 4, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_41_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_014_00_01true_01_4.html" target="_self">IntegerType&lt; 4, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_42_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01false_01_4.html" target="_self">IntegerType&lt; 64, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_43_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_0164_00_01true_01_4.html" target="_self">IntegerType&lt; 64, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_44_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01false_01_4.html" target="_self">IntegerType&lt; 8, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_45_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1IntegerType_3_018_00_01true_01_4.html" target="_self">IntegerType&lt; 8, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_46_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1is__pow2.html" target="_self">is_pow2</a></td><td class="desc"></td></tr>
<tr id="row_0_47_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1KernelLaunchConfiguration.html" target="_self">KernelLaunchConfiguration</a></td><td class="desc">Structure containing the basic launch configuration of a CUDA kernel </td></tr>
<tr id="row_0_48_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down.html" target="_self">log2_down</a></td><td class="desc"></td></tr>
<tr id="row_0_49_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__down_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_down&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_50_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up.html" target="_self">log2_up</a></td><td class="desc"></td></tr>
<tr id="row_0_51_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1log2__up_3_01N_00_011_00_01Count_01_4.html" target="_self">log2_up&lt; N, 1, Count &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_52_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixCoord.html" target="_self">MatrixCoord</a></td><td class="desc"></td></tr>
<tr id="row_0_53_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1MatrixShape.html" target="_self">MatrixShape</a></td><td class="desc">Describes the size of a matrix tile </td></tr>
<tr id="row_0_54_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Max.html" target="_self">Max</a></td><td class="desc"></td></tr>
<tr id="row_0_55_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum.html" target="_self">maximum</a></td><td class="desc"></td></tr>
<tr id="row_0_56_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">maximum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_57_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1maximum_3_01float_01_4.html" target="_self">maximum&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_58_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Min.html" target="_self">Min</a></td><td class="desc"></td></tr>
<tr id="row_0_59_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum.html" target="_self">minimum</a></td><td class="desc"></td></tr>
<tr id="row_0_60_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minimum&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_61_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minimum_3_01float_01_4.html" target="_self">minimum&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_62_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus.html" target="_self">minus</a></td><td class="desc"></td></tr>
<tr id="row_0_63_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_64_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1minus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">minus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_65_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies.html" target="_self">multiplies</a></td><td class="desc"></td></tr>
<tr id="row_0_66_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_67_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiplies_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiplies&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_68_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add.html" target="_self">multiply_add</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_69_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01half__t_00_01N_01_4_00_01Array_3_01half__t_00_01N_01adaeadb27c0e4439444709c0eb30963.html" target="_self">multiply_add&lt; Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt;, Array&lt; half_t, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_70_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_00_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">multiply_add&lt; Array&lt; T, N &gt;, Array&lt; T, N &gt;, Array&lt; T, N &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_71_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_72_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01complex_3_01T_01_4_00_01T_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; complex&lt; T &gt;, T, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_73_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1multiply__add_3_01T_00_01complex_3_01T_01_4_00_01complex_3_01T_01_4_01_4.html" target="_self">multiply_add&lt; T, complex&lt; T &gt;, complex&lt; T &gt; &gt;</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_0_74_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate.html" target="_self">negate</a></td><td class="desc"></td></tr>
<tr id="row_0_75_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_76_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1negate_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">negate&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_77_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter.html" target="_self">NumericArrayConverter</a></td><td class="desc">Conversion operator for Array </td></tr>
<tr id="row_0_78_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_012_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, 2, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;float, 2&gt; &lt;= Array&lt;half_t, 2&gt;, round to nearest </td></tr>
<tr id="row_0_79_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01float_00_01half__t_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; float, half_t, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;half&gt; &lt;= Array&lt;float&gt; </td></tr>
<tr id="row_0_80_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_012_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, 2, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Partial specialization for Array&lt;half, 2&gt; &lt;= Array&lt;float, 2&gt;, round to nearest </td></tr>
<tr id="row_0_81_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericArrayConverter_3_01half__t_00_01float_00_01N_00_01Round_01_4.html" target="_self">NumericArrayConverter&lt; half_t, float, N, Round &gt;</a></td><td class="desc">Partial specialization for Array&lt;half&gt; &lt;= Array&lt;float&gt; </td></tr>
<tr id="row_0_82_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter.html" target="_self">NumericConverter</a></td><td class="desc"></td></tr>
<tr id="row_0_83_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01float_00_01half__t_00_01Round_01_4.html" target="_self">NumericConverter&lt; float, half_t, Round &gt;</a></td><td class="desc">Partial specialization for float &lt;= <a class="el" href="structcutlass_1_1half__t.html" title="IEEE half-precision floating-point type. ">half_t</a> </td></tr>
<tr id="row_0_84_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__to__nearest_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_to_nearest &gt;</a></td><td class="desc">Specialization for round-to-nearest </td></tr>
<tr id="row_0_85_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01half__t_00_01float_00_01FloatRoundStyle_1_1round__toward__zero_01_4.html" target="_self">NumericConverter&lt; half_t, float, FloatRoundStyle::round_toward_zero &gt;</a></td><td class="desc">Specialization for round-toward-zero </td></tr>
<tr id="row_0_86_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01int8__t_00_01float_00_01Round_01_4.html" target="_self">NumericConverter&lt; int8_t, float, Round &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_87_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverter_3_01T_00_01T_00_01Round_01_4.html" target="_self">NumericConverter&lt; T, T, Round &gt;</a></td><td class="desc">Partial specialization for float &lt;= <a class="el" href="structcutlass_1_1half__t.html" title="IEEE half-precision floating-point type. ">half_t</a> </td></tr>
<tr id="row_0_88_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1NumericConverterClamp.html" target="_self">NumericConverterClamp</a></td><td class="desc"></td></tr>
<tr id="row_0_89_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus.html" target="_self">plus</a></td><td class="desc"></td></tr>
<tr id="row_0_90_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01half__t_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; half_t, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_91_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1plus_3_01Array_3_01T_00_01N_01_4_01_4.html" target="_self">plus&lt; Array&lt; T, N &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_92_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_92_" class="arrow" onclick="toggleFolder('0_92_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector.html" target="_self">PredicateVector</a></td><td class="desc">Statically sized array of bits implementing </td></tr>
<tr id="row_0_92_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1ConstIterator.html" target="_self">ConstIterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read and write access to predicates </td></tr>
<tr id="row_0_92_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" target="_self">Iterator</a></td><td class="desc">An iterator implementing <a class="el" href="group__predicate__iterator__concept.html">Predicate Iterator Concept</a> enabling sequential read and write access to predicates </td></tr>
<tr id="row_0_92_2_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1PredicateVector_1_1TrivialIterator.html" target="_self">TrivialIterator</a></td><td class="desc"><a class="el" href="classcutlass_1_1PredicateVector_1_1Iterator.html" title="An iterator implementing Predicate Iterator Concept enabling sequential read and write access to pred...">Iterator</a> that always returns true </td></tr>
<tr id="row_0_93_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType.html" target="_self">RealType</a></td><td class="desc">Used to determine the real-valued underlying type of a numeric type T </td></tr>
<tr id="row_0_94_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1RealType_3_01complex_3_01T_01_4_01_4.html" target="_self">RealType&lt; complex&lt; T &gt; &gt;</a></td><td class="desc">Partial specialization for complex-valued type </td></tr>
<tr id="row_0_95_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory.html" target="_self">ReferenceFactory</a></td><td class="desc"></td></tr>
<tr id="row_0_96_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01false_01_4.html" target="_self">ReferenceFactory&lt; Element, false &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_97_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ReferenceFactory_3_01Element_00_01true_01_4.html" target="_self">ReferenceFactory&lt; Element, true &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_98_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1ScalarIO.html" target="_self">ScalarIO</a></td><td class="desc">Helper to enable formatted printing of CUTLASS scalar types to an ostream </td></tr>
<tr id="row_0_99_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1Semaphore.html" target="_self">Semaphore</a></td><td class="desc">CTA-wide semaphore for inter-CTA synchronization </td></tr>
<tr id="row_0_100_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits.html" target="_self">sizeof_bits</a></td><td class="desc">Defines the size of an element in bits </td></tr>
<tr id="row_0_101_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01Array_3_01T_00_01N_00_01RegisterSized_01_4_01_4.html" target="_self">sizeof_bits&lt; Array&lt; T, N, RegisterSized &gt; &gt;</a></td><td class="desc">Statically sized array for any data type </td></tr>
<tr id="row_0_102_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01bin1__t_01_4.html" target="_self">sizeof_bits&lt; bin1_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for bin1_t </td></tr>
<tr id="row_0_103_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01int4b__t_01_4.html" target="_self">sizeof_bits&lt; int4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for int4b_t </td></tr>
<tr id="row_0_104_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint1b__t_01_4.html" target="_self">sizeof_bits&lt; uint1b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint1b_t </td></tr>
<tr id="row_0_105_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sizeof__bits_3_01uint4b__t_01_4.html" target="_self">sizeof_bits&lt; uint4b_t &gt;</a></td><td class="desc">Defines the size of an element in bits - specialized for uint4b_t </td></tr>
<tr id="row_0_106_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1sqrt__est.html" target="_self">sqrt_est</a></td><td class="desc"></td></tr>
<tr id="row_0_107_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1SubbyteReference.html" target="_self">SubbyteReference</a></td><td class="desc"></td></tr>
<tr id="row_0_108_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1Tensor4DCoord.html" target="_self">Tensor4DCoord</a></td><td class="desc">Defines a canonical 4D coordinate used by tensor operations </td></tr>
<tr id="row_0_109_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorRef.html" target="_self">TensorRef</a></td><td class="desc"></td></tr>
<tr id="row_0_110_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="classcutlass_1_1TensorView.html" target="_self">TensorView</a></td><td class="desc"></td></tr>
<tr id="row_0_111_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits.html" target="_self">TypeTraits</a></td><td class="desc"></td></tr>
<tr id="row_0_112_" style="display:none;"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span id="arr_0_112_" class="arrow" onclick="toggleFolder('0_112_')">&#9658;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; double &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_112_0_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1integer__type.html" target="_self">integer_type</a></td><td class="desc"></td></tr>
<tr id="row_0_112_1_" style="display:none;"><td class="entry"><span style="width:48px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01double_01_4_01_4_1_1unsigned__type.html" target="_self">unsigned_type</a></td><td class="desc"></td></tr>
<tr id="row_0_113_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01float_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; float &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_114_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_115_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01complex_3_01half__t_01_4_01_4.html" target="_self">TypeTraits&lt; complex&lt; half_t &gt; &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_116_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01double_01_4.html" target="_self">TypeTraits&lt; double &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_117_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01float_01_4.html" target="_self">TypeTraits&lt; float &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_118_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01half__t_01_4.html" target="_self">TypeTraits&lt; half_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_119_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int_01_4.html" target="_self">TypeTraits&lt; int &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_120_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int64__t_01_4.html" target="_self">TypeTraits&lt; int64_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_121_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01int8__t_01_4.html" target="_self">TypeTraits&lt; int8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_122_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint64__t_01_4.html" target="_self">TypeTraits&lt; uint64_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_123_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01uint8__t_01_4.html" target="_self">TypeTraits&lt; uint8_t &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_124_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1TypeTraits_3_01unsigned_01_4.html" target="_self">TypeTraits&lt; unsigned &gt;</a></td><td class="desc"></td></tr>
<tr id="row_0_125_" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structcutlass_1_1xor__add.html" target="_self">xor_add</a></td><td class="desc">Fused multiply-add </td></tr>
<tr id="row_1_"><td class="entry"><span style="width:0px;display:inline-block;">&#160;</span><span id="arr_1_" class="arrow" onclick="toggleFolder('1_')">&#9658;</span><span class="icona"><span class="icon">N</span></span><b>std</b></td><td class="desc">STL namespace </td></tr>
<tr id="row_1_0_" class="even" style="display:none;"><td class="entry"><span style="width:32px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structstd_1_1numeric__limits_3_01cutlass_1_1half__t_01_4.html" target="_self">numeric_limits&lt; cutlass::half_t &gt;</a></td><td class="desc">Numeric limits </td></tr>
<tr id="row_2_" class="even"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugType.html" target="_self">DebugType</a></td><td class="desc"></td></tr>
<tr id="row_3_"><td class="entry"><span style="width:16px;display:inline-block;">&#160;</span><span class="icona"><span class="icon">C</span></span><a class="el" href="structDebugValue.html" target="_self">DebugValue</a></td><td class="desc"></td></tr>
</table>
</div><!-- directory -->
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
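
The class index above ends with CUTLASS's elementwise functors and numeric converters (multiply_add, negate, plus, NumericConverter, NumericArrayConverter, sizeof_bits). A minimal host-side sketch of how these templates compose, assuming a CUTLASS checkout on the include path (the values and array width here are illustrative):

    #include <iostream>
    #include "cutlass/array.h"
    #include "cutlass/numeric_types.h"
    #include "cutlass/numeric_conversion.h"
    #include "cutlass/functional.h"

    int main() {
      // Scalar conversion float -> half_t, round-to-nearest
      // (the NumericConverter<half_t, float, round_to_nearest> specialization above).
      cutlass::NumericConverter<cutlass::half_t, float,
                                cutlass::FloatRoundStyle::round_to_nearest> to_half;
      cutlass::half_t h = to_half(1.5f);

      // Vector-wide conversion Array<float, 4> <= Array<half_t, 4>
      // (the NumericArrayConverter<float, half_t, N, Round> specialization above).
      cutlass::Array<cutlass::half_t, 4> src;
      src.fill(h);
      cutlass::NumericArrayConverter<float, cutlass::half_t, 4> to_float;
      cutlass::Array<float, 4> x = to_float(src);

      // Elementwise fused multiply-add, d[i] = x[i] * x[i] + x[i]
      // (the multiply_add<Array<T, N>, ...> specialization above).
      cutlass::multiply_add<cutlass::Array<float, 4>,
                            cutlass::Array<float, 4>,
                            cutlass::Array<float, 4>> fma_op;
      cutlass::Array<float, 4> d = fma_op(x, x, x);

      // sizeof_bits reports element width in bits, including sub-byte types.
      static_assert(cutlass::sizeof_bits<cutlass::half_t>::value == 16, "half_t is 16 bits");

      std::cout << d[0] << std::endl;  // 1.5f * 1.5f + 1.5f = 3.75
      return 0;
    }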

docs/arch_2mma_8h.html Normal file

@@ -0,0 +1,156 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Templates exposing architecture support for multiply-add operations.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="array_8h_source.html">cutlass/array.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="numeric__types_8h_source.html">cutlass/numeric_types.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm50_8h_source.html">cutlass/arch/mma_sm50.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm60_8h_source.html">cutlass/arch/mma_sm60.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="arch_2mma__sm61_8h_source.html">cutlass/arch/mma_sm61.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="mma__sm70_8h_source.html">cutlass/arch/mma_sm70.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="mma__sm75_8h_source.html">cutlass/arch/mma_sm75.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma_8h__incl.png" border="0" usemap="#mma_8h" alt=""/></div>
<map name="mma_8h" id="mma_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma_8h__dep__incl.png" border="0" usemap="#mma_8hdep" alt=""/></div>
<map name="mma_8hdep" id="mma_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma.html">cutlass::arch::Mma&lt; Shape_, kThreads_, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01ElementAb6e65b2cf5ede7f41cb070a767158dee.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
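
arch/mma.h, documented above, declares the architecture-level Mma atom and pulls in its per-SM specializations. A minimal sketch of the GemmShape&lt;1, 1, 1&gt; form, which computes one scalar multiply-add per call (the layout and operator tags chosen here are illustrative, not mandated by the header):

    #include "cutlass/array.h"
    #include "cutlass/layout/matrix.h"
    #include "cutlass/arch/mma.h"

    int main() {
      // Single-threaded scalar multiply-add atom: d = a * b + c.
      using Mma = cutlass::arch::Mma<
          cutlass::gemm::GemmShape<1, 1, 1>, 1,   // shape, number of threads
          float, cutlass::layout::RowMajor,       // A
          float, cutlass::layout::ColumnMajor,    // B
          float, cutlass::layout::RowMajor,       // C
          cutlass::arch::OpMultiplyAdd>;

      cutlass::Array<float, 1> d, a, b, c;
      a[0] = 2.0f; b[0] = 3.0f; c[0] = 1.0f;

      Mma()(d, a, b, c);  // d[0] == 7.0f
      return 0;
    }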


@@ -0,0 +1 @@
7d16b59e6ba0442b8a275a213d5da3a6


@@ -0,0 +1 @@
d1fff3f9d55a262110aa6a456caa91e0

File diff suppressed because one or more lines are too long


@@ -0,0 +1,176 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=9"/>
<meta name="generator" content="Doxygen 1.8.11"/>
<title>CUTLASS: mma_sm50.h File Reference</title>
<link href="tabs.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript" src="dynsections.js"></script>
<link href="search/search.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript" src="search/searchdata.js"></script>
<script type="text/javascript" src="search/search.js"></script>
<script type="text/javascript">
$(document).ready(function() { init_search(); });
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
});
</script><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js"></script>
<link href="doxygen.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
<div id="titlearea">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr style="height: 56px;">
<td id="projectlogo"><img alt="Logo" src="cutlass-logo-small.png"/></td>
<td id="projectalign" style="padding-left: 0.5em;">
<div id="projectname">CUTLASS
</div>
<div id="projectbrief">CUDA Templates for Linear Algebra Subroutines and Solvers</div>
</td>
</tr>
</tbody>
</table>
</div>
<!-- end header part -->
<!-- Generated by Doxygen 1.8.11 -->
<script type="text/javascript">
var searchBox = new SearchBox("searchBox", "search",false,'Search');
</script>
<div id="navrow1" class="tabs">
<ul class="tablist">
<li><a href="index.html"><span>Main&#160;Page</span></a></li>
<li><a href="modules.html"><span>Modules</span></a></li>
<li><a href="namespaces.html"><span>Namespaces</span></a></li>
<li><a href="annotated.html"><span>Classes</span></a></li>
<li class="current"><a href="files.html"><span>Files</span></a></li>
<li>
<div id="MSearchBox" class="MSearchBoxInactive">
<span class="left">
<img id="MSearchSelect" src="search/mag_sel.png"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
alt=""/>
<input type="text" id="MSearchField" value="Search" accesskey="S"
onfocus="searchBox.OnSearchFieldFocus(true)"
onblur="searchBox.OnSearchFieldFocus(false)"
onkeyup="searchBox.OnSearchFieldChange(event)"/>
</span><span class="right">
<a id="MSearchClose" href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" border="0" src="search/close.png" alt=""/></a>
</span>
</div>
</li>
</ul>
</div>
<div id="navrow2" class="tabs2">
<ul class="tablist">
<li><a href="files.html"><span>File&#160;List</span></a></li>
<li><a href="globals.html"><span>File&#160;Members</span></a></li>
</ul>
</div>
<!-- window showing the filter options -->
<div id="MSearchSelectWindow"
onmouseover="return searchBox.OnSearchSelectShow()"
onmouseout="return searchBox.OnSearchSelectHide()"
onkeydown="return searchBox.OnSearchSelectKey(event)">
</div>
<!-- iframe showing the search results (closed by default) -->
<div id="MSearchResultsWindow">
<iframe src="javascript:void(0)" frameborder="0"
name="MSearchResults" id="MSearchResults">
</iframe>
</div>
<div id="nav-path" class="navpath">
<ul>
<li class="navelem"><a class="el" href="dir_d44c64559bbebec7f509842c48db8b23.html">include</a></li><li class="navelem"><a class="el" href="dir_6baf2bb612a2f0daa69af3101ede80a1.html">cutlass</a></li><li class="navelem"><a class="el" href="dir_048c1df36ab9c2efbb0733edba6291c9.html">arch</a></li> </ul>
</div>
</div><!-- top -->
<div class="header">
<div class="summary">
<a href="#nested-classes">Classes</a> &#124;
<a href="#namespaces">Namespaces</a> </div>
<div class="headertitle">
<div class="title">arch/mma_sm50.h File Reference</div> </div>
</div><!--header-->
<div class="contents">
<p>Matrix multiply.
<a href="#details">More...</a></p>
<div class="textblock"><code>#include &quot;<a class="el" href="arch_2mma_8h_source.html">cutlass/arch/mma.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="complex_8h_source.html">cutlass/complex.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="layout_2matrix_8h_source.html">cutlass/layout/matrix.h</a>&quot;</code><br />
<code>#include &quot;<a class="el" href="include_2cutlass_2gemm_2gemm_8h_source.html">cutlass/gemm/gemm.h</a>&quot;</code><br />
</div><div class="textblock"><div class="dynheader">
Include dependency graph for arch/mma_sm50.h:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm50_8h__incl.png" border="0" usemap="#mma__sm50_8h" alt=""/></div>
<map name="mma__sm50_8h" id="mma__sm50_8h">
</map>
</div>
</div><div class="textblock"><div class="dynheader">
This graph shows which files directly or indirectly include this file:</div>
<div class="dyncontent">
<div class="center"><img src="arch_2mma__sm50_8h__dep__incl.png" border="0" usemap="#mma__sm50_8hdep" alt=""/></div>
<map name="mma__sm50_8hdep" id="mma__sm50_8hdep">
</map>
</div>
</div>
<p><a href="arch_2mma__sm50_8h_source.html">Go to the source code of this file.</a></p>
<table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="nested-classes"></a>
Classes</h2></td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_004bb3fd76ca2af7b3210676fa9644d95b.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_0aa57e6a2e6b5da37d10688bf99419a23.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01int_00_00b2dff9ce8caad9aff5bc6a355539161.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_76f9d24016e1b4167b16f4d7628c9546.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; float &gt;, LayoutA, float, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_f1c9d2ee842455cd0c5b71d56108d468.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, float, LayoutA, complex&lt; float &gt;, LayoutB, complex&lt; float &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01float_00e3e12e263df6506b8cf06c3f4d478b8e.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_30fa42e1ad201df010637cd22fc070a1.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, complex&lt; double &gt;, LayoutA, double, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01complex_48b3a43bc03fff93a111ac01abe7e40d.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, double, LayoutA, complex&lt; double &gt;, LayoutB, complex&lt; double &gt;, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01double_070b94670e040ed5855e5b42d5ca8a443.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:"><td class="memItemLeft" align="right" valign="top">struct &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html">cutlass::arch::Mma&lt; gemm::GemmShape&lt; 1, 1, 1 &gt;, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd &gt;</a></td></tr>
<tr class="memdesc:"><td class="mdescLeft">&#160;</td><td class="mdescRight">Matrix multiply-add operation. <a href="structcutlass_1_1arch_1_1Mma_3_01gemm_1_1GemmShape_3_011_00_011_00_011_01_4_00_011_00_01half__t_4f30ee91f7bb3844ff7579c68d078818.html#details">More...</a><br /></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table><table class="memberdecls">
<tr class="heading"><td colspan="2"><h2 class="groupheader"><a name="namespaces"></a>
Namespaces</h2></td></tr>
<tr class="memitem:namespacecutlass"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass.html">cutlass</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
<tr class="memitem:namespacecutlass_1_1arch"><td class="memItemLeft" align="right" valign="top"> &#160;</td><td class="memItemRight" valign="bottom"><a class="el" href="namespacecutlass_1_1arch.html">cutlass::arch</a></td></tr>
<tr class="separator:"><td class="memSeparator" colspan="2">&#160;</td></tr>
</table>
</div><!-- contents -->
<!-- start footer part -->
<hr class="footer"/><address class="footer"><small>
Generated by &#160;<a href="http://www.doxygen.org/index.html">
<img class="footer" src="doxygen.png" alt="doxygen"/>
</a> 1.8.11
</small></address>
</body>
</html>
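
arch/mma_sm50.h supplies the scalar fallback specializations tabulated above, including the mixed-precision form that takes half_t operands and accumulates in float. A minimal sketch of that specialization with illustrative values (the complex&lt;T&gt; forms listed above follow the same call shape with Array&lt;complex&lt;T&gt;, 1&gt; operands):

    #include "cutlass/array.h"
    #include "cutlass/numeric_types.h"
    #include "cutlass/layout/matrix.h"
    #include "cutlass/arch/mma.h"

    int main() {
      // half_t x half_t + float -> float: operands are widened, then fused multiply-add.
      using Mma = cutlass::arch::Mma<
          cutlass::gemm::GemmShape<1, 1, 1>, 1,
          cutlass::half_t, cutlass::layout::RowMajor,
          cutlass::half_t, cutlass::layout::ColumnMajor,
          float, cutlass::layout::RowMajor,
          cutlass::arch::OpMultiplyAdd>;

      cutlass::Array<cutlass::half_t, 1> a, b;
      cutlass::Array<float, 1> d, c;
      a[0] = cutlass::half_t(0.5f);
      b[0] = cutlass::half_t(4.0f);
      c[0] = 0.25f;

      Mma()(d, a, b, c);  // d[0] == 2.25f
      return 0;
    }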


@@ -0,0 +1 @@
988e6466c703c4e63c9a889b8c3c54b5


@@ -0,0 +1 @@
03f1613fdffbd6e7575de0d2967d08bf

Some files were not shown because too many files have changed in this diff.