fa7e254a7f
[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Co-authored-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-30 17:14:41 +08:00

3468f17ebe
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names (#25489)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-09-25 17:37:50 +00:00

05c19485a5
[Kernel] Support DCP for Triton backend (#25132)
Signed-off-by: Wei Wei <wwei6@meta.com>
2025-09-24 18:09:34 -07:00

969b4da3a6
[V0 Deprecation] Remove placeholder attn (#25510)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-09-23 22:12:14 +00:00

100b630a60
[V1][Kernel] Add triton implementation for reshape_and_cache_flash (#24503)
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-09-23 12:52:40 -04:00

b6a136b58c
[CI/Build] Fix disabled v1 attention backend selection test (#25471)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-09-23 13:05:46 +00:00

bc6e542d9f
Remove V0 attention backends (#25351)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-09-21 16:03:28 -07:00

52c2a8d4ad
[V0 Deprecation] Remove LLMEngine (#25033)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-09-20 17:56:30 -07:00

a2a5f79e09
Optimize triton unified attention performance for sliding window attention (#24390)
Signed-off-by: zixi-qi <qizixi@meta.com>
2025-09-19 13:07:26 -06:00

01a583fea4
[Kernel] Decouple Tile Size from Block Size in Triton Unified Attention Kernel (#21197)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-09-18 14:27:01 +00:00

087c6ffc92
[CI Bugfix] Fix failing test_invalid_env (#25078)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-09-17 08:28:58 -07:00

759ef49b15
Remove V0 Encoder-Decoder Support (#24907)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
2025-09-15 21:17:14 -07:00

5fe643fc26
Add FLASHINFER_MLA to backend selector test (#24753)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-09-12 22:30:07 +00:00

72fc8aa412
[Multi Modal] Add FA3 in VIT (#24347)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-09-12 21:27:24 +08:00

dcb28a332b
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration (#21078)
Signed-off-by: hjjq <hanjieq@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-10 15:31:10 -07:00

6cbd41909e
[Feature] ViT attention unification (#23880) (#23978)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-09-10 06:10:14 -07:00

bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel (#23647)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-09-08 20:53:07 -07:00

402759d472
[Attention] FlashAttn MLA (#14258)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2025-09-04 02:47:59 -07:00

186aced5ff
[Kernel] cuda kernels for upcoming decode context parallel feature (#23791)
Co-authored-by: hongchao <hongchao@msh.team>
2025-08-28 15:29:11 +08:00

24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel (#22703)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-22 22:09:05 +00:00

19fe1a0510
[Kernel] Add FP8 support with FlashMLA backend (#22668)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-08-22 02:26:32 +00:00

1d353b6352
[Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-08-21 16:02:11 -04:00

03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00

14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend (#22776)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 19:54:16 -07:00

1723ef1aae
minor: zero workspace buffer init for flashinfer trtllm-gen attn (#22603)
2025-08-15 21:38:10 +00:00

0fe85087a9
[CI Perf] Prune tests in tests/kernels/attention/ (#22936)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-14 21:34:53 -06:00

53c730286c
[Misc] parametrize 'dtype' in test_flash_mla (#22641)
Signed-off-by: RUTHLESS-BOT <wujiafeng@cmbchina.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 16:31:48 -04:00

83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-08-05 02:45:34 -07:00

e79a12fc3a
[UX] Fail if an invalid attention backend is specified (#22217)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-08-04 23:54:52 -07:00

58b11b24a6
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-29 10:34:00 -04:00

b3caeb82e7
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
2025-07-25 06:50:21 -07:00

6dda13c86b
[Misc] Add sliding window to flashinfer test (#21282)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-21 08:37:49 -07:00

752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-19 13:53:17 -07:00

7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00

9fb52e523a
[V1] Support any head size for FlexAttention backend (#20467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00

e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends (#20412)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-07-06 08:48:13 -07:00

c5ebe040ac
test_attention compatibility with upcoming xformers change (#20487)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00

32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00

96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067)
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-07-01 16:12:19 +08:00

879f69bed3
[Refactor] Remove duplicate ceil_div (#20023)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-25 05:19:09 +00:00

0b73736a0d
[Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check (#19339)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-15 13:43:48 +08:00

2f1c19b245
[CI] change spell checker from codespell to typos (#18711)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-11 19:57:10 -07:00

7c644ab6d5
Fix Typo in Documentation and Function Name (#19442)
2025-06-10 22:44:11 -07:00

5f1ac1e1d1
Revert "[v1] Add fp32 support to v1 engine through flex attn" (#19404)
2025-06-10 01:30:20 -07:00

b8089195b4
[v1] Add fp32 support to v1 engine through flex attn (#19319)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-06-09 22:10:44 +08:00

4555143ea7
[CPU] V1 support for the CPU backend (#16441)
2025-06-03 18:43:01 -07:00

02f0c7b220
[Misc] Add SPDX-FileCopyrightText (#19100)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-06-03 11:20:17 -07:00

269d901734
[Bugfix][ROCm] Fix the power-of-2 exception from triton_unified_attention.py when running llama4 models, and fix the unit test (#18100)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-05-29 07:21:46 +08:00

dd5fa7e04f
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 (#17004)
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>
2025-05-21 08:35:00 -07:00

e6b8e65d2d
[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 (#18013)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-15 13:26:34 +08:00