|
|
ac201a0eaf
|
[Feature] Support Decode Context Parallel (DCP) for MLA (#23734)
Signed-off-by: hongchao <hongchao@msh.team>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: hongchao <hongchao@msh.team>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-09-06 13:24:05 +08:00 |
|
|
|
186aced5ff
|
[Kernel] cuda kernels for upcoming decode context parallel feature (#23791)
Co-authored-by: hongchao <hongchao@msh.team>
|
2025-08-28 15:29:11 +08:00 |
|
|
|
19fe1a0510
|
[Kernel] Add FP8 support with FlashMLA backend (#22668)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
|
2025-08-22 02:26:32 +00:00 |
|
|
|
288cc6c234
|
[Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-02-21 15:30:12 -08:00 |
|
|
|
75e94309e8
|
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
|
2025-02-04 18:22:24 -08:00 |
|
|
|
cabaf4eff3
|
[Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
|
2025-01-30 23:49:37 -08:00 |
|
|
|
e97f802b2d
|
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
|
2025-01-23 18:04:03 +00:00 |
|
|
|
0e63494cf3
|
Add fp8 support to reshape_and_cache_flash (#6667)
|
2024-07-24 18:36:52 +00:00 |
|
|
|
978aed5300
|
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081)
|
2024-07-16 15:31:32 -07:00 |
|
|
|
5467ac3196
|
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)
|
2024-06-09 16:23:30 -04:00 |
|
|
|
5f6d10c14c
|
[CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722)
|
2024-05-22 07:18:41 +00:00 |
|
|
|
c833101740
|
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
|
2024-05-09 18:04:17 -06:00 |
|
|
|
20cfcdec99
|
[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659)
|
2024-05-08 12:07:05 -07:00 |
|
|
|
63575bc2e1
|
[Core][Optimization] change python dict to pytorch tensor (#4607)
|
2024-05-06 21:30:27 -07:00 |
|
|
|
43c413ec57
|
[Kernel] Use flashinfer for decoding (#4353)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
|
2024-05-03 15:51:27 -07:00 |
|
|
|
2ff767b513
|
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-03 14:15:55 -07:00 |
|
|
|
d6e4a130b0
|
[Minor] Remove gather_cached_kv kernel (#3043)
|
2024-02-26 15:00:54 -08:00 |
|
|
|
9090bf02e7
|
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-28 16:43:54 -08:00 |
|
|
|
614856da25
|
Avoid multiple redefinition (#1817)
|
2023-12-14 09:35:58 -08:00 |
|
|
|
e0c6f556e8
|
[Build] Avoid building too many extensions (#1624)
|
2023-11-23 16:31:19 -08:00 |
|