|
|
9fb2d22032
|
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
|
2025-07-17 09:56:44 -04:00 |
|
|
|
5a7fb3ab9e
|
[Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
|
2025-07-17 09:10:09 +00:00 |
|
|
|
7bd4c37ae7
|
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-07-11 09:23:23 +00:00 |
|
|
|
31d5c1797f
|
[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf (#19830)
Signed-off-by: Luka Govedic <lgovedic@redhat.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-07-11 04:56:28 +00:00 |
|
|
|
e2de455c34
|
[Feature] Integrate SM100 DeepGEMM support (#20087)
|
2025-07-10 20:18:05 -07:00 |
|
|
|
0bbac1c1b4
|
[Bench] Add NVFP4 GEMM benchmark script (#20578)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-07-09 13:23:48 -04:00 |
|
|
|
cede942b87
|
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py (#20516)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-07-06 09:20:11 +00:00 |
|
|
|
1caca5a589
|
[Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-07-04 07:40:42 +00:00 |
|
|
|
c1909e7e8c
|
[Kernels] MoE refactor (#19636)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
|
2025-07-02 06:08:27 -07:00 |
|
|
|
9909726d2a
|
Enable ZP Support for Machete (#20268)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
|
2025-07-01 07:12:20 +00:00 |
|
|
|
a6c4b87fbc
|
Revert "[Feature] Integrate new deepgemm (#19820)" (#20049)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-06-24 19:45:22 -07:00 |
|
|
|
c6e3bba8e6
|
[Feature] Integrate new deepgemm (#19820)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-06-24 12:51:56 -07:00 |
|
|
|
4671ac6e2a
|
[Bugfix][Benchmark] Fix Marlin benchmark (#19929)
|
2025-06-24 07:25:12 +09:00 |
|
|
|
dfada85eee
|
[Frontend] Expose custom args in OpenAI APIs (#16862)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-06-18 17:41:11 -07:00 |
|
|
|
ffb2cd6b54
|
[Perf] Optimize moe_align_block_size CUDA kernel (#19572)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-06-17 11:49:26 -07:00 |
|
|
|
3d330c4c09
|
[Benchmark] Refactor benchmark script for fp8 & int8 (#19627)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-06-15 15:15:37 +08:00 |
|
|
|
b6efafd9e4
|
[Perf] Vectorize static / dynamic INT8 quant kernels (#19233)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-06-12 06:51:41 -07:00 |
|
|
|
4589b94032
|
[Bugfix] Fix benchmark_moe.py (#19016)
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn>
|
2025-06-09 18:04:36 -07:00 |
|
|
|
84166fee97
|
[Kernel] Integrate CUTLASS MoE kernel with PPLX (#18762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-06-06 18:26:11 -07:00 |
|
|
|
3465b87ef8
|
[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B (#19033)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
|
2025-06-05 19:10:08 -07:00 |
|
|
|
61059bee40
|
[Hardware][NVIDIA] FP4 MoE kernel optimization (#19110)
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com>
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com>
|
2025-06-05 09:48:26 -07:00 |
|
|
|
02f0c7b220
|
[Misc] Add SPDX-FileCopyrightText (#19100)
Signed-off-by: simon-mo <simon.mo@hey.com>
|
2025-06-03 11:20:17 -07:00 |
|
|
|
f49239cb45
|
Benchmark script for fp8 vs bf16 gemm (#17126)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-05-30 10:56:11 -06:00 |
|
|
|
1aa2f81b43
|
[Misc] Update type annotation for rotary embedding base (#18914)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-05-30 10:17:01 +08:00 |
|
|
|
4fc1bf813a
|
[Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking (#18454)
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com>
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com>
|
2025-05-23 16:16:26 -07:00 |
|
|
|
dd5fa7e04f
|
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 (#17004)
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>
|
2025-05-21 08:35:00 -07:00 |
|
|
|
009d9e7590
|
Convert benchmarks to ruff format (#18068)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-05-13 13:43:29 +00:00 |
|
|
|
0c0fdae84f
|
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362)
|
2025-05-09 16:24:41 -07:00 |
|
|
|
0a9bbaa104
|
[Misc] support model prefix & add deepseek vl2 tiny fused moe config (#17763)
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com>
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com>
|
2025-05-08 07:50:22 +00:00 |
|
|
|
f9bc5a0693
|
[Bugfix] Fix triton import with local TritonPlaceholder (#17446)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
|
2025-05-06 17:53:09 +08:00 |
|
|
|
9352cdb56d
|
[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning (#16263)
Signed-off-by: Lu Fang <lufang@fb.com>
Co-authored-by: Lu Fang <lufang@fb.com>
|
2025-05-02 19:44:19 +00:00 |
|
|
|
3e887d2e0c
|
permute/unpermute kernel for moe optimization (#14568)
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
|
2025-05-02 11:31:55 -07:00 |
|
|
|
8fc88d63f1
|
[Model] Add tuned triton fused_moe configs for Qwen3Moe (#17328)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-04-28 15:20:24 -07:00 |
|
|
|
423e9f1cbe
|
Use Transformers helper get_text_config() instead of checking for text_config (#17105)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-04-25 08:47:35 -07:00 |
|
|
|
2f54045508
|
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton (#15099)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
|
2025-04-24 22:51:02 -07:00 |
|
|
|
8d32dc603d
|
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
Co-authored-by: xinyuxiao <xinyuxiao2024@gmail.com>
|
2025-04-22 09:01:36 +01:00 |
|
|
|
55dcce91df
|
Upstream Llama4 Support to Main (#16113)
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com>
Signed-off-by: Chris Thi <chris.c.thi@gmail.com>
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Jon Swenson <jmswen@gmail.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Xiaodong Wang <xdwang@meta.com>
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-04-07 08:06:27 -07:00 |
|
|
|
e59ca942f5
|
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
Signed-off-by: Bill Nell <bnell@redhat.com>
|
2025-04-01 12:07:43 -04:00 |
|
|
|
9239bf718e
|
[Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
|
2025-03-27 00:54:44 +00:00 |
|
|
|
23114d3364
|
[Misc] Warn about v0 in benchmark_paged_attn.py (#15495)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-03-25 20:31:04 -07:00 |
|
|
|
f90d34b498
|
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 (#15322)
Signed-off-by: DefTruth <qiustudent_r@163.com>
|
2025-03-23 01:10:10 -07:00 |
|
|
|
400d483e87
|
[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-03-18 09:47:53 +00:00 |
|
|
|
a73122de96
|
[Bugfix] fix benchmark moe (#14653)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-03-13 16:12:42 +08:00 |
|
|
|
a1c8f3796c
|
dynamic distpatch of fp8 kernels (#14245)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
|
2025-03-11 10:54:56 -04:00 |
|
|
|
5ff0d32580
|
[V1] LoRA - Add triton kernels for V1 (#13096)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2025-03-10 17:27:53 -04:00 |
|
|
|
3b352a2f92
|
Correct capitalisation: VLLM -> vLLM (#14562)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-03-10 16:36:21 +00:00 |
|
|
|
c34eeec58d
|
[Bugfix] Correctly call cudaProfilerStop in benchmarks script (#14183)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
|
2025-03-07 00:42:49 +00:00 |
|
|
|
ca100c90fe
|
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-03-05 17:08:51 -08:00 |
|
|
|
7bab4bb048
|
[Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-03-05 23:11:29 +08:00 |
|
|
|
f78c0be80a
|
Fix benchmark_moe.py tuning for CUDA devices (#14164)
|
2025-03-03 21:11:03 -08:00 |
|