Commit Graph

8936 Commits

Author SHA1 Message Date
fce10dbed5 [XPU] Add xpu torch.compile support (#22609)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-08-27 05:33:27 +00:00
d272415e57 [Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs (#22674)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-08-27 05:00:21 +00:00
142ac08030 [Frontend] Optimize beam search performance by limiting concurrency (#23599)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-27 04:59:14 +00:00
3210264421 [Frontend] Add --log-error-stack to print stack trace for error response (#22960)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-27 04:58:59 +00:00
644d57d531 [Model] Add Ernie4.5 VL Model Support (#22514)
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2025-08-26 21:02:55 -07:00
c905684cfe [Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. (#23686)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-08-26 20:05:34 -07:00
786835807b [Bugfix]: Qwen3 Coder Tool Parser (#23099)
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2025-08-26 19:58:32 -07:00
Wei
fecbb7c782 [Bugfix][gpt-oss] passing the cache config in gpt-oss (#23613)
Signed-off-by: Wei Wei <wwei6@meta.com>
2025-08-27 02:54:23 +00:00
6dab89b8ec [Docs] Fix math rendering in docs (#23676)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 18:47:08 -07:00
de02b07db4 [Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 (#23678)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-27 09:34:57 +08:00
eb1995167e [gpt-oss] Enable unit test for response API harmony integration (#23533)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-26 18:23:26 -07:00
2c2b140ae8 [quantization] use channel scales for w4a8 + misc fixes (#23570)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-08-26 18:23:23 -07:00
c7c80af084 fix pynccl reduce_scatter (#23648)
Co-authored-by: hongchao <hongchao@msh.team>
2025-08-26 18:21:11 -07:00
6891205b16 [Feature][Responses API] Support MCP tool in background mode (#23494)
Signed-off-by: wuhang <wuhang6@huawei.com>
2025-08-27 01:06:58 +00:00
b1625dbe9c feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 (#23695)
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com>
2025-08-26 18:06:10 -07:00
585e0bde36 [Bugfix] UnboundLocalError when GptOss reasoning specified (#23054)
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com>
2025-08-27 00:29:52 +00:00
714872f1a9 [Compile] Fix Cmake Warning (#23689)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-26 23:48:32 +00:00
5f1af97f86 [V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 (#22594)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-26 23:28:55 +00:00
c3b0fd1ee6 [V1][P/D]P2pNcclConnector supports flashinfer (#23536)
Signed-off-by: Abatom <abzhonghua@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-08-26 22:56:16 +00:00
6421b66bf4 [Docs] Move quant supported hardware table to README (#23663)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 22:26:46 +00:00
2f13319f47 Enhance the pre-notification policy (#23532)
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com>
2025-08-26 20:41:36 +00:00
d696f86e7b [doc] Hybrid KV Cache Manager design doc (#22688)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 20:19:05 +00:00
9816b81f5f [Model] Enable video support for InternVL3.5 models (#23658)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-26 19:46:52 +00:00
c37c0af990 [Misc] Fix comments in tests/kernels/quantization (#23675)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-08-26 19:31:20 +00:00
9715f7bb0f [Bugfix] Fix incorrect original shape in hashing (#23672)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-08-26 19:01:25 +00:00
98aa16ff41 [v1] Add cross-attention KV cache support for encoder-decoder models (#23664)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-08-26 18:49:06 +00:00
227e231b55 [Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models (#23665)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-26 18:33:16 +00:00
730d0ac8b9 [Docs] Fix warnings in mkdocs build (#23649)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 18:19:23 +00:00
9b0187003e [Bugfix] Fix cuda event usage with CPU model runner (#23643)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-08-26 17:10:42 +00:00
44ac25eae2 [CI] [Doc]: Add GH Action for auto labeling issues with rocm tag (#20988)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-26 16:20:13 +00:00
7ea22e42d5 [Misc] Add override for allreduce fusion thresholds (#23639)
Signed-off-by: Julien Lin <jullin@nvidia.com>
2025-08-26 15:53:04 +00:00
9d4183dd2e [model] support qwen2audio embedding input (#23625)
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-26 23:48:08 +08:00
513298f1b4 [Bugfix] fix bf16 multimodal model hash (#23623)
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-26 23:47:50 +08:00
379f828fba [Docs] Reduce requirements for docs build (#23651)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 15:43:28 +00:00
1fdc732419 [ROCm] Starting to add AMD code reviewers for ROCm components (#23496)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
2025-08-26 07:32:37 -07:00
f58675bfb3 [CPU] add cpu fused moe pytorch native implementation (#23146)
Signed-off-by: Tianyu Li <tianyu.li@arm.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-08-26 14:09:17 +00:00
7c04779afa [Doc]: fix various spelling issues in multiple files (#23636)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-08-26 14:05:29 +00:00
f66673a39d [Kernel] Added flashinfer fp8 per-tensor gemms (#22895)
Signed-off-by: Julien Lin <jullin@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-26 06:54:04 -07:00
b78bed1bc5 [Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) (#23565)
Signed-off-by: oye93 <en.ouyang93@outlook.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-08-26 13:04:25 +00:00
164b2273c8 [Docs] Fix broken links to docs/api/summary.md (#23637)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 13:00:18 +00:00
2b4fc9bd9b Support FlashAttention Backend for Hybrid SSM Models (#23299)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-26 12:41:52 +00:00
ebd5a77bb5 feat: add usage to TranscriptionResponse (text and json response_format) (#23576)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2025-08-26 05:26:26 -07:00
384dd1b0a8 [Bugfix] Add missing enable_log_outputs parameter to init_app_state function (#23634)
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com>
2025-08-26 12:13:15 +00:00
fdeb3dac13 [Model] fix DeepSeek e_score_correction_bias dtype to fp32 (#23640)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-26 20:09:47 +08:00
d52358c1e0 [Perf] Remove duplicated NVFP4 blockscales to save memory (#23379)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-26 19:16:33 +08:00
6ace2f72b0 Fix writing benchmark results with tuple keys (#23633)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-08-26 19:16:09 +08:00
b00e69f8ca Fix nits from #20059 (#23548)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-26 03:27:20 -07:00
50fede6634 [V1] Enable V1 for compute capability < 8.0 + FP32 (#23614)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-26 03:00:18 -07:00
b5d34af328 [Bugfix] Fix scheduling when repeated images in one request (#23544)
Signed-off-by: Roger Wang <hey@rogerw.me>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.me>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
2025-08-26 09:46:28 +00:00
9b5f64238f [Bugfix] Fix Qwen25VL packed_modules_mapping (#23604)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-26 01:09:14 -07:00