6f62c94d7e
updated
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2025-10-03 13:47:16 -04:00
52a7d91980
debug
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2025-10-03 13:25:00 -04:00
d76541a6c5
Stop mergify from keeping stale PRs alive ( #26169 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-10-03 16:42:34 +00:00
dd96465fd7
[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 ( #26123 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-10-03 08:52:26 -07:00
4f8f47e87e
Fix undefined symbol: cutlass_moe_mm_sm100 ( #26098 )
...
Signed-off-by: Jun Jiang <jasl9187@hotmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-10-03 15:48:32 +00:00
d78fda7cda
[Renderer] Move Processor out of LLMEngine ( #26165 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-03 15:08:22 +00:00
73a99cc2a5
[Model] Fixed stream generator for gpt-oss + spec-decoding ( #26027 )
...
Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com >
2025-10-03 13:43:41 +00:00
adae0c1f43
[CI/Build] do not enforce precompilation on tpu ci tests ( #25992 )
...
Signed-off-by: Xiang Si <sixiang@google.com >
2025-10-03 13:38:42 +00:00
cbf9221992
[Model] Supplement to PR 24862: Pass param prefix to LLMHead ( #25805 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-10-03 21:34:53 +08:00
5f42fc53b6
[backends][short_conv] CUDA graph piecewise edits ( #24215 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2025-10-03 12:59:48 +00:00
8ee846c27c
[Bugfix] Re-enable prefill of max model length ( #24446 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
2025-10-03 14:13:34 +02:00
812b7f54a8
[Renderer] Move Processor out of AsyncLLM ( #24138 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-03 11:29:45 +00:00
5f2cacdb1e
Quick fix for IMA with the Prefix Prefill kernel during graph capture ( #25983 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-10-03 11:28:22 +00:00
aa5053e3fe
[Doc] Fixed shape description for fused_batched_moe.py ( #25668 )
...
Signed-off-by: Egor <e.a.krivov@gmail.com >
2025-10-03 04:00:23 -07:00
79aa244678
[Multi Modal] Configurable MM Profiling ( #25631 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-10-03 03:59:10 -07:00
2ed3f20dba
[openai] Fix missing tool usage check (system message) ( #24768 )
...
Signed-off-by: kyt <eluban4532@gmail.com >
2025-10-03 18:55:44 +08:00
48f309029a
[NIXL][Misc] Expose metrics from NIXL for logging to CLI ( #25388 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-10-03 10:47:59 +00:00
0e93ac0b3a
[CI] Fix distributed hybrid tests in CI ( #26155 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-10-03 09:14:18 +00:00
5446ad1d24
[test utils] correct wrong typing ( #26159 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
2025-10-03 02:11:49 -07:00
f9a8084e48
[Model] Use merge_by_field_config for MM models (InternVL family) ( #26153 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-03 01:59:06 -07:00
3e70e3d4d5
add(v1): RequestStatesStats to RequestOutput ( #24947 )
...
Signed-off-by: huijjj <huijong.jeong@squeezebits.com >
2025-10-03 08:56:25 +00:00
eb0fa43868
[Perf] Optimize reshape_and_cache CUDA Kernel ( #25955 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Liu-congo <1502632128@qq.com >
2025-10-03 01:33:46 -07:00
0ad9951c41
[Input] Remove unused prompt field ( #26097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-03 00:23:21 -07:00
8c9117181d
[Misc] Remove typing.List ( #26150 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-10-03 07:00:33 +00:00
c4b48d3c0f
[BUG] Reorder model config creation ( #26124 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-10-03 14:59:36 +08:00
10d765482d
FusedMoE support for the Transformers backend ( #22650 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-10-02 23:12:15 -07:00
39b643dc1a
[Model] Use merge_by_field_config for MM models (G) ( #26117 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-02 22:38:29 -07:00
711f485643
[Bugfix] Fix import gemm_afp4wfp4 failure on AMD ( #26068 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-10-02 22:37:25 -07:00
9c5ee91b2a
[ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm ( #26104 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-10-02 22:34:53 -07:00
27edd2aeb4
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv ( #26103 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-10-02 22:21:01 -07:00
e5017cd6d6
[gpt-oss] disable tool server initialization if no tool in request ( #25790 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-10-03 05:08:35 +00:00
6a7796e871
[Bug]: Limit num_reqs in dummy_run when max_num_seqs is small ( #26144 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-10-03 04:00:20 +00:00
47b9339546
[DeepSeek] Improve performance of DS MLA cache kernel ( #26132 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-10-02 20:35:47 -07:00
5d5146eee3
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper ( #26138 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-10-02 20:32:38 -07:00
2aaa423842
[Attention] Move Backend enum into registry ( #25893 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-10-02 20:32:24 -07:00
ad2d788016
[Bug][Benchmark] Fix duplicate req in oversampling ( #26140 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-10-03 02:55:24 +00:00
36ce76c632
[Log] Optimize DeepGEMM Missing Log ( #26106 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-02 20:02:26 -06:00
f1fc2107a3
[Bugfix] Disable cascade attention with FlashInfer ( #26130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-02 16:30:37 -07:00
13cdc02173
Fix MTP with deepep_low_latency ( #25904 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-10-02 21:29:49 +00:00
502640c3f9
[Perf] Fix and reapply move apply w8a8 block fp8 linear to class ( #25696 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-10-02 19:35:13 +00:00
3d5f1c8640
[Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP ( #25119 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-10-02 18:48:31 +00:00
1cab2f9cad
EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench ( #25916 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-10-02 11:29:35 -07:00
1e50f1be70
[Deepseek v3.2] Support indexer prefill chunking ( #25999 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-10-02 10:29:12 -07:00
ad87ba927a
[Small] Prevent bypassing media domain restriction via HTTP redirects ( #26035 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-10-02 10:27:10 -07:00
decf7f794b
[BugFix] Fix FI accuracy issue when used for MLA prefill ( #26063 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-10-02 17:18:13 +00:00
d00d652998
[CI/Build] Replace vllm.entrypoints.openai.api_server entrypoint with vllm serve command ( #25967 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-02 10:04:57 -07:00
3b279a84be
[CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests ( #26040 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-10-02 09:07:19 -07:00
5e4a8223c6
[Qwen][ROCm] Flash Attention Rotary Embeddings ( #24642 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-10-02 08:26:08 -07:00
e51de388a2
[Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU ( #25470 )
...
Signed-off-by: leo-pony <nengjunma@outlook.com >
2025-10-02 23:19:22 +08:00
cc253b73d3
[Model] Use merge_by_field_config for MM models (D-F) ( #26076 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-02 08:17:35 -07:00
7d6fb905d9
[Model] Use merge_by_field_config for MM models (A-C) ( #26073 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-02 08:17:31 -07:00
418d111f8c
[FA/Chore] Bump vllm-flash-attention ( #25537 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-10-02 11:06:14 -04:00
be8921fbba
Change size of single CUDA graph for CI to 4 ( #26089 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-10-02 14:14:28 +00:00
d4e7a1152d
Update base image to 22.04 (jammy) ( #26065 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-10-02 05:48:04 -07:00
be22bb6f3d
Run:ai model streamer add GCS package support ( #24909 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
2025-10-01 20:59:13 -07:00
169313b9f8
[Misc] Make handling of SamplingParams clearer in n>1 case ( #26032 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-10-01 19:31:39 -07:00
0b018d8baf
[ROCm][Bugfix] Add missing parameter to ROCm backend ( #26029 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-10-01 19:23:14 -07:00
c31246800c
Support RL online quantization with torchao ( #23014 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-10-01 16:39:29 -07:00
4134312b35
[BugFix] ChunkedLocalAttention is currently not CG compatible ( #26034 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-10-01 16:28:00 -07:00
da554f932e
[Bug] Fix Negative Cuda Memory Usage ( #25683 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-01 18:16:26 -04:00
aac622e0cd
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series ( #25908 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-10-01 21:39:49 +00:00
1726e93ef1
[BugFix][DP/EP] Fix CUTLASS MLA hang under load ( #26026 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-10-01 12:30:00 -07:00
ee04c0cd04
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability ( #26030 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-10-01 12:02:17 -07:00
c36f0aa300
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets ( #25995 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-10-01 18:18:36 +00:00
5234dc7451
[NVIDIA] Blackwell Family ( #24673 )
...
Signed-off-by: Johnny <johnnynuca14@gmail.com >
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnync13@gmail.com >
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com >
Co-authored-by: Salvatore Cena <cena@cenas.it >
2025-10-01 10:50:54 -07:00
3b7c20a6b5
[Bugfix] Apply same sampling parameters for both n=1 and n>1 ( #26005 )
...
Signed-off-by: Kenichi Maehashi <maehashi@preferred.jp >
2025-10-01 14:37:35 +00:00
f9e714813a
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type ( #26007 )
...
Signed-off-by: Nathan Scott <nathans@redhat.com >
2025-10-01 12:41:57 +00:00
2518230d3e
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 ( #25829 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
Co-authored-by: Luka Govedic <ProExpertProg@users.noreply.github.com >
2025-10-01 08:39:45 -04:00
a332b84578
[CI] Only capture a single CUDA graph size in CI by default ( #25951 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-10-01 10:03:44 +01:00
1405f0c7ba
[Misc] Factor out common _apply_feature_select_strategy ( #26003 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-01 01:31:03 -07:00
84d57342b6
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker ( #26004 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-10-01 08:03:25 +00:00
57b46d769e
[Doc] updating torch.compile doc link ( #25989 )
...
Signed-off-by: nadathurv <work.vnadathur@gmail.com >
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com >
Co-authored-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com >
2025-10-01 07:04:56 +00:00
f48b6a03ba
[Misc]allow disable pynccl ( #25421 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-10-01 06:04:13 +00:00
2a69ab4899
Update to Transformers v4.56.2 ( #24638 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-30 22:07:07 -07:00
8d7da92fd7
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 ( #25988 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 21:58:31 -07:00
e952eee698
[Bugfix] Fix __syncwarp on ROCM ( #25996 )
2025-09-30 21:15:11 -07:00
66bca9b8bd
[MM] Add text-only mode for Qwen3-VL ( #26000 )
2025-09-30 21:13:42 -07:00
99028fda44
Fix INT8 quantization error on Blackwell GPUs (SM100+) ( #25935 )
...
Signed-off-by: padg9912 <phone.and.desktop@gmail.com >
2025-09-30 19:19:53 -07:00
1244948885
[Log] Optimize Log for FP8MOE ( #25709 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-30 19:18:43 -07:00
a73f6491c8
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning ( #25843 )
...
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-30 19:18:19 -07:00
001e50c92c
[Model] MTP fallback to eager for DeepSeek v32 ( #25982 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-10-01 01:53:22 +00:00
96ebcaa3ad
[Misc] Make EP kernels install script support uv ( #25785 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 23:38:34 +00:00
5db1870bb9
[gpt-oss] use vLLM instead of openai types for streaming ( #25186 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-09-30 22:47:07 +00:00
2ce26b9b5d
[Docs] Remove API Reference from search index ( #25949 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 22:10:02 +00:00
a388252ac4
Add explicit pooling classes for the Transformers backend ( #25322 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-30 23:07:06 +01:00
9a9f48dff7
[V1] [P/D] Add Support for KV Load Failure Recovery ( #19330 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-09-30 14:57:08 -07:00
67f3fb0844
[Bench] Add DeepSeekV32 to MoE benchmark ( #25962 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-30 14:13:48 -07:00
43b752c325
[Llama4] [multimodal] Fix misplaced dtype cast of cos_sin_cache in Llama4VisionRotaryEmbedding ( #25889 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-30 20:35:15 +00:00
cfd302db9b
OffloadingConnector: Fix GPU block tracking bug ( #25856 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-30 19:53:04 +00:00
fb610ae684
[Docs] Add moe kernel features doc ( #25297 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 19:03:15 +00:00
2f652e6cdf
[Doc] Improve MM Pooling model documentation ( #25966 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-30 18:58:29 +00:00
e6a226efba
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' ( #25958 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-30 11:13:03 -07:00
a2e6fa7e03
[bugfix][deepseek] fix flashmla kernel selection ( #25956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-10-01 00:30:36 +08:00
9f1c4ecaf2
[Bugfix] Token type and position embeddings fail to be applied to inputs_embeds ( #25922 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-01 00:23:12 +08:00
ef283548f7
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging ( #25895 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-09-30 10:51:31 -04:00
f4db5e6de1
[Bugfix][Model] Fix inference for Hunyuan dense models ( #25354 )
...
Signed-off-by: anion <1005128408@qq.com >
Signed-off-by: Anion <123177548+Anionex@users.noreply.github.com >
2025-09-30 14:38:07 +00:00
099aaee536
Add Hugging Face Inference Endpoints guide to Deployment docs ( #25886 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 14:35:06 +00:00
35fe398c7c
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 ( #25858 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-30 07:30:44 -07:00
bb6d43047e
[Fix] Improve CPU backend compatibility for RISC-V ( #25816 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
2025-09-30 13:48:07 +00:00
bc546f76a1
[CI] Move applicable tests to CPU ( #24080 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 14:45:20 +01:00
80608ba5af
[NIXL] Add support for MLA caches with different latent dim ( #25902 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-09-30 12:18:29 +00:00
e184c9c510
[perf] Use CPU tensor to reduce GPU->CPU sync ( #25884 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com >
2025-09-30 19:51:16 +08:00
d7e34b4210
[Model] Move vision_feature_select_strategy into resolve_visual_encoder_outputs ( #25938 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-30 11:24:57 +00:00
ef6e0e7132
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 ( #25936 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-30 19:11:21 +08:00
1ad3aca682
Updated TRL integration docs ( #25684 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 03:10:55 -07:00
8d0afa9b42
[Doc] Add Cambricon MLU support ( #25942 )
...
Signed-off-by: a120092009 <zhaoty0121@gmail.com >
2025-09-30 17:59:47 +08:00
fa7e254a7f
[New Model] DeepSeek-V3.2 (Rebased to Main) ( #25896 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
Signed-off-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com >
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
2025-09-30 17:14:41 +08:00
e23cacda35
[Bugfix]: Clean up chunked prefill logging when using whisper ( #25075 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2025-09-30 08:17:49 +00:00
2e1b8bc2b6
[Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect logical_not ( #25925 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
2025-09-30 08:15:23 +00:00
e47433b3c1
[BugFix] Pass config_format via try_get_generation_config ( #25912 )
2025-09-30 05:09:50 +00:00
23194d83e8
[BugFix] Fix DP/EP hang ( #25906 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 04:18:59 +00:00
61aedb5ffe
Move VllmConfig from config/__init__.py to config/vllm.py ( #25271 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-29 19:49:49 -07:00
d3bd171123
[Benchmark] Support benchmark throughput for external launcher DP ( #25913 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-30 01:43:57 +00:00
89e4050af4
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 ( #25909 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-30 09:15:19 +08:00
78a47f87ce
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models ( #25717 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-30 08:10:58 +08:00
6a113d9aed
[V0 Deprecation] Remove vllm.worker and update according imports ( #25901 )
2025-09-29 23:26:11 +00:00
2e4fe48c37
[NIXL] Increase default KV block eviction timeout on P ( #25897 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-29 21:35:14 +00:00
8eb0a1d906
[Doc] Polish example for torchrun dp ( #25899 )
2025-09-29 21:31:34 +00:00
fea3e476aa
[Kernel] Chunk-aligned mamba2 ( #24683 )
2025-09-29 23:18:25 +02:00
61a3431613
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so ( #25605 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-29 17:01:50 -04:00
9bedac9623
[Doc] Add documentation for vLLM continuous benchmarking and profiling ( #25819 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-09-29 20:49:49 +00:00
c42ff4f4fd
[BugFix][torch.compile] KV scale calculation issues with FP8 quantization ( #25513 )
...
Signed-off-by: adabeyta <aabeyta@redhat.com >
2025-09-29 15:52:04 -04:00
d5ab28511c
[Bugfix] Use correct key "ignore" for config.json non-quantized layers ( #25706 )
...
Signed-off-by: Lee Nau <lnau@nvidia.com >
2025-09-29 15:07:29 -04:00
e61eb5e09d
[Model] Remove MotifForCausalLM ( #25866 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-30 00:36:30 +08:00
0899ba5b42
[CI/Build] Include Transformers backend test in nightly transformers test ( #25885 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-29 09:33:39 -07:00
145ac73317
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue ( #25883 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
2025-09-29 11:37:20 -04:00
d0d138bc55
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) ( #24690 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
2025-09-29 14:31:51 +00:00
43227236ec
[torch.compile] serialize cudagraph_mode as its enum name instead of value ( #25868 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-29 13:54:52 +00:00
8616300ae2
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models ( #25854 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
2025-09-29 10:59:04 +00:00
edbaadd91f
[Bugfix] Fix requirements paths in install instructions ( #25827 )
...
Signed-off-by: yingjun-mou <renzomou@gmail.com >
2025-09-29 03:49:35 -07:00
9360d34fa1
update to latest deepgemm for dsv3.2 ( #25871 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-29 17:51:43 +08:00
1b67b04656
[Misc] Remove more get_input_embeddings_v0 ( #25857 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-29 08:03:37 +00:00
bd51f78e39
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge ( #25331 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: isotr0py <2037008807@qq.com >
2025-09-29 14:09:18 +08:00
65ecb4f134
[Bugfix] Fallback ViT attn backend to SDPA for blackwell ( #25851 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-29 06:03:51 +00:00
143844fa43
[XPU]Fix xpu spec decoding UTs, avoid using cuda graph ( #25847 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-29 05:15:10 +00:00
219cfbe7f6
Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS ( #25832 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-29 05:08:17 +00:00
9b44a7d926
[P/D] NIXL Updates ( #25844 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-09-29 04:46:30 +00:00
a3ae45a38c
[Misc] fix tests failure by using current_platform ( #25825 )
...
Signed-off-by: Juechen Liu <jueliu@meta.com >
2025-09-29 04:18:57 +00:00
0307428d65
Remove redundant cudagraph dispatcher warning ( #25841 )
2025-09-28 17:12:42 -04:00
471997adf6
[Bugfix] fix Qwen3VLMoe load when pp > 1 ( #25838 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
2025-09-28 17:56:12 +00:00
b1ded114b9
Update GLM-4.5 Doc transformers version ( #25830 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-09-28 12:05:51 +00:00
f4e4088c99
Fix random dataset mismatched token length with config. ( #24937 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-28 08:23:44 +00:00
0efd540dbc
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling ( #25557 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-28 04:21:01 +00:00
6144754014
[Bugfix] Fix Qwen3-VL regression from #24982 ( #25814 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-28 03:21:09 +00:00
69311446ba
[MM] Optimize memory profiling for scattered multimodal embeddings ( #25810 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-28 02:17:58 +00:00
da63274d9f
[Bugfix][NIXL] Fix Async Scheduler timeout issue ( #25808 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-27 15:17:35 -04:00
c216119d64
[Core] GC Debug callback ( #24829 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Co-authored-by: Jialin Ouyang <jialino@meta.com >
2025-09-27 17:53:31 +00:00
5546acb463
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location ( #25766 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
2025-09-27 13:36:28 -04:00
c0ec81836f
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable ( #25651 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-27 16:09:00 +00:00
b65e56babe
[Core] Refactor self.model() to call a helper for subclassing. ( #25084 )
...
Signed-off-by: Patrick Toulme <ptoulme@meta.com >
Signed-off-by: Patrick Toulme <pctoulme+1@gmail.com >
2025-09-27 08:40:59 -07:00
49996cd597
[env] default nixl side port conflicts with kv-event zmq port ( #25056 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-27 15:02:40 +00:00
ecb37e276a
[docs] transcriptions API audio upload ( #25446 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-27 15:00:35 +00:00
a5354b3ed2
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models ( #24982 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-27 14:22:28 +00:00
f9df8b4ad7
[Bugfix] Fix triton import precommit failure ( #25803 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-27 07:13:11 -07:00
ec152c8748
Fix GPTQ model loading in Transformers backend ( #25770 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-27 12:18:20 +00:00
7977e5027c
Add filtering for chat template kwargs ( #25794 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-27 10:46:49 +00:00
3f5d902d2a
Validate API tokens in constant time ( #25781 )
...
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
2025-09-27 18:09:26 +08:00
27d7638b94
[Bugfix] Merge MM embeddings by index instead of token IDs ( #16229 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-27 08:15:12 +00:00
176173989a
[Bugfix] Add missing image_size for phi4_multimodal ( #25796 )
2025-09-27 07:59:22 +00:00
23b8ee672d
[Misc] Update openai client example file for multimodal ( #25795 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-27 07:57:07 +00:00
3939152069
[Misc] Fix codeowners override for v1 sample and attention ( #25037 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-27 07:47:29 +00:00
cd87bfbf37
[CI/Build] Reorganize root-level V1 tests ( #25767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-27 13:51:15 +08:00
b3613e3ace
[CI/Build] Add timing to Model Executor Test ( #25799 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-26 21:57:27 -07:00
d346ec695e
[CI/Build] Consolidate model loader tests and requirements ( #25765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 21:45:20 -07:00
c242c98031
[Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL ( #25788 )
2025-09-26 20:44:52 -07:00
f1d53d150c
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl ( #22872 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
2025-09-27 03:35:47 +00:00
92da847cf5
Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile ( #25782 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 18:54:09 -07:00
3958b96bf5
Add option to restrict media domains ( #25783 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-27 01:23:52 +00:00
8bf8f45822
[Core] Don't count preempted tokens in prefix cache hit rate ( #25787 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-27 00:16:40 +00:00
6f5c0931c1
[Spec decode] automatically disable mm for text-only draft models ( #25667 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2025-09-27 08:10:21 +08:00
4e33a7ea85
[Bugfix] Optimize CpuGpuBuffer initialization ( #25447 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-09-27 08:07:36 +08:00
dc48ba0c75
Kernel-override Determinism [1/n] ( #25603 )
...
Signed-off-by: Bram Wasti <bwasti@meta.com >
2025-09-26 16:59:09 -07:00
4778b42660
Reduce the Cuda Graph memory footprint when running with DBO ( #25779 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-26 22:29:56 +00:00
c70ac4b8ff
[spec decode] Consolidate speculative decode method name for MTP ( #25232 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
2025-09-26 22:27:05 +00:00
cf89202855
[CI] Fix FlashInfer AOT in release docker image ( #25730 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 14:11:40 -07:00
f075693da7
[V1] address post issues related to #20059 (part 1) ( #23046 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-26 15:58:19 -04:00
f708bd4904
[CI] Add E2E Blackwell Quantized MoE Test ( #25723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 12:23:00 -07:00
0002b7f0d1
[Docs] Add Toronto Meetup ( #25773 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 12:00:46 -07:00
11aafd9886
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition ( #25355 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-26 11:54:00 -07:00
b761df963c
[Doc]: improve CPU(x86) build-wheel-from-source section ( #25617 )
...
Signed-off-by: Kosseila (CloudThrill) <klouddude@gmail.com >
2025-09-26 10:26:33 -07:00
33f6aaf972
Eagle3 that supports the Minicpm3 model ( #24243 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: liudan <adan@minicpm.com >
Co-authored-by: liudan <liudan@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
2025-09-26 10:04:57 -07:00
56aafa8c0b
[Misc] fix unique_filepath ( #25732 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-26 16:56:15 +00:00
8d52f2b3a7
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray ( #25439 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
2025-09-26 09:43:30 -07:00
984d18498a
[BugFix] Fix using dbo_decode_token_threshold always (and ignoring dbo_prefill_token_threshold) ( #25622 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-26 16:22:49 +00:00
d4d9899860
[Quantization] Add field to skip unquantized modules for GPTQ config ( #25455 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-26 15:47:41 +00:00
db1e42f627
[CI/Build] Fix some V1 tests not being run ( #25569 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 20:52:36 +08:00
bc9d7b5595
[CI/Build] Split up Distributed Tests ( #25572 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 14:49:33 +02:00
fe6b19c314
[Bugfix] Properly abort pooling request. ( #25734 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-26 05:47:34 -07:00
2827b3f4a3
[CI] Fix test_shared_storage_connector_hashes ( #25748 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-26 20:46:17 +08:00
2b6b1d7809
[Model] Mamba2 varlen refactor ( #21467 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: RishiAstra <40644327+RishiAstra@users.noreply.github.com >
2025-09-26 11:31:14 +00:00
633f943e30
[Doc] Update Batch-level DP docs ( #25757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 02:37:40 -07:00
b03b1b97f6
Support LongCat-Flash-Chat tool call ( #24083 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-09-26 09:25:39 +00:00
dfb9af2014
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk ( #25698 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-26 01:25:28 -07:00
19f76ee68e
[misc] refactor speculative config ( #25657 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-26 01:22:06 -07:00
dd70437a4f
Remove cuda hard-code in compute_causal_conv1d_metadata ( #25555 )
...
Signed-off-by: Icey <1790571317@qq.com >
2025-09-26 01:19:20 -07:00
99b3a504c5
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. ( #25743 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-26 01:18:58 -07:00
6e30010d2f
fix: print output offline_inference/base/chat.py example ( #25744 )
...
Signed-off-by: Iceber Gu <caiwei95@hotmail.com >
2025-09-26 01:18:24 -07:00
52621c8f5c
[Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X ( #25703 )
...
Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com >
2025-09-26 01:18:20 -07:00
d48f4d6daf
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled ( #25739 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-26 01:18:09 -07:00
e84e0735c7
fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions ( #25738 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-26 01:18:05 -07:00
3edf87d25f
[CI/Build] fix doc build warning: Failed to get 'name: description' pair ( #25733 )
...
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io >
2025-09-26 01:18:02 -07:00
392edee34a
EVS Support (Video tokens pruning) ( #22980 )
...
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com >
Signed-off-by: Eugene Khvedchenya <ekhvedchenya@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-26 11:54:54 +08:00
983056e456
[Misc] Remove unnecessary memoryviews in shm_broadcast.py ( #25721 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-26 03:11:44 +00:00
13dd93c667
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder ( #25701 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-25 18:21:56 -07:00
53a30845be
Llamas 3.1 405B fp4 changes upstreaming from 355_wip ( #25135 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2025-09-25 19:16:53 -06:00
8b77328ffe
[Misc] Don't log shm dequeue delay warning on worker side ( #25720 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-26 01:08:30 +00:00
9fe4c2bdb9
[Refactor] Remove DeepGEMM OP Register ( #25710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-25 20:13:41 -04:00
081b5594a2
Fix routing_bias dtype ( #25711 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
2025-09-25 23:35:14 +00:00
57329a8c01
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 ( #25708 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2025-09-25 16:10:29 -07:00
8c435c9bce
[Core] Enable command line logging for LLMEngine ( #25610 )
...
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-25 15:31:17 -07:00
e71b8e210d
[Spec Decode] Add Batch Parallel Ngram. Up to 8x lower overhead. ( #24986 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-25 15:22:03 -07:00
89fa54e6f7
[Optimization] Use a cheaper cache key in get_model_architecture ( #25682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 17:54:20 -04:00
3d54bdcb73
[Optimization] Streamline InputPreprocessor ( #25702 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 21:06:49 +00:00
6b0fcbbf43
[Misc] Simplify test_argsort_mm_positions ( #25690 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 18:23:01 +00:00
0fa673af4c
[V0 deprecation] Clean up LoRA ( #25686 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-25 18:12:33 +00:00
3468f17ebe
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names ( #25489 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-25 17:37:50 +00:00
71b25b0d48
[V0 deprecation] Clean up V0 fallback in compilation config ( #25675 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 17:29:51 +00:00
0ea80c87d9
[Model] Define merge_by_field_config MM interface ( #25676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 17:13:07 +00:00
b8d9e4a326
[Model] Add optional parameter to reasoning parser constructor ( #25554 )
...
Signed-off-by: taohui <taohui3@gmail.com >
Signed-off-by: Tao Hui <taohui3@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-26 01:12:50 +08:00
13cc7f5370
[BugFix] Fix DBO hang ( #25625 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-25 17:04:48 +00:00
916bd9204d
Revert "[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super()" ( #25681 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-25 09:45:06 -07:00
e04a1b6b21
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… ( #24662 )
...
Signed-off-by: AlonKejzman <alonkeizman@gmail.com >
2025-09-25 15:40:14 +00:00
2e5df88c92
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning ( #25532 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-25 15:16:06 +00:00
0754ac4c49
[Misc] Remove cruft file in repo ( #25678 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-25 08:05:12 -07:00
03858e6d1c
[Bugfix] Fix InternS1 video processing after Transformers v4.56 ( #25644 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 14:46:04 +00:00
532a6cfccb
[ux] Switch a warning to debug about a pytorch fallback ( #23750 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-25 14:38:16 +00:00
eb32335e35
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata ( #25652 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-25 13:29:11 +00:00
69a8c8e99a
[torch.compile] Make Query Quantization Fusable ( #24914 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2025-09-25 09:25:12 -04:00
6c340da4df
[misc] log info messages by default for hanging / busy / idle ( #25627 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-25 21:14:57 +08:00
2f17117606
[mypy] Fix wrong type annotations related to tuple ( #25660 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 13:00:45 +00:00
1e9a77e037
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar ( #22112 )
...
Signed-off-by: chenlang <chen.lang5@zte.com.cn >
Co-authored-by: chenlang <10346245@zte.com.cn >
2025-09-25 20:46:11 +08:00
d2af67441d
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash ( #25643 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-25 12:38:11 +00:00
0bcc3a160d
[CI/Build] Fix flaky entrypoints test ( #25663 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 12:19:40 +00:00
70fbdb26e9
Add backward compatibility for guided_... API ( #25615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-25 19:45:25 +08:00
7f570f1caa
[V0 deprecation] Remove unreachable model_config.supported_tasks ( #25642 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-25 11:26:31 +00:00
eaeca3cd7f
[Bugfix] Parse SpeculativeConfig Error ( #25142 )
...
Signed-off-by: zxw <1020938856@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-25 11:09:39 +00:00
12c1287d64
[mypy] Further improve MM type annotations ( #25654 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 10:57:36 +00:00
17b4c6685c
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling ( #25648 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 18:36:01 +08:00
3c2b2ccece
[Bugfix] Add triton.language.tensor placeholder ( #25649 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-09-25 10:31:14 +00:00
7be9ffcd9f
[Misc] Fix Qwen3-VL video_grid_thw typing ( #25646 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-25 10:16:45 +00:00
393de22d2e
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin ( #25579 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-09-25 09:39:18 +00:00
1260180c67
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… ( #25607 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-25 08:05:21 +00:00
af4ee63e0e
typo: remove duplicate is ( #25641 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
2025-09-25 00:46:22 -07:00
bc092ea873
Map CwmForCausalLM to llama and LlamaForCausalLM ( #25611 )
...
Signed-off-by: Jacob Kahn <jacobkahn1@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-25 07:37:03 +00:00
755ed7b05b
[Misc] Simplify PoolerOutput and move to v1/outputs ( #25629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 06:47:03 +00:00
a676e668ee
[Bugfix] fix apply_temperature to avoid nan in probs ( #24734 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-09-25 05:32:21 +00:00
c85be1f6dd
optimize: eliminate duplicate split_enc_dec_inputs calls ( #25573 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
2025-09-25 05:03:25 +00:00
845adb3ec6
[Model] Add LongCat-Flash ( #23991 )
...
Signed-off-by: yangxurui <yangxurui@meituan.com >
Co-authored-by: yangxurui <yangxurui@meituan.com >
2025-09-24 21:53:40 -07:00
90b139cfff
Enable Fbgemm NVFP4 on Dense models ( #25609 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-24 21:12:53 -07:00
4492e3a554
[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super() ( #25613 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 18:52:52 -07:00
05c19485a5
[Kernel] Support DCP for Triton backend ( #25132 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-09-24 18:09:34 -07:00
52d0cb8458
[Model] Improve DotsOCRForCausalLM ( #25466 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-25 07:58:08 +08:00
5c1e496a75
[MISC] replace c10::optional with std::optional ( #25602 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-24 16:56:21 -07:00
e7f27ea648
Improve --help for enhanced user experience ( #24903 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 23:08:18 +00:00
1f29141258
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor ( #25517 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 18:52:36 -04:00
6160ba4151
feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel ( #25503 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-09-24 18:50:04 -04:00
fea8006062
[Logging] Improve log for when DeepEP HT disables CUDA Graphs ( #25531 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-24 22:43:06 +00:00
e6750d0b18
[V0 Deprecation] Remove unused classes in attention ( #25541 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-24 13:24:40 -07:00
8c853050e7
[Docs] Enable fail_on_warning for the docs build in CI ( #25580 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 19:30:33 +00:00
f84a472a03
Suppress benign cuBLAS warning when capturing cudagraphs with DBO ( #25596 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-24 19:02:08 +00:00
54e42b72db
Support mnnvl all2allv from Flashinfer ( #21003 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-24 14:38:16 -04:00
2dda3e35d0
[Bugfix] add cache model when from object storage get model ( #24764 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-24 18:11:16 +00:00
d83f3f7cb3
Fixes and updates to bench_per_token_quant_fp8 ( #25591 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 08:30:15 -07:00
302eb941f3
[ROCm][Build][Bugfix] Fix ROCm base docker whls installation order ( #25415 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-24 11:25:10 -04:00
487745ff49
[ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled ( #25275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-24 11:24:39 -04:00
9313be5017
[Misc] Improve type annotations for jsontree ( #25577 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-24 22:49:58 +08:00
8938774c79
Move DeviceConfig, ObservabilityConfig, SpeechToTextConfig to their own files ( #25564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 13:59:05 +00:00
e18b714b2e
[Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output ( #25405 )
...
Signed-off-by: taohui <taohui3@gmail.com >
2025-09-24 20:58:00 +08:00
b1068903fd
[docs] fix nixl kv_connector_extra_config.backends key ( #25565 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 11:00:27 +00:00
164299500b
[Benchmark] Fix regression in structured output benchmark ( #25500 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-24 10:40:42 +00:00
58c360d9be
[Bug] fix import and unit test ( #25558 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
2025-09-24 10:17:59 +00:00
42488dae69
[Bugfix] Fix dummy video number of frames calculation ( #25553 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-24 09:47:30 +00:00
b67dece2d8
[misc] update the warning message ( #25566 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-24 17:24:35 +08:00
2338daffd3
[BugFix] Potential Fix for FA3 full-cudagraph IMA ( #25490 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-24 02:04:04 -07:00
2e19a848d4
[V0 Deprecation] Remove max_seq_len_to_capture ( #25543 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-24 01:51:39 -07:00
77a7fce1bb
[CI/Build] add nightly prime-rl integration tests ( #25207 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 08:44:22 +00:00
6488f3481b
[Misc] Move processing context to multimodal directory ( #25548 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-24 08:15:00 +00:00
27ec3c78f3
[CI/Build] Fix v1 OOT registration test ( #25547 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-24 08:03:13 +00:00
1cbcfb94de
[Bugfix][CPU] Skip unsupported custom op register on CPU ( #25534 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-24 06:21:51 +00:00
fed8a9b107
[Misc] Retry HF processing if "Already borrowed" error occurs ( #25535 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 22:32:11 -07:00
190c45a6af
[TPU][Bugfix] fix the missing apply_model in tpu worker ( #25526 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-09-24 05:18:08 +00:00
5caaeb714c
[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls ( #25514 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-09-24 03:20:38 +00:00
d747c2ef18
[Perf] Fix jit compiles at runtime of fla gated delta rule ( #25432 )
...
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 11:16:13 +08:00
c30b405b8f
[Spec Decode] Enable FlashInfer Spec Decoding ( #25196 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Co-authored-by: lhsjohn <huashuoli@tencent.com >
2025-09-23 22:29:58 -04:00
77d906995c
[KV sharing] Re-land Gemma3n model changes from #22628 ( #24357 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-23 19:25:34 -07:00
359d293006
[fix]: add Arm 4bit fused moe support ( #23809 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-09-24 01:32:22 +00:00
9df8da548e
[BugFix] Fix MLA assert with CUTLASS MLA ( #25478 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 21:09:43 -04:00
bf68fd76a9
[Compile] Fix AMD Compile Error ( #25518 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 00:42:48 +00:00
de94289a98
[Core] Support weight_loader_v2 for UnquantizedLinearMethod ( #23036 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-23 18:30:26 -06:00
1983609239
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen ( #25520 )
2025-09-24 00:19:56 +00:00
d06b5a95cb
[V1][Metrics] Add per-request TPOT histogram ( #24015 )
...
Signed-off-by: baxingpiaochong <771405853@qq.com >
2025-09-23 18:19:04 -06:00
be0bb568c9
[Model] Support SeedOss Reason Parser ( #24263 )
...
Signed-off-by: Yan Lu <luyan@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 18:15:51 -06:00
c8bde93367
[BUG] Allows for RunAI Streamer and Torch.compile cache to be used together ( #24922 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-09-23 18:13:32 -06:00
88d7bdbd23
[Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' ( #25519 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 00:07:51 +00:00
0d235b874a
Add CUTLASS FP8 MOE benchmark scripts and kernel config ( #25302 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
2025-09-23 18:07:42 -06:00
7ad5e50adf
Improve output when failing json.loads() on structured output test ( #25483 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-09-23 18:03:31 -06:00
dc464a3d39
[BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch ( #25505 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 18:00:29 -06:00
1210e4d95b
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 ( #25509 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-09-23 16:57:55 -07:00
e0b24ea030
[Perf] Increase default max splits for FA3 full cudagraphs ( #25495 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 16:53:34 -07:00
bde2a1a8a4
[ROCm] Small functional changes for gptoss ( #25201 )
...
Signed-off-by: jpvillam <jpvillam@amd.com >
Co-authored-by: jpvillam <jpvillam@amd.com >
2025-09-23 23:39:50 +00:00
5e25b12236
[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel ( #25197 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
2025-09-23 23:23:30 +00:00
c85d75cf08
Add VLLM_NVTX_SCOPES_FOR_PROFILING=1 to enable nvtx.annotate scopes ( #25501 )
...
Signed-off-by: Corey Lowman <clowman1993@gmail.com >
2025-09-23 22:50:09 +00:00
abad204be6
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting ( #25359 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-09-23 15:49:09 -07:00
7361ab379f
Remove redundant mutates_args and dispatch_key for direct_register_custom_op ( #25512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 22:48:40 +00:00
95bc60e4cb
[gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI ( #25428 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-23 15:46:46 -07:00
4f2954f724
Fix triton_reshape_and_cache_flash.py triton import ( #25522 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 15:26:10 -07:00
eca7be9077
Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… ( #25493 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
2025-09-23 22:17:49 +00:00
969b4da3a6
[V0 Deprecation] Remove placeholder attn ( #25510 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-23 22:12:14 +00:00
4f8c4b890a
[Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] ( #24830 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-09-23 15:11:14 -07:00
ae002924e9
[CI/Build] Fix and re-enable v1 PP test on CI ( #25496 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 21:58:25 +00:00
690f948e4a
[Bugfix] Fix for the import error from #24588 ( #25481 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-23 21:31:08 +00:00
08275ec0a2
[Build] Update Xgrammar to 0.1.25 ( #25467 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-23 21:25:46 +00:00
c828d1bf98
[Bugfix] gpt-oss container tool output bug ( #25485 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2025-09-23 20:43:45 +00:00
8b8a8afc89
[CI] Fix Pre-commit Issue ( #25497 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 04:09:37 +08:00
8bdd8b5c51
Enable symmetric memory all reduce by default only enabling for TP ( #25070 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 15:53:00 -04:00
a8ffc4f0f2
[Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 ( #25508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 12:49:55 -07:00
d5944d5146
[Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue ( #25406 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-09-23 15:44:35 -04:00
24fab45d96
[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE ( #25444 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 15:29:26 -04:00
63400259d0
[Performance] Move apply_w8a8_block_fp8_linear to an op class ( #24666 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-09-23 12:03:10 -07:00
8c1c81a3de
[core] add nccl symmetric memory for all reduce ( #24532 )
...
Signed-off-by: Amir Samani <asamani@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 14:33:06 -04:00
a3a7828010
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 ( #24988 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com >
2025-09-23 14:31:45 -04:00
5abb117901
[Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank ( #25487 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-23 18:19:25 +00:00
867ecdd1c8
[Spec Decode][CI] Add e2e test for examples/spec_decode.py and prevent breaking Acceptance Length ( #24531 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-23 10:46:40 -07:00
24e8222745
[Misc] Reduce initialization time of auto_tune ( #23682 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-23 17:34:58 +00:00
100b630a60
[V1][Kernel] Add triton implementation for reshape_and_cache_flash ( #24503 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-23 12:52:40 -04:00
527821d191
Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu ( #25346 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-23 09:45:39 -07:00
846197f505
[Log] Optimize kv cache memory log from Bytes to GiB ( #25204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-23 12:44:37 -04:00
2357480b1a
[BugFix] Fix UB in per_token_group_quant.cu ( #24913 )
...
Signed-off-by: Shreeasish Kumar <shreeasish@rivosinc.com >
2025-09-23 09:14:22 -07:00
f11e3c516b
[Kernels] Support blocked fp8 quantization for compressed tensors MoE ( #25219 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 16:11:34 +00:00
875d6def90
Add backward compatibility for GuidedDecodingParams ( #25422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-23 17:07:30 +01:00
cc1dc7ed6d
[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support ( #24845 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-23 16:02:10 +00:00
a903669e10
[V1] Remove V0 code paths for Hybrid models ( #25400 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-23 08:26:13 -07:00
2c58742dff
[UX] Change kv-cache-memory log level to debug ( #25479 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 08:01:24 -07:00
4c966e440e
[XPU] Fix MOE DP accuracy issue on XPU ( #25465 )
2025-09-23 14:32:57 +00:00
da5e7e4329
[Docs] NixlConnector quickstart guide ( #24249 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Signed-off-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-09-23 14:23:22 +00:00
f05a4f0e34
[P/D] Support NIXL connector to disconnect during a clean shutdown ( #24423 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-09-23 16:08:02 +02:00
61d1b35561
[BugFix] Register expert_map as named buffer for wake_up and sleep ( #25458 )
...
Signed-off-by: wuxibin <wuxibin@bytedance.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-23 21:49:13 +08:00
b6a136b58c
[CI/Build] Fix disabled v1 attention backend selection test ( #25471 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 13:05:46 +00:00
0d9fe260dd
[docs] Benchmark Serving Incorrect Arg ( #25474 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-23 06:05:11 -07:00
273690a50a
[Core] Optimize LoRA weight loading ( #25403 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-23 18:19:45 +08:00
231c2c63e4
[Bugfix] Fix idefics3 tie_word_embeddings ( #25454 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 10:06:48 +00:00
4322c553a6
[Test]: Hermes tool parser stream output error in Qwen3 case ( #25203 )
...
Signed-off-by: Andreas Hartel <andreas.hartel@aleph-alpha.com >
2025-09-23 17:56:31 +08:00
babad6e5dd
[Misc] Move DP for ViT code inside model executor dir ( #25459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 09:20:52 +00:00
9383cd6f10
[Frontend] Add a new xml-based tool parser for qwen3-coder ( #25028 )
...
Signed-off-by: Zhikaiiii <1658973216@qq.com >
2025-09-23 16:07:27 +08:00
ba8d2165b6
Handle triton kernel import exception ( #25319 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-23 00:56:00 -07:00
c98be0a232
[Model] Enable DP for ViT in Qwen2-VL ( #25445 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 05:17:10 +00:00
5774b0a1da
[NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend ( #25121 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-23 04:17:42 +00:00
e8db44f883
[DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP ( #24588 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-09-22 21:01:09 -07:00
fafbe11af4
[Docs] Fix griffe warnings in vllm/lora/ops ( #25369 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-23 03:42:58 +00:00
78237e43bf
[Bugfix] Remove contiguous output req for context parallel MLA ( #25414 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-22 20:26:32 -07:00
eea1783989
[benchmarks]allow skip ready check for bench serve ( #25420 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-09-23 03:21:48 +00:00
f225ea7dd9
[XPU] Fix compile_size is None case. ( #25433 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-23 03:09:00 +00:00
fc97733da8
[feat] Support MRoPE + YaRN ( #25384 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
2025-09-23 03:04:47 +00:00
4741239db7
[Bug] Fix Long Context OOM Issue ( #25290 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-22 22:04:15 -04:00
c625f9043c
[V0 deprecation] Remove _set_default_args_v0 function ( #25409 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 01:52:09 +00:00
6fa78d8f23
[V0 deprecation] Remove platform v1 controlling interface ( #25410 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 01:48:12 +00:00
9949aa2ef1
[Perf] Apply torch.compile for per_block_cast_to_fp8 ( #24611 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-22 19:42:45 -06:00
0b7bed9c38
[Performance] Remove input pads in cutlass_mla and optimize v_proj output handling ( #25184 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-09-22 19:20:53 -06:00
ac0048c0ae
[BugFix] [DP/EP] Fix slow execution when BS <= DP ( #25407 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com >
2025-09-22 17:26:17 -07:00
090197034f
[Bugfix] Fix missing clear_connector_metadata ( #25397 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-23 08:10:59 +08:00
f31ff87460
[Core] Drop overly aggressive whisper assertion ( #25408 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-22 17:09:52 -07:00
d588cd2406
[Bugfix] fix custom op test ( #25429 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-09-23 00:07:43 +00:00
45d7d852d3
[Frontend] Responses API MCP tools for built in tools and to pass through headers ( #24628 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Signed-off-by: Alec S <10566873+alecsolder@users.noreply.github.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-22 23:38:19 +00:00
8bed179109
[TPU] update torch_xla dependency for PyPI compatibility ( #25278 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-09-22 16:14:44 -07:00
f552d5e578
[CI/Build] Skip Qwen3-VL initialization tests until models are actually released ( #25394 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 13:18:24 -07:00
8db2939289
[KV offload][5/N] Add CPUOffloadingSpec ( #24251 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-22 12:30:36 -07:00
d5e0fca264
[torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug ( #23091 ), fix test ( #24376 ), and prep for custom op matching ( #24604 ) ( #24542 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-22 12:30:05 -07:00
8d0ee5a564
[misc] Remove RFC review hours reference ( #25416 )
2025-09-22 12:16:59 -07:00
922979bfcc
[DP] support torchrun external launcher with Data Parallelism ( #24899 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-22 12:06:05 -07:00
239ef0c1ac
[CI Failure] Fix fp8 kv cache on <SM90 ( #25396 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-22 18:27:51 +00:00
1d7f95b85c
[Compiler] Disable Inductor standalone compile by default ( #25391 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-09-22 17:37:46 +00:00
cfbee3d0e7
[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables ( #25274 )
...
Signed-off-by: qqma <qqma@amazon.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: qqma <qqma@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-22 10:37:43 -07:00
06a41334c7
[EPLB] Reduce EPLB Inference Overhead ( #24573 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-22 16:31:05 +00:00
175811e3b5
[V1][Attention] Split triton_attn in triton-only and rocm specific backends ( #24648 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-22 15:20:28 +00:00
c10101a3eb
[Bugfix] Fix several issues with p2p xPyD in GET type ( #23993 )
...
Signed-off-by: Csrayz <jover@cmbchina.com >
Signed-off-by: ivyilike <pww123@cmbchina.com >
Co-authored-by: ivyilike <pww123@cmbchina.com >
2025-09-22 14:53:13 +00:00
ac243886b0
[Kernel] MI-300X triton moe configs ( #23445 )
...
Signed-off-by: Sara Kokkila Schumacher <saraks@ibm.com >
2025-09-22 14:29:54 +00:00
3d2c56b7a9
Make mypy behave like a proper pre-commit hook ( #25313 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-22 12:23:45 +00:00
64c824cd78
Make pickle import check fast ( #25379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-22 04:08:25 -07:00
417a164af6
[Misc] Remove unused encoder-decoder error strings ( #25374 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 11:04:32 +00:00
b6f01bd9a7
refactor: abstract graph mode support into platform interface ( #25161 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-09-22 10:22:29 +00:00
4cf71cc88a
[TPU] Deprecate `xm.mark_step` in favor of `torch_xla.sync` ( #25254 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-22 10:12:57 +00:00
a66d131381
[TPU][Bugfix][CI] Fix broken tests/build dependency ( #25255 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-22 09:55:04 +00:00
21467f9a1c
Enable Eagle3 speculative decoding for GPT-OSS model ( #25246 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
2025-09-22 08:50:39 +00:00
f92d952632
[V0 Deprecation] Remove MultiModalPlaceholderMap ( #25366 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 08:49:19 +00:00
6d0b827cbd
[V0 Deprecation] Remove V0-only methods in multi-modal registry ( #25362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 13:58:26 +08:00
0eecb31663
[Bugfix] Fix hermes tool parser handling of non-string argument types ( #22002 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-22 11:35:39 +08:00
793be8d057
[Docs] GSM8K Accuracy Evaluation doc update ( #25360 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-22 02:49:13 +00:00
7b57a433da
[Model] Support Dots OCR ( #24645 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: yinz-aizip <yinz@aizip.ai >
2025-09-22 02:24:40 +00:00
5aeb925452
Multimodal - audio tests ( #25285 )
...
Signed-off-by: Debolina Roy <debroy@redhat.com >
2025-09-22 07:07:11 +08:00
04d3752329
[Bugfix][V0 Deprecation][CI] use async mock and await for async method ( #25325 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2025-09-22 07:06:16 +08:00
bc6e542d9f
Remove V0 attention backends ( #25351 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-21 16:03:28 -07:00
af7dfb0d1a
[Perf] Further optimization for Qwen3-VL fast_pos_embed_interpolate ( #25347 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-21 20:12:45 +00:00
1c3ffdbecc
[V0 Deprecation] Remove V0 sampling metadata ( #25345 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-21 10:37:11 -07:00
c438b2951c
feat: Enable engine-level arguments with speculators models ( #25250 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-09-21 11:04:45 -06:00
0ff8ebb2d7
[V0 Deprecation] Remove async_output_proc, preemption mode, delay factor ( #25334 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-21 08:52:32 -07:00
26e673fe93
[V0 Deprecation] Remove V0 Sequence class & Sampler ( #25332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-21 08:52:15 -07:00
65a5910ce3
[Optimization] Cache chat template result when processor fails to be loaded ( #25341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-21 19:41:02 +08:00
9aea7373ff
[Bugfix] Typos in error message for missing model config file ( #25339 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2025-09-21 04:36:47 -07:00
30d08911f7
[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate ( #25337 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-21 11:05:20 +00:00
cf56cf78b4
[V1] Add sliding window support to Flex Attention backend ( #24089 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-21 05:08:07 +00:00
7ed82d1974
[V0 Deprecation] Remove V0 MP executor ( #25329 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 21:26:35 -07:00
12dbd834cf
[V0 Deprecation] Remove from_seq_group methods ( #25330 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 21:10:48 -07:00
035fd2bd2c
[Multi Modal][Performance] Fused Q,K's apply_rope in more models ( #25005 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-21 03:55:10 +00:00
1cd885bd54
[V0 Deprecation] Remove V0 model runner base & simplify worker base ( #25328 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 20:49:09 -07:00
62b38dc832
[Doc] improve test-pipeline.yaml documentation ( #25305 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-20 20:29:12 -07:00
c99db8c8dd
[V0 Deprecation] Remove V0 core ( #25321 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:58:26 -07:00
72dd1595b4
[CI] Skip tests failing on main ( #25326 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:57:46 -07:00
572ddf83ce
[Chore] Remove unused sampler in models ( #25324 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:53:20 -07:00
86647d1cd0
[V0 Deprecation] Remove V0 Output Processor ( #25320 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 17:57:20 -07:00
52c2a8d4ad
[V0 Deprecation] Remove LLMEngine ( #25033 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 17:56:30 -07:00
367a480bd3
[Docs] Fix warnings in vllm/profiler and vllm/transformers_utils ( #25220 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-20 16:39:47 -07:00
bef180f009
[V0 Deprecation] Enable the remaining multimodal tests in V1 ( #25307 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 17:50:58 +00:00
d88918e4c2
[Core] Enable sharded state loader for V1 engine and enhance test coverage ( #25308 )
...
Signed-off-by: pengdrumli <pengdrumli@tencent.com >
2025-09-20 21:15:22 +08:00
3c713a9711
[Model] Cleanup InternViT's data parallel implementation ( #25306 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-20 05:46:24 -07:00
bf8b26cad1
Generate _ModelInfo properties file when loading to improve loading speed ( #23558 )
...
Signed-off-by: Manoel Marques <manoel.marques@ibm.com >
Signed-off-by: Manoel Marques <manoelmrqs@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-20 11:51:13 +00:00
032d661d27
[Docs] Fix warnings in mkdocs build (continued) ( #25042 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-20 11:45:18 +00:00
e08a3a3fdb
[CI Failure] Disable FlashInfer RoPE to unblock CI ( #25299 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-20 08:16:56 +00:00
3d9a1d2de5
[V1] Support LLM.apply_model ( #18465 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 07:14:35 +00:00
be874c0201
[Bugfix] Fix Qwen3-VL-MoE weight loading for EP ( #25300 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-20 00:04:05 -07:00
9607d5eb44
[Hybrid Allocator] Support full attention with different hidden size ( #25101 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-19 23:43:59 -07:00
c60e6137f0
[Optimization] Avoid repeated model architecture conversion for pooling models ( #25261 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 13:30:22 +08:00
f91480b2d4
[Bugfix] fix tool call arguments is empty ( #25223 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: xin.li <xin.li@daocloud.io >
2025-09-20 13:29:54 +08:00
6c5f82e5aa
[BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention ( #25298 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-20 04:41:23 +00:00
b7f186bbb3
[BugFix] Exclude self when checking for port collision ( #25286 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-20 12:28:31 +08:00
3642909617
[BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) ( #25268 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-09-20 11:18:13 +08:00
c308501cb6
Improve weight loading for encoder models in Transformers backend ( #25289 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-20 03:11:03 +00:00
535d80056b
[Misc] Support more collective_rpc return types ( #25294 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-20 02:02:38 +00:00
a25ade5d47
[BugFix] Ensure appropriate guards in destructors ( #25284 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-20 09:06:34 +08:00
8945b001db
[torch.compile] CUDAGraph Inductor partition integration ( #24281 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Boyuan Feng <fby.1994@gmail.com >
Signed-off-by: boyuanfeng <boyuan@meta.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-20 01:02:15 +00:00
b8a287a0a8
[docs] Prompt Embedding feature support ( #25288 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-19 17:46:23 -07:00
c7e713616a
test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support ( #25291 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-19 17:33:40 -07:00
a36c675817
Don't skip special tokens with hermes-style tool calling ( #25281 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-19 17:33:25 -07:00
3da17c2cc2
[Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE #2969 ( #25090 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2025-09-19 20:27:21 -04:00
14c1432789
[BugFix] Fix async scheduling CPU tensor race take 2 ( #25279 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-19 16:34:07 -07:00
ee7a66dd9a
allow disable flashinfer prefill ( #25276 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 22:59:41 +00:00
431535b522
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust ( #22771 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 22:40:33 +00:00
711e912946
[Compile] Fix Compile Warning for Ignoring MIN_BLOCK_PER_SM ( #25193 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-19 16:23:19 -06:00
e69e0b8b5f
[Frontend] Responses API messages out, just harmony for now ( #24985 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-19 21:40:16 +00:00
ddc9048394
Fix: Correct FusedMoE layer reference in auto_round quantization ( #24818 )
...
Signed-off-by: David-Wen <18927700430@163.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-19 20:44:24 +00:00
b1a63d1b3b
[BugFix] Make FlashInferMetadataBuilder non-blocking ( #25040 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 20:36:34 +00:00
48ecb4438b
[Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available ( #21126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-19 14:06:49 -06:00
e57fc15971
Specify platform in pip-compile pre-commit hook so it runs on MacOS ( #25273 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 12:43:33 -07:00
4bdf400218
[Bugfix] Fix chunked a2_scales in modular kernels ( #25264 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-19 19:42:01 +00:00
7852b82b93
[Bugfix] GPT OSS Attribute error on H100 ( #25228 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-09-19 13:14:09 -06:00
a2a5f79e09
Optimize triton unified attention performance for sliding window attention ( #24390 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
2025-09-19 13:07:26 -06:00
c59a0eca42
[KV offload][4/N] Offloading KV connector ( #22595 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 19:07:17 +00:00
b716ab93a7
[bugfix] fix structured outputs key missing issue from #24929 ( #25195 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 18:37:57 +00:00
138f0d1e75
[Docs] add __init__.py to vllm/model_executor/layers/quantization/compressed_tensors/transform ( #24974 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 18:32:27 +00:00