Commit Graph

  • f68f7ee030 Revert "[nit]: Fix import for the lmcache integration (#27600)" revert-27600-torch-utils-import Yihua Cheng 2025-10-28 13:46:05 -07:00
  • e179b705e9 Merge branch 'main' into wentao-refactor-batch-invariant-fp8-deepgemm Wentao Ye 2025-10-28 14:39:20 -04:00
  • 141e6a0505 [Misc] Make reorder batch also separate extends (#27367) Lucas Wilkinson 2025-10-29 01:55:10 +08:00
  • 130aa8cbcf Add load pattern configuration guide to benchmarks (#26886) Matvei Pashkovskii 2025-10-28 19:49:15 +02:00
  • e3d8186666 [compile] Add fallback path to AOT compile when serialization fails. (#27350) Zhengxu Chen 2025-10-28 12:54:26 -04:00
  • f5710ef02a [Misc] Make LayerBlockType a Literal instead of Enum (#27658) Cyrus Leung 2025-10-29 00:23:35 +08:00
  • a8c02fb5bf [Bugfix][CI] Fix v1 attention backend tests and add CI coverage (#26597) Mohammad Miadh Angkad 2025-10-28 23:42:05 +08:00
  • 02af36df36 [Bugfix] Fix allocation & free logic of SingleWriterShmRingBuffer (#27117) Kero Liang 2025-10-28 23:01:24 +08:00
  • e88bdd60d9 [FLA] Introduce Kimi Delta Attention(KDA) to VLLM (#27654) Zhiyuan Li 2025-10-28 22:56:28 +08:00
  • 05e034f085 [nit]: Fix import for the lmcache integration (#27600) Samuel Shen 2025-10-28 07:40:55 -07:00
  • 90978b2799 Merge branch 'main' into wentao-refactor-batch-invariant-fp8-deepgemm Wentao Ye 2025-10-28 10:30:12 -04:00
  • 936643a868 [BugFix] Also consider RAY_EXPERIMENTAL_NOSET_* when storing compilation cache (#27294) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-10-28 16:22:28 +02:00
  • b186149e8e [Bugfix][Frontend] validate arg priority in frontend LLM class before add request (#27596) Junpu Fan 2025-10-28 07:02:43 -07:00
  • 2abbd351ef [Core] Enable async scheduling for external_launcher mode (#27394) 22quinn 2025-10-28 06:52:47 -07:00
  • 446912d1cb fix: allow HuggingFace standard chat template params via **kwargs (#27622) wangln19 2025-10-28 21:12:34 +08:00
  • a00d6254e9 [compile] Disable dynamo guards check for AOT compilation. (#27288) Zhengxu Chen 2025-10-28 08:58:12 -04:00
  • 05181cc57f [Hybrid] Add mamba_block_size to Engine Args (#27289) Asaf Joseph Gardin 2025-10-28 14:54:24 +02:00
  • 259504e147 [compile] Add enable_prompt_embeds to compile hash. (#27285) Zhengxu Chen 2025-10-28 08:46:03 -04:00
  • 0484b64248 [Bug] Fix shape issue for eplb expert weights (#27589) Wentao Ye 2025-10-28 08:44:05 -04:00
  • f58d9b6404 [Misc] Separate out utils.counter and move utils.Device to engine (#27588) Cyrus Leung 2025-10-28 20:20:46 +08:00
  • 44b5ce956d [Bugfix] In LongRoPE, decide short vs long based on max_model_len (#27431) Matthew Bonanni 2025-10-28 08:00:56 -04:00
  • 7a865f2325 [V0 Deprecation] Remove vestigial V0 logits_processors.py file (#27601) Nick Hill 2025-10-28 04:17:45 -07:00
  • 2fa90bda27 Fix a robust parsing issue in KimiK2ToolParser that causes IndexError (#27565) wangln19 2025-10-28 19:11:50 +08:00
  • 0291fbf65c [CI/Build] Fix amd model executor test (#27612) Zhewen Li 2025-10-28 01:58:11 -07:00
  • b46e4a06f1 [Core][Bookkeeping Optimization] Update against numpy view of is_token_ids tensor (#27618) Jialin Ouyang 2025-10-28 01:13:10 -07:00
  • d34f5fe939 [Bugfix][CPU] Fallback oneDNN linear to torch linear to fix half gemm support on legacy platforms (#27526) Li, Jiang 2025-10-28 14:25:44 +08:00
  • bdb01a38fe [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.6 for MI300X (#27323) Eric Yue 2025-10-28 13:58:06 +08:00
  • 5b3c35a68e [ROCm] [Doc] Update ROCm installation docs (#27327) vllmellm 2025-10-28 13:00:50 +08:00
  • 61fbfe5274 [Bugfix] fixed inconsistent finish_reason handling between V0 and V1 engines (#27555) Chauncey 2025-10-28 10:18:08 +08:00
  • 255e34ca50 [Stability fix] turn off HMA allocator when connector is set (#27592) Kuntai Du 2025-10-27 18:32:23 -07:00
  • a8d2e326ec [Bugfix][CI] Fix config resolving logic with remote models (#27610) Roger Wang 2025-10-27 17:48:32 -07:00
  • 53a56e658b [gpt-oss][2/N] Support input_messages in responsesRequest (#26962) Andrew Xia 2025-10-27 16:15:49 -07:00
  • e5b7958d76 Refactor batch invariant fp8 deepgemm yewentao256 2025-10-27 13:23:00 -07:00
  • 69f064062b Code quality improvements: version update, type annotation enhancement, and enum usage simplification (#27581) usberkeley 2025-10-28 01:50:22 +08:00
  • 921e78f4bb [ROCm] Update AITER branch for ROCm base docker (#27586) Micah Williamson 2025-10-27 12:22:33 -05:00
  • b2f24cd6b7 add todo yewentao256 2025-10-27 09:52:17 -07:00
  • bc955355f8 Merge branch 'main' into wentao-batch-invariance-dp yewentao256 2025-10-27 09:32:27 -07:00
  • 6ebffafbb6 [Misc] Clean up more utils (#27567) Cyrus Leung 2025-10-27 23:30:38 +08:00
  • 3b96f85c36 [Chore]: Stream tokens vs characters in tool call parser tests (#26513) Ben Browning 2025-10-27 11:06:25 -04:00
  • 23ad820553 fixing mm placeholder replacement issue with gemma3 (#27538) tingtinggithub 2025-10-27 07:34:01 -07:00
  • 5d3be3ba4c [Bugfix][LoRA][FusedMoE] Select MxFP4 Backend based on LoRA Enablement (#27487) Varun Sundar Rabindranath 2025-10-27 10:32:50 -04:00
  • 4f882be4a0 [Model] Siglip2 Model Support (#27566) Yu Jiaqi 2025-10-27 21:57:37 +08:00
  • 9273754222 [Hybrid] Added supports_mamba_prefix_caching Protocol (#27339) Asaf Joseph Gardin 2025-10-27 15:05:20 +02:00
  • f4e8154076 [Kernel] Enable moe LoRA kernel support FP16 (#27468) Jee Jee Li 2025-10-27 19:48:37 +08:00
  • a663f6ae64 [cpu][perf] Fix low CPU utilization with VLLM_CPU_OMP_THREADS_BIND on AArch64 (#27415) Fadi Arafeh 2025-10-27 11:14:55 +00:00
  • a4fc21895e [Bugfix] Fixed when return_token_ids=False, the first event still contains prompt_token_ids. (#27561) Chauncey 2025-10-27 19:06:43 +08:00
  • a3e8611da5 [Bugfix] Limit the default value of max_model_len when it is not specified by users (#27556) Shanshan Shen 2025-10-27 18:16:20 +08:00
  • 7c2bdb83dc [Misc] Clean up utils (#27552) Cyrus Leung 2025-10-27 17:05:40 +08:00
  • 9932ed6a83 [Kernel] Adding split_K implementation for fused_moe_lora (#27291) Danielle Robinson 2025-10-27 02:05:24 -07:00
  • 2d631d28c6 [Doc] Slight improvement to M2 and beyond (#27554) Jee Jee Li 2025-10-27 17:02:10 +08:00
  • b368382964 [Model] Deprecate merge_by_field_config=False (#27551) Cyrus Leung 2025-10-27 16:43:00 +08:00
  • a806c14cc7 [Performance][LoRA] add context varying params to 'do_not_specialize' in fused moe lora (#27445) gnovack 2025-10-26 23:31:55 -07:00
  • 181bf5bbde [Docs] remove the incorrect enable_reasoning parameter (#27550) yyzxw 2025-10-27 14:17:19 +08:00
  • cbd5e07a51 [Model] Use merge_by_field_config for MM models (Qwen series) (#27546) Cyrus Leung 2025-10-27 13:38:05 +08:00
  • 63b22e0dbb [Model][Bugfix] fix ernie45 moe 300B SharedFusedMoE output tuple (#27316) CSWYF3634076 2025-10-27 11:53:31 +08:00
  • 5980604c44 Fix MiniMax-M2 copyright (#27537) Roger Young 2025-10-27 11:29:51 +08:00
  • 361a7463d3 fix m2 test (#27536) youkaichao 2025-10-27 01:04:36 +08:00
  • 720af6ab79 [Model][MiniMax-M2] Support MiniMax-M2 Model (#27535) Roger Young 2025-10-27 00:59:11 +08:00
  • 55cba4a05c [CI/Build] Update causal-conv1d installation (#27529) Cyrus Leung 2025-10-26 22:14:22 +08:00
  • c7abff2990 Revert "[CI/Build] Use CPU for mm processing test on CI (#27522)" (#27531) Cyrus Leung 2025-10-26 19:44:27 +08:00
  • 71b1c8b667 [Chore]:Extract math and argparse utilities to separate modules (#27188) Yeshwanth N 2025-10-26 16:33:32 +05:30
  • 8fb7b2fab9 [Doc] Fix links to GH projects (#27530) Cyrus Leung 2025-10-26 17:55:51 +08:00
  • be7b55a83d [Doc] Remove Molmo warning (#27527) Cyrus Leung 2025-10-26 16:22:52 +08:00
  • 315b860abe [bugfix]fix empty prompts for async-engine mode in benchmark throughput (#27494) Lucia Fang 2025-10-26 01:16:35 -07:00
  • 87c41c26ad [Bugfix] Fix processor initialization for model from modelscope instead of HF (#27461) rongfu.leng 2025-10-26 15:44:31 +08:00
  • 65d2cf9511 [BUGFIX][ROCM] ViT FlashAttention on ROCm (no GFX9) and contiguous on qwen3vl ROCm TORCH_SDPA (#27190) JartX 2025-10-26 08:08:52 +01:00
  • d63cd9ff10 [CI/Build] Use CPU for mm processing test on CI (#27522) Isotr0py 2025-10-26 13:09:18 +08:00
  • 66a168a197 [CI/Build] Refactor processing tests (#27470) Cyrus Leung 2025-10-26 00:14:30 +08:00
  • a99564ac5b [Attention] Add missing kv cache scale setup (#27490) Matthew Bonanni 2025-10-25 03:12:49 -04:00
  • 4c5f632165 [Misc] Simplify max tokens in multimodal registry (#27500) Cyrus Leung 2025-10-25 14:56:01 +08:00
  • b853540388 [Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (#25712) Kuntai Du 2025-10-24 23:34:18 -07:00
  • 56ed7609a9 Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (#27502) Zhuohan Li 2025-10-24 22:31:43 -07:00
  • 29c9cb8007 [CI] Add tests for cudagraph (#27391) Jiangyun Zhu 2025-10-25 10:37:33 +08:00
  • f048f16ba7 fix pre-commit zhuohan/revert-26709 Zhuohan Li 2025-10-24 17:45:23 -07:00
  • 180880ddc3 Revert #26709 Zhuohan Li 2025-10-24 17:33:11 -07:00
  • 3e0a770c15 Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (#26709)" Zhuohan Li 2025-10-24 17:32:10 -07:00
  • 83f478bb19 [KVConnector] Migrate the LMCache integration code to be vLLM native (#25542) v0.11.1rc3 Yihua Cheng 2025-10-24 17:23:53 -07:00
  • 269c4db0a4 [Misc][DP] Guard mxfp4 implementation selection (#27484) Varun Sundar Rabindranath 2025-10-24 19:29:24 -04:00
  • 63bd5018a1 Revert "[Log] Optimize Startup Log (#26740)" revert-26740-wentao-optimize-startup-log-2 Wentao Ye 2025-10-24 19:27:07 -04:00
  • 52efc34ebf [Log] Optimize Startup Log (#26740) Wentao Ye 2025-10-24 19:27:04 -04:00
  • d95d0f4b98 [Distributed] Basic set of configuration for large EP deployment on GB200 (#27328) Pengchao Wang 2025-10-24 14:16:44 -07:00
  • 0402428200 [Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (#27455) Lehua Ding 2025-10-25 04:45:36 +08:00
  • 17af6aa0da [Document] Add ms-swift library to rlhf.md (#27469) jinghanhu 2025-10-25 04:31:50 +08:00
  • fc168c33f3 [CI/Build] Fix test_torch_utils in AMD CI (#27317) Zhewen Li 2025-10-24 12:26:00 -07:00
  • acc78aeb88 [Bugfix] Fix interns1-vit qk norm code path (#27480) Isotr0py 2025-10-25 01:43:45 +08:00
  • 0f67d4d962 [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (#26397) Ming Yang 2025-10-24 10:24:08 -07:00
  • 7e1d697b56 [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (#27366) kourosh hakhamaneshi 2025-10-24 10:08:05 -07:00
  • 699d62e6cf [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (#27297) Chendi.Xue 2025-10-24 12:01:41 -05:00
  • cd390b609d [compile] Turn standalone_compile back on (#27460) Richard Zou 2025-10-24 09:30:27 -07:00
  • 2080b05099 [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472) Fadi Arafeh 2025-10-24 16:57:48 +01:00
  • 6454afec90 [Doc] Fix minor issues in docs/design/metrics.md (#27436) Lifans 2025-10-24 05:40:54 -07:00
  • 41a62564a7 Fix test named tool use (#27458) Chauncey 2025-10-24 20:27:45 +08:00
  • 284cc92275 [MISC] cudagraph_capture_sizes related improvements (#26016) fhl2000 2025-10-24 20:11:05 +08:00
  • 435be10db9 Fix AArch64 CPU Docker pipeline (#27331) ioana ghiban 2025-10-24 14:11:01 +02:00
  • b7030d962b [Benchmark] Enable benchmark to run with encoding_format="bytes" (#27467) Cyrus Leung 2025-10-24 19:16:50 +08:00
  • 3567816932 [Refactor] move tool parsing logic from protocol.py to the tool parser (#27383) Chauncey 2025-10-24 17:53:23 +08:00
  • e0ef8a2920 [BugFix] Fix torchrun DP with LLM class (#27395) 22quinn 2025-10-24 01:11:37 -07:00
  • 42efe609ba [MM][Bugfix] Replace PatchEmbed's conv3d to linear layer (#27418) Isotr0py 2025-10-24 15:32:47 +08:00
  • 88d3141ec6 [Docs] remove v1 column for embedding models (#27446) Yu Jiaqi 2025-10-24 14:55:03 +08:00
  • 09a6a49eaf [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (#27443) Rui Qiao 2025-10-23 23:53:09 -07:00