2445 Commits

SHA1 Message Date
8da48e4d95 [Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
6885fde317 [Bugfix] Fix run_batch logger (#7640) 2024-08-23 13:58:26 -07:00
9db93de20c [Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
09c7792610 Bump version to v0.5.5 (#7823) v0.5.5 2024-08-23 11:35:33 -07:00
f1df5dbfd6 [Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
35ee2ad6b9 [github][misc] promote asking llm first (#7809) 2024-08-23 09:38:50 -07:00
e25fee57c2 [BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
faeddb565d [misc] Add Torch profiler support for CPU-only devices (#7806) 2024-08-23 05:46:25 +00:00
fc5ebbd1d3 [Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712) 2024-08-22 20:06:54 -07:00
c01a6cb231 [Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
b903e1ba7f [Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
a152246428 [Misc] fix typo in triton import warning (#7794) 2024-08-22 13:51:23 -07:00
666ad0aa16 [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args (#7705)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-22 20:10:55 +00:00
15310b5101 [Bugfix] Use LoadFormat values for vllm serve --load-format (#7784) 2024-08-22 11:37:08 -07:00
57792ed469 [Doc] Fix incorrect docs from #7615 (#7788) 2024-08-22 10:02:06 -07:00
d3b5b98021 [Misc] Enhance prefix-caching benchmark tool (#6568) 2024-08-22 09:32:02 -07:00
cc0eaf12b1 [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
955b5191c9 [Misc] update fp8 to use vLLMParameter (#7437) 2024-08-22 08:36:18 -04:00
55d63b1211 [Bugfix] Don't build machete on cuda <12.0 (#7757) 2024-08-22 08:28:52 -04:00
4f419c00a6 Fix ShardedStateLoader for vllm fp8 quantization (#7708) 2024-08-22 08:25:04 -04:00
a3fce56b88 [Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
b3856bef7d [Misc] Use torch.compile for GemmaRMSNorm (#7642) 2024-08-22 01:14:13 -07:00
8c6f694a79 [ci] refine dependency for distributed tests (#7776) 2024-08-22 00:54:15 -07:00
eeee1c3b1a [TPU] Avoid initializing TPU runtime in is_tpu (#7763) 2024-08-21 21:31:49 -07:00
aae74ef95c Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
cde9183b40 [Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
df1a21131d [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) 2024-08-22 09:36:24 +08:00
7937009a7e [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
9984605412 [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility (#7477)
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
2024-08-21 16:47:36 -07:00
7eebe8ccaa [distributed][misc] error on same VLLM_HOST_IP setting (#7756) 2024-08-21 16:25:34 -07:00
8678a69ab5 [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
5844017285 [ci] [multi-step] narrow multi-step test dependency paths (#7760) 2024-08-21 15:52:40 -07:00
1ca0d4f86b [Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
dd53c4b023 [misc] Add Torch profiler support (#7451)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-21 15:39:26 -07:00
970dfdc01d [Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
91f4522cbf [multi-step] Raise error if not using async engine (#7703) 2024-08-21 11:49:19 -07:00
1b32e02648 [Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730) 2024-08-21 11:17:48 -07:00
f7e3b0c5aa [Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
d3c002eadc [Bugfix] chat method add_generation_prompt param (#7734) 2024-08-21 17:33:35 +00:00
9b73a2f498 [Spec Decoding] Use target model max length as default for draft model (#7706) 2024-08-22 00:23:22 +08:00
6925cdbeea [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend (#7735) 2024-08-21 16:23:03 +00:00
53328d7536 [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
c75363fbc0 [BugFix] Avoid premature async generator exit and raise all exception variations (#7698) 2024-08-21 11:45:55 -04:00
dd3fa0e430 [Bugfix] Mirror jinja2 in pyproject.toml (#7723) 2024-08-21 13:41:17 +00:00
baaedfdb2d [mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
4506641212 [Doc] Section for Multimodal Language Models (#7719) 2024-08-20 23:24:01 -07:00
12e1c65bc9 [Model] Add AWQ quantization support for InternVL2 model (#7187) 2024-08-20 23:18:57 -07:00
b74a125800 [ci] try to log process using the port to debug the port usage (#7711) 2024-08-20 17:41:12 -07:00
66a9e713a7 [Core] Pipe worker_class_fn argument in Executor (#7707) 2024-08-21 00:37:39 +00:00
9e51b6a626 [ci][test] adjust max wait time for cpu offloading test (#7709) 2024-08-20 17:12:44 -07:00