Commit Graph

2014 Commits

Author SHA1 Message Date
2188a60c7e [Misc] Update GPTQ to use vLLMParameters (#7976) 2024-09-03 17:21:44 -04:00
6d646d08a2 [Core] Optimize Async + Multi-step (#8050) 2024-09-03 18:50:29 +00:00
6e36f4fa6c improve chunked prefill performance
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
2024-09-02 14:20:12 -07:00
e6a26ed037 [SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) 2024-09-01 21:23:29 -07:00
f8d60145b4 [Model] Add Granite model (#7436)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
5b86b19954 [Misc] Optional installation of audio related packages (#8063) 2024-09-01 14:46:57 -07:00
5231f0898e [Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
622f8abff8 [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) 2024-08-30 22:18:50 -07:00
1248e8506a [Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
058344f89a [Frontend]-config-cli-args (#7737)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
f97be32d1d [VLM][Model] TP support for ViTs (#7186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
428dd1445e [Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
4abed65c58 [VLM] Disallow overflowing max_model_len for multimodal models (#7998) 2024-08-29 17:49:04 -07:00
4664ceaad6 support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
6b3421567d [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
3f60f2244e [Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
f205c09854 [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) 2024-08-28 22:18:13 -07:00
ef99a78760 Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00
74d5543ec5 [VLM][Core] Fix exceptions on ragged NestedTensors (#7974) 2024-08-29 03:24:31 +00:00
a7f65c2be9 [torch.compile] remove reset (#7975) 2024-08-28 17:32:26 -07:00
ce6bf3a2cf [torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
fdd9daafa3 [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
e5697d161c [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
b98cc28f91 [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
e3580537a4 [Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
51f86bf487 [mypy][CI/Build] Fix mypy errors (#7929) 2024-08-27 23:47:44 -07:00
fab5f53e2d [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) 2024-08-28 01:53:56 +00:00
5340a2dccf [Model] Add multi-image input support for LLaVA-Next offline inference (#7230) 2024-08-28 07:09:02 +08:00
fc911880cc [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
9db642138b [CI/Build][VLM] Cleanup multiple images inputs model test (#7897) 2024-08-27 15:28:30 +00:00
6fc4e6e07a [Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
64cc644425 [core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
39178c7fbc [Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
2eedede875 [Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
665304092d [Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
2deb029d11 [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
029c71de11 [CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00
0b769992ec [Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
1856aff4d6 [Spec Decoding] Streamline batch expansion tensor manipulation (#7851) 2024-08-25 15:45:14 -07:00
2059b8d9ca [Misc] Remove snapshot_download usage in InternVL2 test (#7835) 2024-08-25 15:53:09 +00:00
8aaf3d5347 [Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) 2024-08-25 11:51:20 +00:00
80162c44b1 [Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) 2024-08-24 18:16:24 -07:00
aab0fcdb63 [ci][test] fix RemoteOpenAIServer (#7838) 2024-08-24 17:31:28 +00:00
ea9fa160e3 [ci][test] exclude model download time in server start time (#7834) 2024-08-24 01:03:27 -07:00
7d9ffa2ae1 [misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
d81abefd2e [Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
8da48e4d95 [Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
9db93de20c [Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
f1df5dbfd6 [Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
e25fee57c2 [BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00