f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
d130b573a0
[Model] Add rope_scaling support for Qwen2 ( #4930 )
2024-05-21 05:22:22 +00:00
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" ( #4897 )
2024-05-20 17:48:32 -07:00
c3af44722c
[Doc] Add documentation to the benchmarking script when running TGI ( #4920 )
2024-05-20 20:16:57 +00:00
1937e29848
[Core] Sharded State Loader download from HF ( #4889 )
2024-05-20 11:46:12 -07:00
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
Allow the dummy load format for fp8; torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-05-20 18:44:25 +00:00
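A minimal sketch of the workaround described above (illustrative, not vLLM's actual loader code): since torch.Tensor.uniform_ has no FP8 kernel, dummy weights can be sampled in a supported dtype first and then cast down.

    import torch

    def make_dummy_fp8_weight(shape, low=-1e-3, high=1e-3):
        # uniform_ works for fp16, so sample there first...
        w = torch.empty(shape, dtype=torch.float16).uniform_(low, high)
        # ...then cast to an FP8 dtype (needs a PyTorch build with float8 support).
        return w.to(torch.float8_e4m3fn)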
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
546a97ef69
[Misc] Allow user to specify port in distributed setting ( #4914 )
2024-05-20 17:45:06 +00:00
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
6287537a0c
[Model] LLaVA model refactor ( #4910 )
2024-05-20 08:11:25 +00:00
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
2e9a2227ec
[LoRA] Support long-context LoRA ( #4787 )
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
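A hedged sketch of the batched rotary embedding idea from the entry above (class and argument names are illustrative, not vLLM's API): keep one cos/sin cache per rope scaling factor in a single buffer and gather per token, so one kernel launch can serve requests whose LoRAs use different context extensions.

    import torch

    class BatchedRotaryEmbedding(torch.nn.Module):
        """One cos/sin cache per scaling factor, indexed per token."""

        def __init__(self, head_dim, max_position, base, scaling_factors):
            super().__init__()
            inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
            caches, offsets, offset = [], [], 0
            for factor in scaling_factors:
                max_len = int(max_position * factor)
                t = torch.arange(max_len, dtype=torch.float32) / factor  # linear rope scaling
                freqs = torch.outer(t, inv_freq)
                caches.append(torch.cat((freqs.cos(), freqs.sin()), dim=-1))
                offsets.append(offset)
                offset += max_len
            # All caches concatenated; per-factor offsets recover the right slice.
            self.register_buffer("cos_sin_cache", torch.cat(caches, dim=0))
            self.register_buffer("offsets", torch.tensor(offsets, dtype=torch.long))

        def forward(self, positions, factor_ids, query, key):
            # positions, factor_ids: [num_tokens]; query/key: [num_tokens, head_dim]
            idx = self.offsets[factor_ids] + positions
            cos, sin = self.cos_sin_cache[idx].chunk(2, dim=-1)

            def rotate(x):
                x1, x2 = x.chunk(2, dim=-1)
                return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

            return rotate(query), rotate(key)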
c0724fc915
[ROCm][Hardware][AMD] Add Navi21 fallback to naive attention when Triton is not used ( #4658 )
2024-05-18 05:09:11 +00:00
86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 ( #4890 )
2024-05-17 12:58:52 -06:00
c5711ef985
[Doc] Update Ray Data distributed offline inference example ( #4871 )
2024-05-17 10:52:11 -07:00
48d5985a08
Sync HuggingFace modifications of Qwen MoE model ( #4774 )
2024-05-17 09:43:19 -07:00
33e0823de5
[Bugfix] Fix RoPE error when loading models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, and LLaVA tests ( #4797 )
2024-05-16 20:58:25 -07:00
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
8e7fb5d43a
Support serving vLLM on Kubernetes with LWS ( #4829 )
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-16 16:37:29 -07:00
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
10fa9eea21
[Misc] Remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
e08188081b
[Core][Distributed] Remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
b5853f9963
[ROCm][AMD][Bugfix] Add a missing Triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00
f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput ( #4848 )
2024-05-16 10:02:56 -07:00
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
dbc0754ddf
[docs] Fix typo in examples filename openi -> openai ( #4864 )
2024-05-17 00:42:17 +09:00
99caa49106
[Kernel] Add bfloat16 support for GPTQ Marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
30e754390c
[Core] Implement sharded state loader ( #4690 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
361c461a12
[Doc] Highlight the fourth meetup in the README ( #4842 )
2024-05-15 11:38:49 -07:00
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig ( #4816 )
2024-05-15 07:22:09 -07:00
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill/decode into a single class and allows slicing it when running the attention backend.
It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
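A minimal sketch of the coalesced-metadata idea from the entry above (field names are illustrative, not vLLM's actual class): one attention-metadata object covers both prefill and decode tokens for a batch, sliceable per phase when the attention backend runs them separately.

    from dataclasses import dataclass
    from typing import Optional
    import torch

    @dataclass
    class AttnMetadata:
        # Prefill tokens are laid out first, decode tokens after them.
        num_prefill_tokens: int
        num_decode_tokens: int
        slot_mapping: torch.Tensor                          # [num_prefill + num_decode]
        subquery_start_loc: Optional[torch.Tensor] = None   # prefill-only field

        @property
        def prefill_slots(self) -> torch.Tensor:
            return self.slot_mapping[: self.num_prefill_tokens]

        @property
        def decode_slots(self) -> torch.Tensor:
            return self.slot_mapping[self.num_prefill_tokens :]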
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
The LoRA 3 & 4 tests seem to hit an illegal memory access failure after this commit:
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5.
2024-05-15 11:52:45 +09:00
29bc01bf3b
Add 4th meetup announcement to README ( #4817 )
2024-05-14 18:33:06 -04:00
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
dc72402b57
[Bugfix][Doc] Fix CI failure in docs ( #4804 )
This PR fixes the CI failure introduced by #4798. The failure originates from duplicate target names in reST and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.
I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
ccb63a8245
[Core][Hash][Automatic Prefix Caching] Accelerate the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00