ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
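For reference, a minimal sketch of enabling the feature from the offline API; the model name is illustrative, and the flag is `enable_prefix_caching` as exposed through EngineArgs:
```python
from vllm import LLM, SamplingParams

# Requests that share a common prompt prefix (e.g. a long system prompt)
# can reuse cached KV blocks when prefix caching is enabled.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

system = "You are a helpful assistant. " * 50  # long shared prefix
prompts = [
    system + "Question: what is 2 + 2?",
    system + "Question: name a prime number.",
]
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```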
2024-03-02 00:50:01 -08:00
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com >
Co-authored-by: alexm <alexm@neuralmagic.com >
2024-03-01 12:47:51 -08:00
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de >
Co-authored-by: simon-mo <simon.mo@hey.com >
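A hedged sketch of exercising guided decoding through the OpenAI-compatible server with the standard `openai` client; the `guided_choice` extra-body field name is an assumption about how the server exposes the constraint:
```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Constrain the output to one of a fixed set of choices (assumed field name).
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Is Python statically typed? Answer yes or no."}],
    extra_body={"guided_choice": ["yes", "no"]},
)
print(completion.choices[0].message.content)
```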
2024-02-29 22:13:08 +00:00
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-02-27 01:14:31 -08:00
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
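A small sketch using the standard OpenAI `logit_bias` field (a map from token id to an additive bias in [-100, 100]) against a vLLM server; the token id below is a placeholder, not a meaningful vocabulary entry:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Discourage a specific token id; "12345" is illustrative only.
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Write one sentence about the weather."}],
    logit_bias={"12345": -100},
)
print(completion.choices[0].message.content)
```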
2024-02-27 11:51:53 +08:00
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
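A minimal sketch requesting per-token log probabilities on the chat endpoint, assuming the server follows the OpenAI `logprobs`/`top_logprobs` request shape:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Name a color."}],
    logprobs=True,
    top_logprobs=3,
    max_tokens=5,
)
# Each returned token carries its log probability and the top alternatives.
for token_info in completion.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
```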
2024-02-26 10:39:34 +08:00
ef978fe411
Port metrics from aioprometheus to prometheus_client ( #2730 )
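A short sketch of scraping the server's Prometheus endpoint, assuming the default `/metrics` path on the API server port; series such as `vllm:prompt_tokens_total` are exposed there:
```python
import urllib.request

# The server exposes prometheus_client-formatted metrics on /metrics.
with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    text = resp.read().decode()

# Print only the vllm-prefixed series.
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```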
2024-02-25 11:54:00 -08:00
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens ( #2802 )
2024-02-22 14:00:12 -08:00
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
7d2dcce175
Support per-request seed ( #2514 )
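A sketch of the per-request seed from the offline API, assuming the knob is `SamplingParams(seed=...)`; a `seed` request field on the OpenAI endpoints maps to the same behavior:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# The same seed should reproduce the same sampled continuation for a prompt,
# even with temperature > 0.
params = SamplingParams(temperature=0.8, seed=42, max_tokens=32)
first = llm.generate(["Tell me a short joke."], params)[0].outputs[0].text
second = llm.generate(["Tell me a short joke."], params)[0].outputs[0].text
print(first == second)  # expected: True
```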
2024-02-21 11:47:00 -08:00
017d9f1515
Add metrics to RequestOutput ( #2876 )
2024-02-20 21:55:57 -08:00
63e2a6419d
[FIX] Fix beam search test ( #2930 )
2024-02-20 14:37:39 -08:00
e433c115bc
Fix vllm:prompt_tokens_total metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
8f36444c4f
Multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate values when a user queries `/models`: one for the base served model and one for each of the specified LoRA modules (see the query sketch below). In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
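For completeness, a sketch (not part of the PR) of querying the server above with the standard `openai` client, first listing the served models and then targeting one of the LoRA names:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /models should list the base model plus sql-lora and sql-lora2.
for model in client.models.list().data:
    print(model.id)

# Target a specific LoRA by using its name as the model field.
completion = client.completions.create(
    model="sql-lora",
    prompt="SELECT count(*) FROM",
    max_tokens=32,
)
print(completion.choices[0].text)
```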
2024-02-17 12:00:48 -08:00
d7afab6d3a
[BugFix] Fix GC bug for LLM class ( #2882 )
2024-02-14 22:17:44 -08:00
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
56f738ae9b
[ROCm] Fix failing unit tests for some kernels ( #2498 )
2024-02-05 14:25:36 -08:00
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-02-01 15:46:39 -08:00
d0d93b92b1
Add unit test for Mixtral MoE layer ( #2677 )
2024-01-31 14:34:17 -08:00
89efcf1ce5
[Minor] Fix test_cache.py CI test failure ( #2684 )
2024-01-31 10:12:11 -08:00
4f65af0e25
Add swap_blocks unit tests ( #2616 )
2024-01-30 09:30:50 -08:00
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com >
2024-01-29 21:19:48 -08:00
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
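A hedged sketch of turning this on from the offline API; the `kv_cache_dtype="fp8_e5m2"` value is an assumption taken from the feature name:
```python
from vllm import LLM, SamplingParams

# Store the KV cache in 8-bit E5M2 floating point to roughly halve its
# memory footprint versus fp16, at some accuracy cost.
llm = LLM(model="meta-llama/Llama-2-7b-hf", kv_cache_dtype="fp8_e5m2")
print(llm.generate(["Hello,"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```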
2024-01-28 16:43:54 -08:00
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
3a7dd7e367
Support Batch Completion in Server ( #2529 )
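A small sketch of a batched request to the completions endpoint, passing a list of prompts in one call (the standard OpenAI prompt-as-list form):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# One request, several prompts; the server returns one choice per prompt.
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt=["The capital of France is", "The capital of Japan is"],
    max_tokens=8,
)
for choice in completion.choices:
    print(choice.index, choice.text)
```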
2024-01-24 17:11:07 -08:00
3209b49033
[Bugfix] fix crash if max_tokens=None ( #2570 )
2024-01-23 22:38:55 -08:00
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com >
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com >
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
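A minimal offline sketch, assuming the `LoRARequest(name, id, path)` interface introduced with this feature; the adapter path is illustrative:
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora reserves space for LoRA weights alongside the base model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# The path below is a placeholder; point it at a downloaded LoRA adapter.
lora = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")

outputs = llm.generate(
    ["Write a SQL query that counts users per country."],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```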
2024-01-23 15:26:37 -08:00
7a0b011dd5
Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py ( #2553 )
2024-01-22 14:47:25 -08:00
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model ( #2424 )
2024-01-21 16:31:47 -08:00
ef9b636e2d
Simplify broadcast logic for control messages ( #2501 )
2024-01-19 11:23:30 -08:00
dd7e8f5f64
Refactor completion API for readability ( #2499 )
2024-01-18 16:45:14 -08:00
d10f8e1d43
[Experimental] Prefix Caching Support ( #1669 )
...
Co-authored-by: DouHappy <2278958187@qq.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-01-17 16:32:10 -08:00
14cc317ba4
OpenAI Server refactoring ( #2360 )
2024-01-16 21:33:14 -08:00
e1957c6ebd
Add StableLM3B model ( #2372 )
2024-01-16 20:32:40 -08:00
6e01e8c1c8
[CI] Add Buildkite ( #2355 )
2024-01-14 12:37:58 -08:00
218dc2ccda
Aligning top_p and top_k Sampling ( #1885 )
...
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
2024-01-12 22:51:03 +01:00
79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler ( #2336 )
2024-01-09 15:38:41 -08:00
941767127c
Revert the changes in test_cache ( #2335 )
2024-01-03 17:32:05 -08:00
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
358c328d69
[BUGFIX] Fix communication test ( #2285 )
2023-12-27 17:18:11 -05:00
4aaafdd289
[BUGFIX] Fix the path of test prompts ( #2273 )
2023-12-26 10:37:21 -08:00