Commit Graph

787 Commits

Author SHA1 Message Date
Roy
c1c0d00b88 Don't use cupy when enforce_eager=True (#3037) 2024-02-26 17:33:38 -08:00
Roy
d9f726c4d0 [Minor] Remove unused config files (#3039) 2024-02-26 17:25:22 -08:00
d6e4a130b0 [Minor] Remove gather_cached_kv kernel (#3043) 2024-02-26 15:00:54 -08:00
cfc15a1031 Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
70f3e8e3a1 Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
f7c1234990 [Fix] Fissertion on YaRN model len (#2984) 2024-02-23 12:57:48 -08:00
57f044945f Fix nvcc not found in vlm-openai image (#2781) 2024-02-22 14:25:07 -08:00
4caf7044e0 Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
6f32cddf1c Remove Flash Attention in test env (#2982) 2024-02-22 09:58:29 -08:00
c530e2cfe3 [FIX] Fix a bug in initializing Yarn RoPE (#2983) 2024-02-22 01:40:05 -08:00
fd5dcc5c81 Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
93dc5a2870 chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
95529e3253 Use Llama RMSNorm custom op for Gemma (#2974) 2024-02-21 18:28:23 -08:00
Roy
344020c926 Migrate MistralForCausalLM to LlamaForCausalLM (#2868) 2024-02-21 18:25:05 -08:00
5574081c49 Added early stopping to completion APIs (#2939) 2024-02-21 18:24:01 -08:00
d7f396486e Update comment (#2934) 2024-02-21 18:18:37 -08:00
8fbd84bf78 Bump up version to v0.3.2 (#2968)
This version is for more model support. Add support for Gemma models (#2964) and OLMo models (#2832).
v0.3.2
2024-02-21 11:47:25 -08:00
7d2dcce175 Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
dc903e70ac [ROCm] Upgrade transformers to v4.38.0 (#2967) 2024-02-21 09:46:57 -08:00
a9c8212895 [FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
c20ecb6a51 Upgrade transformers to v4.38.0 (#2965) 2024-02-21 09:38:03 -08:00
5253edaacb Add Gemma model (#2964) 2024-02-21 09:34:30 -08:00
017d9f1515 Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
181b27d881 Make vLLM logging formatting optional (#2877) 2024-02-20 14:38:55 -08:00
63e2a6419d [FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
264017a2bf [ROCm] include gfx908 as supported (#2792) 2024-02-19 17:58:59 -08:00
e433c115bc Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
86fd8bb0ac Add warning to prevent changes to benchmark api server (#2858) 2024-02-18 21:36:19 -08:00
ab3a5a8259 Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
a61f0521b8 [Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
537c9755a7 [Minor] Small fix to make distributed init logic in worker looks cleaner (#2905) 2024-02-18 14:39:00 -08:00
786b7f18a5 Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
8f36444c4f multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
185b2c29e2 Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
5f08050d8d Bump up to v0.3.1 (#2887) v0.3.1 2024-02-16 15:05:18 -08:00
64da65b322 Prefix Caching- fix t4 triton error (#2517) 2024-02-16 14:17:55 -08:00
5255d99dc5 [ROCm] Dockerfile fix for flash-attention build (#2885) 2024-02-15 10:22:39 -08:00
4f2ad11135 Fix DeciLM (#2883) 2024-02-14 22:29:57 -08:00
d7afab6d3a [BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
31348dff03 Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880)
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.

* Make LoRA logic for Mistral and Mixtral the same

---------

Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-15 01:00:43 +01:00
25e86b6a61 Don't use cupy NCCL for AMD backends (#2855) 2024-02-14 12:30:44 -08:00
Roy
4efbac6d35 Migrate AquilaForCausalLM to LlamaForCausalLM (#2867) 2024-02-14 12:30:24 -08:00
87069ccf68 Fix docker python version (#2845) 2024-02-14 10:17:57 -08:00
7e45107f51 [Fix] Fix memory profiling when GPU is used by multiple processes (#2863) 2024-02-13 19:52:34 -08:00
0c48b37c31 Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861) 2024-02-13 18:01:15 -08:00
7eacffd951 Migrate InternLMForCausalLM to LlamaForCausalLM (#2860)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
2a543d6efe Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
317b29de0f Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
a463c333dd Use CuPy for CUDA graphs (#2811) 2024-02-13 11:32:06 -08:00