264017a2bf
[ROCm] Include gfx908 as supported ( #2792 )
2024-02-19 17:58:59 -08:00
e433c115bc
Fix vllm:prompt_tokens_total metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
86fd8bb0ac
Add warning to prevent changes to benchmark API server ( #2858 )
2024-02-18 21:36:19 -08:00
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner ( #2905 )
2024-02-18 14:39:00 -08:00
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
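A hedged sketch of how the new argument might be used from the Python API (the `code_revision` keyword and the model name are assumptions, mirroring the `--code-revision` CLI flag):
```python
# Sketch: pin the revision of remote modeling code fetched from the Hub.
# Assumes the option is exposed as `code_revision` on LLM/EngineArgs.
from vllm import LLM

llm = LLM(
    model="mosaicml/mpt-7b",   # hypothetical model that ships remote code
    trust_remote_code=True,
    code_revision="main",      # pin the *code* (not weights) to a branch/tag/commit
)
```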
2024-02-17 22:36:53 -08:00
8f36444c4f
Multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
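A minimal sketch of querying such a server, assuming the OpenAI-compatible routes are exposed under `/v1` on the default port 8000:
```python
# Sketch: list the served models and send a completion to one LoRA module.
# Assumes the server above is running locally on the default port 8000.
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])  # base model, sql-lora, sql-lora2

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "sql-lora", "prompt": "SELECT ", "max_tokens": 16},
)
print(resp.json()["choices"][0]["text"])
```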
2024-02-17 12:00:48 -08:00
185b2c29e2
Defensively copy sampling_params ( #2881 )
...
If the SamplingParams object passed to LLMEngine.add_request() is mutated after the call returns, it could affect the async sampling process for that request.
Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
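A minimal sketch of the defensive-copy pattern (the names below are illustrative, not the exact vLLM internals):
```python
# Sketch: copy caller-supplied parameters at the API boundary so that
# later mutation by the caller cannot race the async sampling loop.
import copy
from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 16

_pending: dict[str, SamplingParams] = {}

def add_request(request_id: str, sampling_params: SamplingParams) -> None:
    # Deep-copy before storing; the engine now owns an independent object.
    _pending[request_id] = copy.deepcopy(sampling_params)

params = SamplingParams(temperature=0.7)
add_request("req-0", params)
params.temperature = 0.0  # safe: the queued request is unaffected
```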
2024-02-17 11:18:04 -08:00
5f08050d8d
Bump up to v0.3.1 ( #2887 )
v0.3.1
2024-02-16 15:05:18 -08:00
64da65b322
Prefix Caching - fix T4 Triton error ( #2517 )
2024-02-16 14:17:55 -08:00
5255d99dc5
[ROCm] Dockerfile fix for flash-attention build ( #2885 )
2024-02-15 10:22:39 -08:00
4f2ad11135
Fix DeciLM ( #2883 )
2024-02-14 22:29:57 -08:00
d7afab6d3a
[BugFix] Fix GC bug for LLM class ( #2882 )
2024-02-14 22:17:44 -08:00
31348dff03
Align LoRA code between Mistral and Mixtral ( fixes #2875 ) ( #2880 )
...
* Fix `AttributeError: MixtralModel object has no attribute org_vocab_size`.
* Make LoRA logic for Mistral and Mixtral the same
---------
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-02-15 01:00:43 +01:00
25e86b6a61
Don't use CuPy NCCL for AMD backends ( #2855 )
2024-02-14 12:30:44 -08:00
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM ( #2867 )
2024-02-14 12:30:24 -08:00
87069ccf68
Fix Docker Python version ( #2845 )
2024-02-14 10:17:57 -08:00
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes ( #2863 )
2024-02-13 19:52:34 -08:00
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 17:12:05 -08:00
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
317b29de0f
Remove Yi model definition; use LlamaForCausalLM instead ( #2854 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 14:22:22 -08:00
a463c333dd
Use CuPy for CUDA graphs ( #2811 )
2024-02-13 11:32:06 -08:00
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a .
2024-02-13 09:24:59 -08:00
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
f964493274
[CI] Ensure documentation build is checked in CI ( #2842 )
2024-02-12 22:53:07 -08:00
a4211a4dc3
Serving Benchmark Refactoring ( #2433 )
2024-02-12 22:53:00 -08:00
563836496a
Refactor 2 AWQ GEMM kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net >
2024-02-12 11:02:17 -08:00
4ca2c358b1
Add documentation section about LoRA ( #2834 )
2024-02-12 17:24:45 +01:00
0580aab02f
[ROCm] Support Radeon™ 7900 series (gfx1100) without using flash-attention ( #2768 )
2024-02-10 23:14:37 -08:00
3711811b1d
Disable custom all reduce by default ( #2808 )
2024-02-08 09:58:03 -08:00
65b89d16ee
[Ray] Compiled DAG integration off by default ( #2471 )
2024-02-08 09:57:25 -08:00
931746bc6d
Add documentation on how to do incremental builds ( #2796 )
2024-02-07 14:42:02 -08:00
c81dddb45c
[ROCm] Fix build problem resulting from previous commit related to FP8 KV-cache support ( #2790 )
2024-02-06 22:36:59 -08:00
fe6d09ae61
[Minor] More fixes for the test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
ed70c70ea3
modelscope: fix issue when the model parameter is not a model ID but a path to the model ( #2489 )
2024-02-06 09:57:15 -08:00
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
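For reference, an unfused PyTorch version of what the kernel computes (a sketch; shapes and renormalization details are illustrative):
```python
# Sketch: unfused reference for MoE routing that the kernel fuses into one pass.
# For each token: softmax over expert logits, then keep the top-k experts.
import torch

def topk_softmax_reference(router_logits: torch.Tensor, k: int):
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, k, dim=-1)
    return topk_weights, topk_ids

logits = torch.randn(4, 8)  # 4 tokens routed over 8 experts
weights, ids = topk_softmax_reference(logits, k=2)
```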
2024-02-05 17:38:02 -08:00
2ccee3def6
[ROCm] Fix up arch checks for ROCm ( #2627 )
2024-02-05 14:59:09 -08:00
b92adec8e8
Set local logging level via env variable ( #2774 )
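A sketch of the intended usage, assuming the variable is named `VLLM_LOGGING_LEVEL` (check the PR for the exact name):
```python
# Sketch: raise vLLM's log verbosity for this process only.
# Assumes the environment variable is named VLLM_LOGGING_LEVEL.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # must be set before importing vllm

from vllm import LLM  # the logging level is picked up at import time
```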
2024-02-05 14:26:50 -08:00
56f738ae9b
[ROCm] Fix failing unit tests for some kernels ( #2498 )
2024-02-05 14:25:36 -08:00
72d3a30c63
[Minor] Fix benchmark_latency script ( #2765 )
2024-02-05 12:45:37 -08:00
c9b45adeeb
Require triton >= 2.1.0 ( #2746 )
...
Co-authored-by: yangrui1 <yangrui@lanjingren.com >
2024-02-04 23:07:36 -08:00
5a6c81b051
Remove EOS tokens from output by default ( #2611 )
2024-02-04 14:32:42 -08:00
51cd22ce56
Set & get the LLM's internal tokenizer instead of the TokenizerGroup ( #2741 )
...
Co-authored-by: shujunhua1 <shujunhua1@jd.com >
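A hedged sketch of the resulting API (method names assumed from the title):
```python
# Sketch: read and replace the LLM's underlying Hugging Face tokenizer
# directly, rather than going through the internal TokenizerGroup wrapper.
from transformers import AutoTokenizer
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
tokenizer = llm.get_tokenizer()  # plain HF tokenizer, not a TokenizerGroup

# Swap in a customized tokenizer, e.g. one with added special tokens.
llm.set_tokenizer(AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf"))
```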
2024-02-04 14:25:36 -08:00
5ed704ec8c
docs: fix langchain ( #2736 )
2024-02-03 18:17:55 -08:00
4abf6336ec
Add one example to run batch inference distributed on Ray ( #2696 )
2024-02-02 15:41:42 -08:00
0e163fce18
Fix default length_penalty to 1.0 ( #2667 )
2024-02-01 15:59:39 -08:00
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
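A sketch of the pattern (illustrative, not the actual worker code): thread a device argument through instead of hardcoding `"cuda"`:
```python
# Sketch: accept a device at construction time instead of hardcoding "cuda",
# so the same code path can run on CPU, XPU, and other backends.
import torch

class Worker:
    def __init__(self, device: str = "cuda") -> None:
        self.device = torch.device(device)

    def allocate(self, num_tokens: int, hidden_size: int) -> torch.Tensor:
        return torch.empty(num_tokens, hidden_size, device=self.device)
```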
2024-02-01 15:46:39 -08:00
c410f5d020
Use revision when downloading the quantization config file ( #2697 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
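A sketch of the underlying idea using `huggingface_hub` directly (the repo and filename below are hypothetical; this is not vLLM's internal code path):
```python
# Sketch: pass the requested revision through when fetching the quantization
# config, instead of always reading from the repo's default branch.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-AWQ",  # hypothetical quantized model repo
    filename="quantize_config.json",    # hypothetical config filename
    revision="main",                    # the fix: honor the revision argument
)
```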
2024-02-01 15:41:58 -08:00