d619ae2d19
[Doc] Add better clarity for tensorizer usage (#4090)
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-15 13:28:25 -07:00
aceb17cf2d
[Docs] Document that Mixtral 8x22B is supported (#4073)
2024-04-14 14:35:55 -07:00
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer (#3476)
2024-04-13 17:13:01 -07:00
c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup (#3816)
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-11 17:17:21 -07:00
f3d0bf7589
[Doc][Installation] delete python setup.py develop (#3989)
2024-04-11 03:33:02 +00:00
92cd2e2f21
[Doc] Fix getting started to use publicly available model (#3963)
2024-04-10 18:05:52 +00:00
e35397468f
[Doc] Add doc to state our model support policy (#3948)
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-10 17:03:02 +00:00
b4543c8f6b
[Model] Add MiniCPM (#3893)
2024-04-08 18:28:36 +08:00
95baec828f
[Core] Enable out-of-tree model registration (#3871)
2024-04-06 17:11:41 -07:00
d03d64fd2e
[CI/Build] refactor dockerfile & fix pip cache
...
[CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859)
2024-04-04 21:53:16 -07:00
78107fa091
[Doc] Add asynchronous engine arguments to documentation. (#3810)
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-04 21:52:01 -07:00
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
3bec41f41a
[Doc] Fix vLLMEngine Doc Page (#3791)
2024-04-02 09:49:37 -07:00
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend (#3634)
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
9c82a1bec3
[Doc] Update installation doc (#3746)
...
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version (#3746)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-30 16:34:38 -07:00
d8658c8cc1
Usage Stats Collection (#2852)
2024-03-28 22:16:12 -07:00
d6ea427f04
[Model] Add support for Qwen2MoeModel (#3346)
2024-03-28 15:19:59 +00:00
6d9aa00fc4
[Docs] Add Command-R to supported models (#3669)
2024-03-27 15:20:00 -07:00
e24336b5a7
[Model] Add support for DBRX (#3660)
2024-03-27 13:01:46 -07:00
e66b629c04
[Misc] Minor fix in KVCache type (#3652)
2024-03-26 23:14:06 -07:00
76879342a3
[Doc] Add LoRA support (#3649)
2024-03-27 02:06:46 +00:00
01bfb22b41
[CI] Try introducing isort. (#3495)
2024-03-25 07:59:47 -07:00
42bc386129
[CI/Build] respect the common environment variable MAX_JOBS (#3600)
2024-03-24 17:04:00 -07:00
4c07dd28c0
[🚀 Ready to be merged] Added support for Jais models (#3183)
2024-03-21 09:45:24 +00:00
63e8b28a99
[Doc] minor fix of spelling in amd-installation.rst (#3506)
2024-03-19 20:32:30 +00:00
2a60c9bd17
[Doc] minor fix to neuron-installation.rst (#3505)
2024-03-19 13:21:35 -07:00
ef65dcfa6f
[Doc] Add docs about OpenAI compatible server (#3288)
2024-03-18 22:05:34 -07:00
8fa7357f2d
Fix documentation error in the value and v_vec illustration (#3421)
2024-03-15 16:06:09 -07:00
b0925b3878
docs: Add BentoML deployment doc (#3336)
...
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
2024-03-12 10:34:30 -07:00
4c922709b6
Add distributed model executor abstraction (#3191)
2024-03-11 11:03:45 -07:00
657061fdce
[docs] Add LoRA support information for models (#3299)
2024-03-11 00:54:51 -07:00
99c3cfb83c
[Docs] Fix Unmocked Imports (#3275)
2024-03-08 09:58:01 -08:00
27a7b070db
Add document for vLLM paged attention kernel. (#2978)
2024-03-04 09:23:34 -08:00
d0fae88114
[DOC] Add setup document to support Neuron backend (#2777)
2024-03-04 01:03:51 +00:00
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
49d849b3ab
docs: Add tutorial on deploying vLLM model with KServe (#2586)
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-03-01 11:04:14 -08:00
a8683102cc
Multi-LoRA documentation fix (#3064)
2024-02-27 21:26:15 -08:00
8b430d7dea
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046)
2024-02-26 20:23:50 -08:00
48a8f4a7fd
Support Orion model (#2539)
...
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
ef978fe411
Port metrics from aioprometheus to prometheus_client (#2730)
2024-02-25 11:54:00 -08:00
a9c8212895
[FIX] Add Gemma model to the doc (#2966)
2024-02-21 09:46:15 -08:00
ab3a5a8259
Support OLMo models. (#2832)
2024-02-18 21:05:15 -08:00
8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list three separate models if the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in `EngineArgs`.
No work has been done here to scope client permissions to specific models.
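For illustration, a minimal sketch of exercising that setup, assuming the server above is running locally on vLLM's default port 8000 (the prompt and token budget here are arbitrary placeholders):

```terminal
$ # List the served models; the response should include the base model plus sql-lora and sql-lora2
$ curl http://localhost:8000/v1/models
$ # Route a completion to one of the LoRA adapters by passing its name as the model
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "sql-lora", "prompt": "Write a SQL query that lists all users.", "max_tokens": 32}'
```

Since `sql-lora` and `sql-lora2` map to the same adapter here, either name should produce the same behavior.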
2024-02-17 12:00:48 -08:00
317b29de0f
Remove Yi model definition; please use LlamaForCausalLM instead (#2854)
...
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
f964493274
[CI] Ensure documentation build is checked in CI (#2842)
2024-02-12 22:53:07 -08:00
4ca2c358b1
Add documentation section about LoRA (#2834)
2024-02-12 17:24:45 +01:00
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768)
2024-02-10 23:14:37 -08:00
931746bc6d
Add documentation on how to do incremental builds (#2796)
2024-02-07 14:42:02 -08:00
5ed704ec8c
docs: fix LangChain (#2736)
2024-02-03 18:17:55 -08:00
cd9e60c76c
Add InternLM2 (#2666)
2024-02-01 09:27:40 -08:00