[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)
@@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7
       gpu: 24GB
     commands:
       - pip install vllm
-      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
+      - vllm serve $MODEL --port 8000
     model:
       format: openai
       type: chat
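For context, once the dstack service is up, the model is reachable through the same OpenAI-compatible API that `vllm serve` exposes locally. A minimal sketch of a client call, assuming a placeholder endpoint URL (the real URL depends on your dstack gateway configuration):

```python
# Minimal sketch: query the vllm serve endpoint deployed via dstack.
# The base_url is a placeholder; substitute the URL your dstack gateway
# exposes for this service.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-dstack-gateway-url>/v1",  # placeholder, not a real endpoint
    api_key="EMPTY",  # no key is required unless --api-key was passed to vllm serve
)

completion = client.chat.completions.create(
    model="NousResearch/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```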
@@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
     $     --tensor-parallel-size 4

 You can also additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model gpt2 \
+    $ vllm serve gpt2 \
     $     --tensor-parallel-size 4 \
     $     --pipeline-parallel-size 2 \
     $     --distributed-executor-backend ray
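The same parallelism setting also applies to offline inference through the Python API. A minimal sketch, assuming 4 GPUs are available on the node:

```python
# Minimal sketch: tensor-parallel inference with the offline LLM API,
# mirroring the --tensor-parallel-size flag used by vllm serve.
# Assumes 4 GPUs are visible on this node.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```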
@@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat

 You can start the server using Python, or using [Docker](deploying_with_docker.rst):
 ```bash
-python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```

 To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
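As a quick check that the renamed command behaves identically, the usual OpenAI-client call against the server started above still works (the `api_key` must match the `--api-key` value passed to `vllm serve`):

```python
# Query the OpenAI-compatible server started with `vllm serve` above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # must match the --api-key passed to vllm serve
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```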
@@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
 and all chat requests will error.

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model ... \
-  --chat-template ./path-to-chat-template.jinja
+vllm serve <model> --chat-template ./path-to-chat-template.jinja
 ```

 vLLM community provides a set of chat templates for popular models. You can find them in the examples
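A custom template can be sanity-checked locally before passing it to `vllm serve`. A minimal sketch using the Hugging Face tokenizer's `apply_chat_template`; the model name and template path below are placeholders:

```python
# Minimal sketch: render a custom chat template locally before serving it.
# The model name and template path are placeholders for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")
with open("./path-to-chat-template.jinja") as f:
    tokenizer.chat_template = f.read()

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```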
@@ -110,7 +108,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)

 ```{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
-:prog: -m vllm.entrypoints.openai.api_server
+:prog: vllm serve
 ```

 ## Tool calling in the chat completion API