[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)

2024-07-17 15:43:21 +08:00
parent a19e8d3726
commit 5bf35a91e4
23 changed files with 155 additions and 175 deletions
--- a/docs/source/serving/deploying_with_dstack.rst
+++ b/docs/source/serving/deploying_with_dstack.rst
@ -40,7 +40,7 @@ Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7
        gpu: 24GB
    commands:
        - pip install vllm
-        - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
+        - vllm serve $MODEL --port 8000
    model:
        format: openai
        type: chat
--- a/docs/source/serving/distributed_serving.rst
+++ b/docs/source/serving/distributed_serving.rst
@ -35,16 +35,14 @@ To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument wh

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model facebook/opt-13b \
+    $ vllm serve facebook/opt-13b \
    $     --tensor-parallel-size 4

 You can also additionally specify :code:`--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:

 .. code-block:: console

-    $ python -m vllm.entrypoints.openai.api_server \
-    $     --model gpt2 \
+    $ vllm serve gpt2 \
    $     --tensor-parallel-size 4 \
    $     --pipeline-parallel-size 2 \
    $     --distributed-executor-backend ray
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@ -4,7 +4,7 @@ vLLM provides an HTTP server that implements OpenAI's [Completions](https://plat

 You can start the server using Python, or using [Docker](deploying_with_docker.rst):
 ```bash
-python -m vllm.entrypoints.openai.api_server --model NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```

 To call the server, you can use the official OpenAI Python client library, or any other HTTP client.
@ -97,9 +97,7 @@ template, or the template in string form. Without a chat template, the server wi
 and all chat requests will error.

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-  --model ... \
-  --chat-template ./path-to-chat-template.jinja
+vllm serve <model> --chat-template ./path-to-chat-template.jinja
 ```

 vLLM community provides a set of chat templates for popular models. You can find them in the examples
@ -110,7 +108,7 @@ directory [here](https://github.com/vllm-project/vllm/tree/main/examples/)
 ```{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
-:prog: -m vllm.entrypoints.openai.api_server
+:prog: vllm serve
 ```

 ## Tool calling in the chat completion API