[CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a")
 
 To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
 
 ```console
-$ vllm serve facebook/opt-13b \
-$ --tensor-parallel-size 4
+vllm serve facebook/opt-13b \
+--tensor-parallel-size 4
 ```
 
 You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
 
 ```console
-$ vllm serve gpt2 \
-$ --tensor-parallel-size 4 \
-$ --pipeline-parallel-size 2
+vllm serve gpt2 \
+--tensor-parallel-size 4 \
+--pipeline-parallel-size 2
 ```
 
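The two flags in the hunk above multiply: a deployment needs tensor-parallel degree times pipeline-parallel degree GPUs in total. A tiny sketch of that arithmetic (the function name is illustrative, not part of vLLM):

```python
def required_gpus(tensor_parallel_size: int, pipeline_parallel_size: int = 1) -> int:
    """Total GPUs needed: tensor-parallel degree times pipeline-parallel degree."""
    return tensor_parallel_size * pipeline_parallel_size

# The examples above: TP=4 needs 4 GPUs; TP=4 with PP=2 needs 8 GPUs.
print(required_gpus(4), required_gpus(4, 2))  # 4 8
```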
 ## Running vLLM on multiple nodes
@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have
 
 Pick a node as the head node, and run the following command:
 
 ```console
-$ bash run_cluster.sh \
-$ vllm/vllm-openai \
-$ ip_of_head_node \
-$ --head \
-$ /path/to/the/huggingface/home/in/this/node
+bash run_cluster.sh \
+vllm/vllm-openai \
+ip_of_head_node \
+--head \
+/path/to/the/huggingface/home/in/this/node
 ```
 
 On the rest of the worker nodes, run the following command:
 
 ```console
-$ bash run_cluster.sh \
-$ vllm/vllm-openai \
-$ ip_of_head_node \
-$ --worker \
-$ /path/to/the/huggingface/home/in/this/node
+bash run_cluster.sh \
+vllm/vllm-openai \
+ip_of_head_node \
+--worker \
+/path/to/the/huggingface/home/in/this/node
 ```
 
 Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster together: if any of these shells disconnect, the cluster is terminated. Also note that the argument `ip_of_head_node` should be the IP address of the head node, which must be reachable from all worker nodes. A common mistake is to use the IP address of a worker node instead, which is not correct.
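Since `ip_of_head_node` must be reachable from every worker, a quick connectivity probe from each worker can save debugging time before launching `run_cluster.sh`. A minimal standard-library sketch (the Ray head's default GCS port of 6379 is an assumption to verify for your setup):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From a worker node, probe the head node before starting the worker container, e.g.:
# is_reachable("192.168.1.10", 6379)
```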
@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
 
 After that, on any node, you can use vLLM as usual, just as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
 
 ```console
-$ vllm serve /path/to/the/model/in/the/container \
-$ --tensor-parallel-size 8 \
-$ --pipeline-parallel-size 2
+vllm serve /path/to/the/model/in/the/container \
+--tensor-parallel-size 8 \
+--pipeline-parallel-size 2
 ```
 
 You can also use tensor parallelism without pipeline parallelism; just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
 
 ```console
-$ vllm serve /path/to/the/model/in/the/container \
-$ --tensor-parallel-size 16
+vllm serve /path/to/the/model/in/the/container \
+--tensor-parallel-size 16
 ```
 
 To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards such as InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up these flags. One way to confirm that InfiniBand is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL is using a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL is using InfiniBand with GPU-Direct RDMA, which is efficient.
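The log check described above can be automated. A best-effort sketch that scans `NCCL_DEBUG=TRACE` output for the two transport markers mentioned in the text (the return labels are illustrative):

```python
def nccl_transport(log_lines):
    """Classify the NCCL cross-node transport from debug log lines."""
    for line in log_lines:
        if "via NET/IB/GDRDMA" in line:
            return "infiniband-gdrdma"  # GPU-Direct RDMA: the efficient path
        if "via NET/Socket" in line:
            return "tcp-socket"  # raw TCP: slow for cross-node tensor parallel
    return "unknown"

print(nccl_transport(["... [send] via NET/Socket ..."]))  # tcp-socket
```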
@@ -7,7 +7,7 @@ vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain
 
 To install LangChain, run
 
 ```console
-$ pip install langchain langchain_community -q
+pip install langchain langchain_community -q
 ```
 
 To run inference on a single GPU or multiple GPUs, use the `VLLM` class from `langchain`.
@@ -7,7 +7,7 @@ vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index
 
 To install LlamaIndex, run
 
 ```console
-$ pip install llama-index-llms-vllm -q
+pip install llama-index-llms-vllm -q
 ```
 
 To run inference on a single GPU or multiple GPUs, use the `Vllm` class from `llamaindex`.
@@ -7,7 +7,7 @@ OpenAI compatible API server.
 
 You can start the server using Python, or using [Docker](#deployment-docker):
 
 ```console
-$ vllm serve unsloth/Llama-3.2-1B-Instruct
+vllm serve unsloth/Llama-3.2-1B-Instruct
 ```
 
 Then query the endpoint to get the latest metrics from the server:
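The metrics endpoint returns Prometheus text format. As a rough sketch of what consuming it looks like (this ignores labels and other details of the real format, and the sample metric name is an assumption about vLLM's naming), each non-comment line is a metric name followed by a value:

```python
def parse_metrics(text: str) -> dict:
    """Parse simple Prometheus text-format lines into {name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = (
    "# HELP vllm:num_requests_running Number of running requests.\n"
    "# TYPE vllm:num_requests_running gauge\n"
    "vllm:num_requests_running 1.0\n"
)
print(parse_metrics(sample))  # {'vllm:num_requests_running': 1.0}
```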
@@ -303,6 +303,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
 ```
 
 Then, you can use the OpenAI client as follows:
+
 ```python
 from openai import OpenAI
 
@@ -64,7 +64,7 @@ Dynamic quantization is also supported via the `quantization` option -- see [her
 
 #### Context length and batch size
 
-You can further reduce memory usage by limit the context length of the model (`max_model_len` option)
+You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
 and the maximum batch size (`max_num_seqs` option).
 
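Both options bound the KV cache, which grows linearly in context length and in batch size. A back-of-the-envelope sketch of why lowering them helps (the model dimensions below are hypothetical, and real allocators add overhead on top of this bound):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_model_len: int, max_num_seqs: int, dtype_bytes: int = 2) -> int:
    """Upper bound on KV-cache size: K and V tensors (hence the factor 2)
    per layer, per token, for a full batch of full-length sequences."""
    return (2 * num_layers * num_kv_heads * head_dim * dtype_bytes
            * max_model_len * max_num_seqs)

# Hypothetical 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16.
full = kv_cache_bytes(32, 32, 128, max_model_len=4096, max_num_seqs=256)
small = kv_cache_bytes(32, 32, 128, max_model_len=1024, max_num_seqs=32)
print(f"{full / 2**30:.0f} GiB vs {small / 2**30:.0f} GiB")  # 512 GiB vs 16 GiB
```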
 ```python
@@ -5,11 +5,13 @@
 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
 
 You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker):
+
 ```bash
 vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
 ```
 
 To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client.
+
 ```python
 from openai import OpenAI
 client = OpenAI(
@@ -50,6 +52,7 @@ In addition, we have the following custom APIs:
 - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
 
 (chat-template)=
+
 ## Chat Template
 
 In order for the language model to support chat protocol, vLLM requires the model to include
@@ -71,6 +74,7 @@ vLLM community provides a set of chat templates for popular models. You can find
 
 With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
 both a `type` and a `text` field. An example is provided below:
+
 ```python
 completion = client.chat.completions.create(
     model="NousResearch/Meta-Llama-3-8B-Instruct",
@@ -80,7 +84,7 @@ completion = client.chat.completions.create(
 )
 ```
 
-Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
+Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
 `meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
 request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
 *"Detected the chat template content format to be..."*, and internally converts incoming requests to match
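The two content conventions contrasted in the hunk above can be sketched as plain payloads: most templates expect a string, while OpenAI-schema models expect a list of typed parts. A minimal illustration (the flattening helper is illustrative, not vLLM's internal conversion):

```python
# Classic form: `content` is a plain string.
string_form = {"role": "user", "content": "Describe this image."}

# OpenAI-schema form: `content` is a list of parts, each with a `type` field.
typed_form = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
    ],
}

# A template that expects strings can be fed by flattening the typed text parts:
flattened = "\n".join(p["text"] for p in typed_form["content"] if p["type"] == "text")
print(flattened)  # Describe this image.
```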
@@ -115,12 +119,12 @@ completion = client.chat.completions.create(
 ## Extra HTTP Headers
 
 Only `X-Request-Id` HTTP request header is supported for now. It can be enabled
-with `--enable-request-id-headers`.
+with `--enable-request-id-headers`.
 
 > Note that enablement of the headers can impact performance significantly at high QPS
 > rates. We recommend implementing HTTP headers at the router level (e.g. via Istio),
 > rather than within the vLLM layer for this reason.
-> See https://github.com/vllm-project/vllm/pull/11529 for more details.
+> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
 
 ```python
 completion = client.chat.completions.create(
@@ -147,6 +151,7 @@ print(completion._request_id)
 ## CLI Reference
 
 (vllm-serve)=
+
 ### `vllm serve`
 
 The `vllm serve` command is used to launch the OpenAI-compatible server.
@@ -175,7 +180,7 @@ uvicorn-log-level: "info"
 
 To use the above config file:
 
 ```bash
-$ vllm serve SOME_MODEL --config config.yaml
+vllm serve SOME_MODEL --config config.yaml
 ```
 
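Per the docs, the order of priorities is `command line > config file values > defaults`. That merge can be sketched as successive dict updates, where later sources win (the keys below are illustrative):

```python
def effective_config(defaults: dict, config_file: dict, cli_args: dict) -> dict:
    """Later sources win: command line > config file values > defaults."""
    merged = dict(defaults)
    merged.update(config_file)  # config file values override defaults
    merged.update(cli_args)     # command-line flags override everything
    return merged

cfg = effective_config(
    defaults={"host": "0.0.0.0", "port": 8000},
    config_file={"port": 9999},   # e.g. from config.yaml
    cli_args={"port": 8080},      # e.g. --port 8080
)
print(cfg)  # {'host': '0.0.0.0', 'port': 8080}
```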
 ```{note}
@@ -186,6 +191,7 @@ The order of priorities is `command line > config file values > defaults`.
 ## API Reference
 
 (completions-api)=
+
 ### Completions API
 
 Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
@@ -212,6 +218,7 @@ The following extra parameters are supported:
 ```
 
 (chat-api)=
+
 ### Chat API
 
 Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
@@ -243,6 +250,7 @@ The following extra parameters are supported:
 ```
 
 (embeddings-api)=
+
 ### Embeddings API
 
 Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
@@ -284,6 +292,7 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
 ```
 
 (tokenizer-api)=
+
 ### Tokenizer API
 
 Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
@@ -293,6 +302,7 @@ It consists of two endpoints:
 - `/detokenize` corresponds to calling `tokenizer.decode()`.
 
 (pooling-api)=
+
 ### Pooling API
 
 Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
@@ -302,6 +312,7 @@ The input format is the same as [Embeddings API](#embeddings-api), but the outpu
 Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
 
 (score-api)=
+
 ### Score API
 
 Our Score API applies a cross-encoder model to predict scores for sentence pairs.