[CI/Build] Add markdown linter (#11857)

Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Rafael Vasquez
2025-01-12 03:17:13 -05:00
committed by GitHub
parent b25cfab9a0
commit 43f3d9e699
49 changed files with 585 additions and 560 deletions

View File

@@ -35,16 +35,16 @@ output = llm.generate("San Franciso is a")
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
```console
$ vllm serve facebook/opt-13b \
$ --tensor-parallel-size 4
vllm serve facebook/opt-13b \
--tensor-parallel-size 4
```
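The same setting is available for offline inference through the `LLM` class; a minimal sketch (the constructor argument mirrors the CLI flag above):

```python
from vllm import LLM

# Shard the model across 4 GPUs, equivalent to `--tensor-parallel-size 4` above.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```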
You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```console
$ vllm serve gpt2 \
$ --tensor-parallel-size 4 \
$ --pipeline-parallel-size 2
vllm serve gpt2 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
```
## Running vLLM on multiple nodes
@@ -56,21 +56,21 @@ The first step, is to start containers and organize them into a cluster. We have
Pick a node as the head node, and run the following command:
```console
$ bash run_cluster.sh \
$ vllm/vllm-openai \
$ ip_of_head_node \
$ --head \
$ /path/to/the/huggingface/home/in/this/node
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--head \
/path/to/the/huggingface/home/in/this/node
```
On the rest of the worker nodes, run the following command:
```console
$ bash run_cluster.sh \
$ vllm/vllm-openai \
$ ip_of_head_node \
$ --worker \
$ /path/to/the/huggingface/home/in/this/node
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--worker \
/path/to/the/huggingface/home/in/this/node
```
Then you get a Ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster; if any shell disconnects, the cluster is terminated. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which must be reachable from all the worker nodes. A common mistake is to use the IP address of a worker node instead, which is not correct.
@@ -80,16 +80,16 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
After that, on any node, you can use vLLM as usual, just as if all the GPUs were on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
```console
$ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 8 \
$ --pipeline-parallel-size 2
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
```
You can also use tensor parallelism without pipeline parallelism; simply set the tensor parallel size to the total number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
```console
$ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 16
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 16
```
To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm whether Infiniband is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.

View File

@@ -7,7 +7,7 @@ vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain
To install LangChain, run
```console
$ pip install langchain langchain_community -q
pip install langchain langchain_community -q
```
To run inference on a single GPU or multiple GPUs, use the `VLLM` class from `langchain`.
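For example, a minimal sketch using `langchain_community` (the model name and sampling values are illustrative):

```python
from langchain_community.llms import VLLM

# Load a small model locally through vLLM; swap in your own model and GPU settings.
llm = VLLM(model="facebook/opt-125m", max_new_tokens=64, temperature=0.8)
print(llm.invoke("What is the capital of France?"))
```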

View File

@@ -7,7 +7,7 @@ vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index
To install LlamaIndex, run
```console
$ pip install llama-index-llms-vllm -q
pip install llama-index-llms-vllm -q
```
To run inference on a single GPU or multiple GPUs, use the `Vllm` class from `llamaindex`.
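For example, a minimal sketch (the model name is illustrative; see the `llama-index-llms-vllm` package docs for the full argument list):

```python
from llama_index.llms.vllm import Vllm

# Run a small model locally through vLLM's offline engine.
llm = Vllm(model="facebook/opt-125m", max_new_tokens=64)
print(llm.complete("What is the capital of France?"))
```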

View File

@@ -7,7 +7,7 @@ OpenAI compatible API server.
You can start the server using Python, or using [Docker](#deployment-docker):
```console
$ vllm serve unsloth/Llama-3.2-1B-Instruct
vllm serve unsloth/Llama-3.2-1B-Instruct
```
Then query the endpoint to get the latest metrics from the server:
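For instance, a minimal sketch using `requests`, assuming the server is listening on the default `localhost:8000`:

```python
import requests

# Metrics are exposed in Prometheus text format on the /metrics endpoint.
response = requests.get("http://localhost:8000/metrics")
print(response.text)
```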

View File

@@ -303,6 +303,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
```
Then, you can use the OpenAI client as follows:
```python
from openai import OpenAI

View File

@@ -64,7 +64,7 @@ Dynamic quantization is also supported via the `quantization` option -- see [her
#### Context length and batch size
You can further reduce memory usage by limit the context length of the model (`max_model_len` option)
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
```python
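# A minimal sketch (the model name is illustrative): cap the context window and
# the number of concurrently batched sequences to reduce memory usage.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", max_model_len=2048, max_num_seqs=2)
```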

View File

@@ -5,11 +5,13 @@
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
You can start the server via the [`vllm serve`](#vllm-serve) command, or through [Docker](#deployment-docker):
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```
To call the server, you can use the [official OpenAI Python client](https://github.com/openai/openai-python), or any other HTTP client.
```python
from openai import OpenAI
client = OpenAI(
@@ -50,6 +52,7 @@ In addition, we have the following custom APIs:
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
(chat-template)=
## Chat Template
In order for the language model to support chat protocol, vLLM requires the model to include
@@ -71,6 +74,7 @@ vLLM community provides a set of chat templates for popular models. You can find
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:
```python
completion = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
@@ -80,7 +84,7 @@ completion = client.chat.completions.create(
)
```
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
@@ -115,12 +119,12 @@ completion = client.chat.completions.create(
## Extra HTTP Headers
Only the `X-Request-Id` HTTP request header is supported for now. It can be enabled
with `--enable-request-id-headers`.
with `--enable-request-id-headers`.
> Note that enabling the headers can significantly impact performance at high QPS
> rates. We recommend implementing HTTP headers at the router level (e.g. via Istio),
> rather than within the vLLM layer for this reason.
> See https://github.com/vllm-project/vllm/pull/11529 for more details.
> See [this PR](https://github.com/vllm-project/vllm/pull/11529) for more details.
```python
completion = client.chat.completions.create(
@@ -147,6 +151,7 @@ print(completion._request_id)
## CLI Reference
(vllm-serve)=
### `vllm serve`
The `vllm serve` command is used to launch the OpenAI-compatible server.
@@ -175,7 +180,7 @@ uvicorn-log-level: "info"
To use the above config file:
```bash
$ vllm serve SOME_MODEL --config config.yaml
vllm serve SOME_MODEL --config config.yaml
```
```{note}
@@ -186,6 +191,7 @@ The order of priorities is `command line > config file values > defaults`.
## API Reference
(completions-api)=
### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
@@ -212,6 +218,7 @@ The following extra parameters are supported:
```
(chat-api)=
### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
@@ -243,6 +250,7 @@ The following extra parameters are supported:
```
(embeddings-api)=
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
@@ -284,6 +292,7 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
```
(tokenizer-api)=
### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
@@ -293,6 +302,7 @@ It consists of two endpoints:
- `/detokenize` corresponds to calling `tokenizer.decode()`.
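For example, a rough sketch of calling both endpoints with `requests` (assumes the default `localhost:8000` and whichever model the server was started with; exact field names may vary between versions):

```python
import requests

base = "http://localhost:8000"
model = "NousResearch/Meta-Llama-3-8B-Instruct"  # the model the server is serving

# Encode a prompt into token IDs.
tokens = requests.post(f"{base}/tokenize",
                       json={"model": model, "prompt": "Hello, world!"}).json()
print(tokens)

# Decode the token IDs back into text.
text = requests.post(f"{base}/detokenize",
                     json={"model": model, "tokens": tokens["tokens"]}).json()
print(text)
```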
(pooling-api)=
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
@@ -302,6 +312,7 @@ The input format is the same as [Embeddings API](#embeddings-api), but the outpu
Code example: <gh-file:examples/online_serving/openai_pooling_client.py>
(score-api)=
### Score API
Our Score API applies a cross-encoder model to predict scores for sentence pairs.
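As a rough sketch, a sentence pair can be scored over HTTP as follows (assumes a cross-encoder model served with `--task score` on the default port; the model name is illustrative and field names may differ between versions):

```python
import requests

# Ask the cross-encoder to score how relevant text_2 is to text_1.
response = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": "The capital of France is Paris.",
    },
)
print(response.json())
```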