[Docs] Switch to better markdown linting pre-commit hook (#21851)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -6,13 +6,13 @@ toc_depth: 4

The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

```
```bash
vllm --help
```

Available Commands:

```
```bash
vllm {chat,complete,serve,bench,collect-env,run-batch}
```

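Each of the listed subcommands accepts its own `--help` flag, which is a quick way to explore its options; a minimal example:

```bash
# Inspect the options of a specific subcommand
vllm serve --help
```
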
@@ -40,6 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau

Use the `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (for example, when using autoscaling).

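A minimal sketch of pointing that cache at shared storage before launching the server (the mount path and model name are illustrative):

```bash
# Persist compiled XLA graphs on a shared volume so newly scaled nodes can reuse them
export VLLM_XLA_CACHE_PATH=/mnt/shared/vllm-xla-cache
vllm serve <your-model>
```
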
#### Reducing compilation time

This initial compilation time varies significantly and is affected by many of the arguments discussed in this optimization doc. Factors that influence compilation time include model size and `--max-num-batched-tokens`. Other arguments you can tune include `VLLM_TPU_MOST_MODEL_LEN`.

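A hedged sketch of tuning those knobs (the values are illustrative, not recommendations):

```bash
# Illustrative values only; see the factors discussed above
export VLLM_TPU_MOST_MODEL_LEN=2048
vllm serve <your-model> --max-num-batched-tokens 4096
```
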
### Optimize based on your data

@@ -71,12 +72,15 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p

However, you need to choose the padding gap carefully. If the gap is too small, the number of buckets is large, leading to increased warmup (precompile) time and more memory to store the compiled graphs. Too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement over the default exponential padding.

**If possible, use the precision that matches the chip’s hardware acceleration**

#### Quantization

If possible, use the precision that matches the chip’s hardware acceleration:

- v5e has int4/int8 hardware acceleration in the MXU
- v6e has int4/int8 hardware acceleration in the MXU

Supported quantized formats and features in vLLM on TPU [Jul '25]
Supported quantized formats and features in vLLM on TPU [Jul '25]:

- INT8 W8A8
- INT8 W8A16
- FP8 KV cache

@@ -84,11 +88,13 @@ Supported quantized formats and features in vLLM on TPU [Jul '25]

- [WIP] AWQ
- [WIP] FP4 W4A8

**Don't set TP to be less than the number of chips on a single-host deployment**

#### Parallelization

Don't set TP to be less than the number of chips on a single-host deployment.

Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types).

### Tune your workloads!
### Tune your workloads

Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.

@@ -99,6 +105,7 @@ Although we try to have great default configs, we strongly recommend you check o

The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.

#### SPMD

More details to come.

**Want us to cover something that isn't listed here? Please open an issue and cite this doc. We'd love to hear your questions or tips.**

@@ -20,19 +20,19 @@ the failure?

- **Use this title format:**

```
```text
[CI Failure]: failing-test-job - regex/matching/failing:test
```

- **For the environment field:**

```
Still failing on main as of commit abcdef123
```text
Still failing on main as of commit abcdef123
```

- **In the description, include failing tests:**

```
```text
FAILED failing/test.py:failing_test1 - Failure description
FAILED failing/test.py:failing_test2 - Failure description

https://github.com/orgs/vllm-project/projects/20

@@ -106,6 +106,7 @@ releases (which would take too much time), they can be built from

source to unblock the update process.

### FlashInfer

Here is how to build and install it from source with `torch2.7.0+cu128` in the vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):

```bash

@@ -121,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link]

team if you want to get the package published there.

### xFormers

Similar to FlashInfer, here is how to build and install xFormers from source:

```bash

@@ -138,7 +140,7 @@ uv pip install --system \

### causal-conv1d

```
```bash
uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
```

@@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following:

The deprecation process consists of several clearly defined stages that span
multiple Y releases:

**1. Deprecated (Still On By Default)**
### 1. Deprecated (Still On By Default)

- **Action**: Feature is marked as deprecated.
- **Timeline**: A removal version is explicitly stated in the deprecation

@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").

- GitHub Issue (RFC) for feedback
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs

**2. Deprecated (Off By Default)**
### 2. Deprecated (Off By Default)

- **Action**: Feature is disabled by default, but can still be re-enabled via a
CLI flag or environment variable. Feature throws an error when used without

@@ -55,7 +55,7 @@ re-enabling.

while signaling imminent removal. Ensures any remaining usage is clearly
surfaced and blocks silent breakage before full removal.

**3. Removed**
### 3. Removed

- **Action**: Feature is completely removed from the codebase.
- **Note**: Only features that have passed through the previous deprecation

@@ -112,13 +112,13 @@ vllm bench serve \

In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:

```
```bash
nsys sessions list
```

to get the session id in the form of `profile-XXXXX`, then run:

```
```bash
nsys stop --session=profile-XXXXX
```

@@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report

on GitHub. However, if you need to contact the VMT directly for an urgent issue,
you may contact the following individuals:

- Simon Mo - simon.mo@hey.com
- Russell Bryant - rbryant@redhat.com
- Huzaifa Sidhpurwala - huzaifas@redhat.com
- Simon Mo - <simon.mo@hey.com>
- Russell Bryant - <rbryant@redhat.com>
- Huzaifa Sidhpurwala - <huzaifas@redhat.com>

## Slack Discussion

@@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096

- Download and install [Anything LLM desktop](https://anythingllm.com/desktop).

- On the bottom left, open settings, then AI Providers --> LLM:
- LLM Provider: Generic OpenAI
- Base URL: http://{vllm server host}:{vllm server port}/v1
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`
- LLM Provider: Generic OpenAI
- Base URL: http://{vllm server host}:{vllm server port}/v1
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`



@@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096



- Click the upload button:
- upload the doc
- select the doc and move to the workspace
- save and embed
- upload the doc
- select the doc and move to the workspace
- save and embed



@@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

- Download and install [Chatbox desktop](https://chatboxai.app/en#download).

- On the bottom left of settings, Add Custom Provider
- API Mode: `OpenAI API Compatible`
- Name: vllm
- API Host: `http://{vllm server host}:{vllm server port}/v1`
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`
- API Mode: `OpenAI API Compatible`
- Name: vllm
- API Host: `http://{vllm server host}:{vllm server port}/v1`
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`



@@ -34,11 +34,11 @@ docker compose up -d

- In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it.

- Fill in the model provider details as follows:
- **Model Type**: `LLM`
- **Model Name**: `Qwen/Qwen1.5-7B-Chat`
- **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`
- **Model Type**: `LLM`
- **Model Name**: `Qwen/Qwen1.5-7B-Chat`
- **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1`
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`



@@ -1,7 +1,5 @@

# Haystack

# Haystack

[Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case.

It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints.

@@ -3,6 +3,7 @@

[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.

Here are the integrations:

- vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus)
- vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus)

@@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. Here is the e

```

In this YAML configuration:

* **`modelSpec`** includes:
* `name`: A nickname that you prefer to call the model.
* `repository`: Docker repository of vLLM.
* `tag`: Docker image tag.
* `modelURL`: The LLM model that you want to use.
* `name`: A nickname that you prefer to call the model.
* `repository`: Docker repository of vLLM.
* `tag`: Docker image tag.
* `modelURL`: The LLM model that you want to use.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.

@@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le

- [Deployment with CPUs](#deployment-with-cpus)
- [Deployment with GPUs](#deployment-with-gpus)
- [Troubleshooting](#troubleshooting)
- [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated)
- [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated)
- [Conclusion](#conclusion)

Alternatively, you can deploy vLLM to Kubernetes using any of the following:

@@ -361,7 +361,7 @@ instances in Prometheus.

We use this concept for the `vllm:cache_config_info` metric:

```
```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0

@@ -686,7 +686,7 @@ documentation for this option states:

The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace
as:

```
```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)

@@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica

## Detailed Design

### Overall Process

As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:

1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.

@@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b

The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example:

```
```text
cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0
```

@@ -70,6 +71,7 @@ pip install "vllm>=0.9.2"

## Run xPyD

### Instructions

- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.

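As a hedged sketch of where `kv_buffer_size` is set, the xPyD examples pass it through `--kv-transfer-config`; the connector name, role, and the ~8 GB value below are assumptions for illustration only:

```bash
# Illustrative only: ~10% of an 80 GB GPU reserved for the KV receive buffer
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --kv-transfer-config \
  '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":8e9}'
```
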
@@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified

* Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collisions.
* Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments.

> **Note 1:** We only cache full blocks.

!!! note "Note 1"
    We only cache full blocks.

> **Note 2:** The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash.
SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).

!!! note "Note 2"
    The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash.
    SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).

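A hedged sketch of that command line argument as exposed in recent vLLM releases (flag name assumed; model is a placeholder):

```bash
# Use SHA-256 instead of the builtin hash for prefix-cache block hashing
vllm serve <your-model> --prefix-caching-hash-algo sha256
```
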
**A hashing example with multi-modality inputs**

In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:

@@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache

With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.

> **Note:** Cache isolation is not supported in engine V0.

!!! note
    Cache isolation is not supported in engine V0.
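
For illustration, the salt travels with each request; a minimal sketch of a chat completion that opts into a private cache namespace (model name and salt value are placeholders):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-model>",
        "messages": [{"role": "user", "content": "Summarize the internal report."}],
        "cache_salt": "tenant-a-private-salt"
      }'
```
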
## Data Structure

@@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d

In the very verbose logs, we can see:

```
```console
INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile
```

@@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual

In the very verbose logs, we can also see:

```
```console
DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py')
...

@@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa

If we already have the cache directory (e.g. running the same code a second time), we will see the following logs:

```
```console
DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py')
```

@@ -36,9 +36,9 @@ th:not(:first-child) {

| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | |
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| [pooling](../models/pooling_models.md) | ✅\* | ✅\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | |

@@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

```

### Using API Endpoints

Loading a LoRA Adapter:

To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary

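A hedged sketch of such a request (the adapter name and path are placeholders):

```bash
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my_adapter", "lora_path": "/path/to/my-lora-adapter"}'
```
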
@@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \

```

### Using Plugins

Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources, such as the local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.

You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.

@@ -588,7 +588,9 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of the expected shape to the corresponding field of the multi-modal dictionary.

#### Image Embedding Inputs

For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
The following example demonstrates how to pass image embeddings to the OpenAI server:

@@ -97,7 +97,7 @@ for output in outputs:

    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

# Acknowledgement
## Acknowledgement

Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and
ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

@@ -134,8 +134,8 @@ lm_eval --model vllm \

- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration
- Tune key hyperparameters of the quantization algorithm:
- `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.
- `actorder` sets the activation ordering. When compressing the weights of a layer, the order in which channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency.
- `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.
- `actorder` sets the activation ordering. When compressing the weights of a layer, the order in which channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency.

The following is an example of an expanded quantization recipe you can tune to your own use case:

@@ -50,6 +50,7 @@ Here is an example of how to enable FP8 quantization:

```

The `kv_cache_dtype` argument specifies the data type for KV cache storage:

- `"auto"`: Uses the model's default "unquantized" data type
- `"fp8"` or `"fp8_e4m3"`: Supported on CUDA 11.8+ and ROCm (AMD GPU)
- `"fp8_e5m2"`: Supported on CUDA 11.8+

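A minimal sketch of selecting one of these dtypes at launch (model name is a placeholder):

```bash
# Store the KV cache in FP8 (E4M3) to reduce its memory footprint
vllm serve <your-model> --kv-cache-dtype fp8_e4m3
```
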
@@ -213,6 +213,7 @@ lm_eval --model vllm \

```

## Quark Quantization Script

In addition to the Python API example above, Quark also offers a
[quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html)
to quantize large language models more conveniently. It supports quantizing models with variety

@@ -13,6 +13,7 @@ pip install \

```

## Quantizing HuggingFace Models

You can quantize your own HuggingFace model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the HuggingFace hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:

??? code

@@ -164,7 +164,7 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe

### How to decide `VLLM_CPU_KVCACHE_SPACE`?

- This value is 4GB by default. A larger space can support more concurrent requests and longer context lengths. However, users should take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`; if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
This value is 4GB by default. A larger space can support more concurrent requests and longer context lengths. However, users should take care of the memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`; if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.

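A hedged sketch of raising that limit (the 40 GB value is purely illustrative and must fit within a single NUMA node alongside the weight shard):

```bash
# Reserve 40 GB per TP rank for the CPU KV cache
export VLLM_CPU_KVCACHE_SPACE=40
vllm serve <your-model>
```
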
### How to do performance tuning for vLLM CPU?

@@ -183,13 +183,13 @@ vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage mu

### Which quantization configs does vLLM CPU support?

- vLLM CPU supports the following quantizations:
- vLLM CPU supports the following quantizations:
- AWQ (x86 only)
- GPTQ (x86 only)
- compressed-tensor INT8 W8A8 (x86, s390x)

### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?

- Both of them require the `amx` CPU flag.
- Both of them require the `amx` CPU flag.
- `VLLM_CPU_MOE_PREPACK` can provide better performance for MoE models
- `VLLM_CPU_SGL_KERNEL` can provide better performance for MoE models and small-batch scenarios.

@@ -339,13 +339,13 @@ Each described step is logged by vLLM server, as follows (negative values corres

- `VLLM_{phase}_{dim}_BUCKET_{param}` - a collection of 12 environment variables configuring the ranges of the bucketing mechanism

* `{phase}` is either `PROMPT` or `DECODE`
- `{phase}` is either `PROMPT` or `DECODE`

* `{dim}` is either `BS`, `SEQ` or `BLOCK`
- `{dim}` is either `BS`, `SEQ` or `BLOCK`

* `{param}` is either `MIN`, `STEP` or `MAX`
- `{param}` is either `MIN`, `STEP` or `MAX`

* Default values:
- Default values:

| `{phase}` | Parameter | Env Variable | Value Expression |
|-----------|-----------|--------------|------------------|

@@ -1,7 +1,8 @@

# TPU

# TPU Supported Models
## Text-only Language Models

## Supported Models

### Text-only Language Models

| Model | Architecture | Supported |
|-----------------------------------------------------|--------------------------------|-----------|

@@ -45,10 +45,10 @@ If a model is neither supported natively by vLLM or Transformers, it can still b

For a model to be compatible with the Transformers backend for vLLM it must:

- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
* The model directory must have the correct structure (e.g. `config.json` is present).
* `config.json` must contain `auto_map.AutoModel`.
- The model directory must have the correct structure (e.g. `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
* Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).

If the compatible model is:

@@ -134,10 +134,10 @@ class MyConfig(PretrainedConfig):

- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
* You only need to do this for layers which are not present on all pipeline stages
* vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
* The `list` in the first element of the `tuple` contains the names of the input arguments
* The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
- You only need to do this for layers which are not present on all pipeline stages
- vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
- The `list` in the first element of the `tuple` contains the names of the input arguments
- The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code

## Loading a Model

@@ -99,7 +99,7 @@ From any node, enter a container and run `ray status` and `ray list nodes` to ve

### Running vLLM on a Ray cluster

!!! tip
    If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
    If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.

Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient.

@@ -31,11 +31,12 @@ vLLM provides three communication backends for EP:

Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:

```
```text
EP_SIZE = TP_SIZE × DP_SIZE
```

Where:

- `TP_SIZE`: Tensor parallel size (always 1 for now)
- `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically)

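As a hedged example of that arithmetic, a launch with tensor parallelism 1 and data parallelism 4 yields EP_SIZE = 1 × 4 = 4 (model name is a placeholder):

```bash
# EP_SIZE = TP_SIZE (1) × DP_SIZE (4) = 4 expert-parallel ranks
vllm serve <your-moe-model> \
  --enable-expert-parallel \
  --tensor-parallel-size 1 \
  --data-parallel-size 4
```
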
@@ -206,6 +206,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai

We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.

- *Note: `image_url.detail` parameter is not supported.*

Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>

@@ -13,15 +13,18 @@ All communications between nodes in a multi-node vLLM deployment are **insecure

The following options control inter-node communications in vLLM:

#### 1. **Environment Variables:**

- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on

- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on

#### 2. **KV Cache Transfer Configuration:**

- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579)

- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1)
- `--kv-port`: The port for KV cache transfer communications (default: 14579)

#### 3. **Data Parallel Configuration:**

- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)

- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1)
- `data_parallel_master_port`: Port of the data parallel master (default: 29500)
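
A minimal sketch of pinning the first of these on each node before launch (the address is illustrative; use the interface on your isolated network):

```bash
# Bind vLLM's inter-node communication to a dedicated private interface
export VLLM_HOST_IP=10.0.0.11
```
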
### Notes on PyTorch Distributed

@@ -41,18 +44,21 @@ Key points from the PyTorch security guide:

### Security Recommendations

#### 1. **Network Isolation:**

- Deploy vLLM nodes on a dedicated, isolated network
- Use network segmentation to prevent unauthorized access
- Implement appropriate firewall rules

- Deploy vLLM nodes on a dedicated, isolated network
- Use network segmentation to prevent unauthorized access
- Implement appropriate firewall rules

#### 2. **Configuration Best Practices:**

- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes

- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults
- Configure firewalls to only allow necessary ports between nodes

#### 3. **Access Control:**

- Restrict physical and network access to the deployment environment
- Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components

- Restrict physical and network access to the deployment environment
- Implement proper authentication and authorization for management interfaces
- Follow the principle of least privilege for all system components

## Security and Firewalls: Protecting Exposed vLLM Systems

@@ -148,7 +148,7 @@ are not yet supported.

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:

**Logprobs Calculation**
##### Logprobs Calculation

Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty

@@ -157,7 +157,7 @@ probabilities used during sampling.

Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.

**Prompt Logprobs with Prefix Caching**
##### Prompt Logprobs with Prefix Caching

Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).

@@ -165,7 +165,7 @@ Currently prompt logprobs are only supported when prefix caching is turned off v

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

**Sampling features**
##### Sampling features

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
- **Per-Request Logits Processors**: In V0, users could pass custom

@@ -173,11 +173,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha

feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).

**KV Cache features**
##### KV Cache features

- **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
to handle request preemptions.

**Structured Output features**
##### Structured Output features

- **Request-level Structured Output Backend**: Deprecated; alternative backends (outlines, guidance) with fallbacks are now supported.