[Doc] Reorganize user guide (#18661)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

---
title: Frequently Asked Questions
---
[](){ #faq }

> Q: How can I serve multiple models on a single port using the OpenAI API?

A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. However, you can run multiple instances of the server (each serving a different model) at the same time, and add another layer that routes each incoming request to the correct server.
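
For example, a minimal sketch of this pattern (model names and ports are illustrative) is to start one server per model and put a router or reverse proxy in front of them:

```console
# Terminal 1: serve the first model on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Terminal 2: serve the second model on port 8001
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8001
```

Requests can then be dispatched to the appropriate port based on the `model` field, e.g. by an nginx or similar routing layer sitting in front of both servers.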

---

> Q: Which model should I use for offline embedding inference?

A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
more are listed [here][supported-models].

By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected to be inferior to models that are specifically trained on embedding tasks.
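
As a minimal sketch of offline embedding (the `task="embed"` argument and the `llm.embed()` API are assumed to match your installed vLLM version):

```python
from vllm import LLM

# Load a model in embedding mode.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

# Embed a batch of prompts; each output contains a single embedding vector.
outputs = llm.embed(["Hello, my name is", "The capital of France is"])
for output in outputs:
    print(len(output.outputs.embedding))  # dimensionality of the embedding
```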

---

> Q: Can the output of a prompt vary across runs in vLLM?

A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).

In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with the numerical instability of Torch operations,
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.

## Mitigation Strategies

- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
- If using `bfloat16`, switching to `float16` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur; see the sketch after this list.
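
For instance, a per-request seed can be set via `SamplingParams` (the model name here is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# A fixed seed makes sampling more repeatable for temperature > 0,
# though small numerical differences across runs may still occur.
params = SamplingParams(temperature=0.8, seed=42, max_tokens=64)
print(llm.generate(["The future of AI is"], params)[0].outputs[0].text)
```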

---
title: Troubleshooting
---
[](){ #troubleshooting }

This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

!!! note
    Once you've debugged a problem, remember to turn off any debugging environment variables you defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow because debugging functionality is left activated.

## Hangs downloading a model

If the model isn't already downloaded to disk, vLLM will download it from the internet, which can take time and depends on your internet connection.
It's recommended to download the model first using the [huggingface-cli](https://huggingface.co/docs/huggingface_hub/en/guides/cli) and then pass the local path to the model to vLLM. This way, you can isolate the issue.
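
For example (the model name and local path are illustrative):

```console
# Download the model to the local Hugging Face cache (or pass --local-dir <path>)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct

# Then point vLLM at the local path instead of the model ID
vllm serve /path/to/downloaded/model
```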

## Hangs loading a model from disk

If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It's better to store the model on a local disk. Additionally, keep an eye on CPU memory usage: when the model is too large, it might consume a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.

!!! note
    To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check whether model downloading and loading is the bottleneck.
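
    For example (the model name is illustrative):

    ```console
    vllm serve meta-llama/Llama-3.1-8B-Instruct --load-format dummy
    ```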

## Out of memory

If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options][reducing-memory-usage] to reduce the memory consumption.

## Generation quality changed

In v0.8.0, the source of the default sampling parameters was changed in <gh-pr:12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.

In most cases, this should lead to higher-quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.

You can check if this is happening by trying the old defaults with `--generation-config vllm` for online serving and `generation_config="vllm"` for offline inference. If your generation quality improves after trying this, we recommend continuing to use the vLLM defaults and petitioning the model creator on <https://huggingface.co> to update their default `generation_config.json` so that it produces better-quality generations.
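
For example, for online serving (the model name is illustrative):

```console
# Ignore the model's generation_config.json and use vLLM's neutral defaults
vllm serve meta-llama/Llama-3.1-8B-Instruct --generation-config vllm
```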

## Enable more logging

If other strategies don't solve the problem, it's likely that the vLLM instance is stuck somewhere. You can use the following environment variables to help debug the issue:

- `export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging.
- `export CUDA_LAUNCH_BLOCKING=1` to identify which CUDA kernel is causing the problem.
- `export NCCL_DEBUG=TRACE` to turn on more logging for NCCL.
- `export VLLM_TRACE_FUNCTION=1` to record all function calls for inspection in the log files, to tell which function crashes or hangs.

## Incorrect network setup

The vLLM instance cannot get the correct IP address if you have a complicated network config. You should find a log line such as `DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl`; check that the IP address in it is the correct one.
If it's not, override the IP address using the environment variable `export VLLM_HOST_IP=<your_ip_address>`.

You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>` and `export GLOO_SOCKET_IFNAME=<your_network_interface>` to specify the network interface for the IP address.

## Error near `self.graph.replay()`

If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside the CUDAGraph.
To identify the particular CUDA operation that causes the error, add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the [LLM][vllm.LLM] class. This disables the CUDAGraph optimization, so the failing CUDA operation is no longer hidden inside graph replay.
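
For example, in offline inference (the model name is illustrative):

```python
from vllm import LLM

# Disabling CUDA graphs makes the failing CUDA operation appear directly
# in the stack trace instead of inside graph replay.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
```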

[](){ #troubleshooting-incorrect-hardware-driver }

## Incorrect hardware/driver

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.

```python
# Test PyTorch NCCL
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

if world_size <= 1:
    exit()

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
# pynccl is enabled by default for 0.6.5+,
# but for 0.6.4 and below, we need to enable it manually.
# Keep this line for backward compatibility, because people
# prefer to read the latest documentation.
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    out = pynccl.all_reduce(data, stream=s)
    value = out.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    out = pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = out.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
```

If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:

```console
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
```

If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:

```console
NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
```

If the script runs successfully, you should see the `is successful!` messages printed by each test.

If the test script hangs or crashes, it usually means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1`, to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.

!!! note
    A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign the node rank and specify the IP via command line arguments:

    - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
    - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.

    Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.

[](){ #troubleshooting-python-multiprocessing }

## Python multiprocessing

### `RuntimeError` Exception

If you have seen a warning in your logs like this:

```console
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
for more information.
```

or an error from Python that looks like this:

```console
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
```

then you must update your Python code to guard usage of `vllm` behind an `if
__name__ == '__main__':` block. For example, instead of this:

```python
import vllm

llm = vllm.LLM(...)
```

try this instead:

```python
if __name__ == '__main__':
    import vllm

    llm = vllm.LLM(...)
```

## `torch.compile` Error

vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces a dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check whether `torch.compile` is working as expected by running the following script:

```python
import torch

@torch.compile
def f(x):
    # a simple function to test torch.compile
    x = x + 1
    x = x * 2
    x = x.sin()
    return x

x = torch.randn(4, 4).cuda()
print(f(x))
```

If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for an example.

## Model failed to be inspected

If you see an error like:

```text
  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] failed to be inspected. Please check the logs for more details.
```

It means that vLLM failed to import the model file.
Usually, it is related to missing dependencies or outdated binaries in the vLLM build.
Please read the logs carefully to determine the root cause of the error.

## Model not supported

If you see an error like:

```text
Traceback (most recent call last):
...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable
```

or:

```text
  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
```

but you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps][model-resolution] to explicitly specify the vLLM implementation for the model.
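
As a sketch of what that can look like (the model and architecture names here are illustrative), the architecture can be specified explicitly through `hf_overrides`:

```python
from vllm import LLM

# Override the architectures reported by the model's config.json so that
# vLLM resolves the intended implementation.
llm = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},
)
```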

## Failed to infer device type

If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.

## Known Issues

- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent an NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234), all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. This does not affect performance, as the allocator only provides memory benefits. When external processes want to set up an NCCL connection with vLLM's processes, they should also set this environment variable; otherwise, an inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656).

# vLLM V1 User Guide

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

To disable V1, please set the environment variable `VLLM_USE_V1=0` and send us a GitHub issue sharing the reason!
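
For example (the model name is illustrative):

```console
VLLM_USE_V1=0 vllm serve meta-llama/Llama-3.1-8B-Instruct
```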

## Why vLLM V1?

vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.

Building on V0’s success, vLLM V1 retains the stable and proven components from V0
(such as the models, GPU kernels, and utilities). At the same time, it significantly
re-architects the core systems, covering the scheduler, KV cache manager, worker,
sampler, and API server, to provide a cohesive, maintainable framework that better
accommodates continued growth and innovation.

Specifically, V1 aims to:

- Provide a **simple, modular, and easy-to-hack codebase**.
- Ensure **high performance** with near-zero CPU overhead.
- **Combine key optimizations** into a unified architecture.
- Require **zero configs** by enabling features/optimizations by default.

We see significant performance improvements from upgrading to the V1 core engine, in
particular for long-context scenarios. Please see the performance benchmark (to be
added).

For more details, check out the vLLM V1 blog post [vLLM V1: A Major
Upgrade to vLLM’s Core Architecture](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html) (published Jan 27, 2025).

This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. The team has been actively working to make V1 the default engine, so this guide will be updated continually as more features are supported on vLLM V1.

### Support Overview

#### Hardware

| Hardware | Status |
|----------|------------------------------------------|
| **NVIDIA** | <nobr>🚀 Natively Supported</nobr> |
| **AMD** | <nobr>🚧 WIP</nobr> |
| **TPU** | <nobr>🚧 WIP</nobr> |

#### Feature / Model

| Feature / Model | Status |
|-----------------|-----------------------------------------------------------------------------------|
| **Prefix Caching** | <nobr>🚀 Optimized</nobr> |
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **LoRA** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices ([PR #15191](https://github.com/vllm-project/vllm/pull/15191))</nobr>|
| **Spec Decode** | <nobr>🚧 WIP ([PR #13933](https://github.com/vllm-project/vllm/pull/13933))</nobr>|
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **Structured Output Alternative Backends** | <nobr>🟡 Planned</nobr> |
| **Embedding Models** | <nobr>🟡 Planned ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249))</nobr> |
| **Mamba Models** | <nobr>🟡 Planned</nobr> |
| **Encoder-Decoder Models** | <nobr>🟡 Planned</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |

- **🚀 Optimized**: Nearly fully optimized, with no further work currently planned.
- **🟢 Functional**: Fully operational, with ongoing optimizations.
- **🚧 WIP**: Under active development.
- **🟡 Planned**: Scheduled for future implementation (some may have open PRs/RFCs).
- **🔴 Deprecated**: Not planned for V1 unless there is strong demand.

**Note**: vLLM V1’s unified scheduler treats both prompt and output tokens the same
way by using a simple dictionary (e.g., `{request_id: num_tokens}`) to dynamically
allocate a fixed token budget per request, enabling features like chunked prefills,
prefix caching, and speculative decoding without a strict separation between prefill
and decode phases.

### Semantic Changes and Deprecated Features

#### Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic
differences compared to V0:

**Logprobs Calculation**

Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e.
before applying any logits post-processing such as temperature scaling or penalty
adjustments). As a result, the returned logprobs do not reflect the final adjusted
probabilities used during sampling.

Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.

**Prompt Logprobs with Prefix Caching**

Currently, prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
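
For example, to request prompt logprobs on V1 today, prefix caching must be disabled when starting the server (the model name is illustrative):

```console
vllm serve meta-llama/Llama-3.1-8B-Instruct --no-enable-prefix-caching
```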

#### Deprecated Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

**Sampling features**

- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
  processing functions to adjust logits on a per-request basis. In vLLM V1, this
  feature has been deprecated. Instead, the design is moving toward supporting **global logits
  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).

**KV Cache features**

- **GPU <> CPU KV Cache Swapping**: With the new simplified core architecture, vLLM V1 no longer requires KV cache swapping
  to handle request preemptions.

**Structured Output features**

- **Request-level Structured Output Backend**: Deprecated; alternative backends
  (outlines, guidance) with fallbacks are a work in progress.

### Feature & Model Support in Progress

Although we have re-implemented and partially optimized many features and models from V0 in vLLM V1, optimization work is still ongoing for some, and others remain unsupported.

#### Features to Be Optimized

These features are already supported in vLLM V1, but their optimization is still
in progress.

- **Spec Decode**: Currently, only ngram-based spec decode is supported in V1. There
  will be follow-up work to support other types of spec decode (e.g., see [PR #13933](https://github.com/vllm-project/vllm/pull/13933)). We will prioritize support for EAGLE and MTP over draft-model-based spec decode.

- **Multimodal Models**: V1 is almost fully compatible with V0, except that interleaved modality input is not supported yet.
  See [here](https://github.com/orgs/vllm-project/projects/8) for the status of upcoming features and optimizations.

#### Features to Be Supported

- **Structured Output Alternative Backends**: Support for alternative structured output backends (outlines, guidance) is planned. V1 currently
  supports only the `xgrammar:no_fallback` mode, meaning that it will error out if the output schema is unsupported by xgrammar.
  Details about structured outputs can be found
  [here](https://docs.vllm.ai/en/latest/features/structured_outputs.html).

#### Models to Be Supported

vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol,
and the majority fall into the following categories. V1 support for these models will be added eventually.

**Embedding Models**
Instead of having a separate model runner, a hidden states processor ([RFC #12249](https://github.com/vllm-project/vllm/issues/12249)), which is based on the global logits processor ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360)), has been proposed to enable simultaneous generation and embedding using the same engine instance in V1. It is still in the planning stage.

**Mamba Models**
Models using selective state-space mechanisms (instead of standard transformer attention)
are not yet supported (e.g., `MambaForCausalLM`, `JambaForCausalLM`).

**Encoder-Decoder Models**
vLLM V1 is currently optimized for decoder-only transformers. Models requiring
cross-attention between separate encoder and decoder are not yet supported (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`).

For a complete list of supported models, see the [list of supported models](https://docs.vllm.ai/en/latest/models/supported_models.html).