[V1] Enable multi-input by default (#15799)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2025-04-12 16:52:39 +08:00
committed by GitHub
parent f069f3ea74
commit d9fc8cd9da
21 changed files with 214 additions and 105 deletions

@@ -759,7 +759,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
:::{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
Offline inference:
@@ -777,6 +777,8 @@ Online serving:
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
```
**This is no longer required if you are using vLLM V1.**
:::
:::{note}

@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
#### Disable unused modalities
You can disable unused modalities (except for text) by setting their limits to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
```python
from vllm import LLM
# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```
You can even run a multi-modal model for text-only inference:
```python
from vllm import LLM
# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```
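If your application tracks which modalities it actually uses, the limits can be derived instead of written by hand. The following is a minimal sketch: `build_mm_limits` is a hypothetical helper (not part of vLLM), and the caller is assumed to know which modalities the chosen model supports. Only the resulting dictionary is meant to be passed as `limit_mm_per_prompt`.

```python
from typing import Dict, Iterable


def build_mm_limits(accepted: Dict[str, int],
                    supported: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper (not a vLLM API).

    `accepted` maps each modality the application uses to its per-prompt limit.
    `supported` lists the modalities the model supports (supplied by the caller).
    Every supported modality that is not accepted gets a limit of zero.
    """
    supported = list(supported)
    for modality, count in accepted.items():
        if modality not in supported:
            raise ValueError(f"Model does not support modality: {modality!r}")
        if count < 0:
            raise ValueError(f"Limit for {modality!r} must be non-negative")
    return {modality: accepted.get(modality, 0) for modality in supported}


# Accept up to 4 images per prompt and disable videos.
limits = build_mm_limits({"image": 4}, supported=["image", "video"])
print(limits)

# With vLLM installed, the result could then be passed to the engine:
# from vllm import LLM
# llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", limit_mm_per_prompt=limits)
```

Passing an empty `accepted` mapping yields a zero limit for every supported modality, which corresponds to the text-only case above.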
### Performance optimization and tuning
You can potentially improve the performance of vLLM by tuning various options.