Compare commits: v0.11.1rc2...amd_dev (10 commits)

| SHA1 |
|---|
| c7021f1270 |
| 2072fdc044 |
| 6eefda507a |
| a0003b56b0 |
| 5beacce2ea |
| 8669c69afa |
| 1651003c35 |
| 1cb8c6c5fe |
| e05a6754a8 |
| 084a9dae80 |
@@ -1,7 +1,7 @@
# default base image
ARG REMOTE_VLLM="0"
ARG COMMON_WORKDIR=/app
-ARG BASE_IMAGE=rocm/vllm-dev:base
+ARG BASE_IMAGE=rocm/vllm-dev:base_custom_1020_rc1_20251008_tuned_20251008

FROM ${BASE_IMAGE} AS base
@@ -213,22 +213,22 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),



-**Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
+**Time 2: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

-
+

-**Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
+**Time 3: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.

-
+

-**Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.
+**Time 4: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.

-
+

-**Time 6: Request 1 is finished and free.**
+**Time 5: Request 1 is finished and free.**

-
+

-**Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
+**Time 6: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

-
+
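The touch-before-evict behavior in the renumbered Time 6 step can be sketched with a toy free queue. This is an illustrative model only; the `FreeQueue` helper below is not vLLM's actual block pool code.

```python
from collections import OrderedDict

class FreeQueue:
    """Toy model of the free-block queue: freed blocks are appended to the
    tail, eviction pops from the head, and a cache hit 'touches' a block,
    removing it from the queue so it cannot be evicted while reused."""

    def __init__(self) -> None:
        self._queue = OrderedDict()   # block_id -> None, insertion-ordered

    def free(self, block_id: int) -> None:
        self._queue[block_id] = None                    # append to the tail

    def touch(self, block_id: int) -> None:
        self._queue.pop(block_id, None)                 # cache hit: not evictable

    def evict(self) -> int:
        block_id, _ = self._queue.popitem(last=False)   # pop from the head
        return block_id

# Time 6 example: free queue is 7-8-9-4-3-2-6-5-1-0 and blocks 0, 1, 2 hit the cache.
q = FreeQueue()
for b in [7, 8, 9, 4, 3, 2, 6, 5, 1, 0]:
    q.free(b)
for b in [0, 1, 2]:
    q.touch(b)
print([q.evict() for _ in range(5)])  # [7, 8, 9, 4, 3]: the blocks allocated to Request 2
```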
@@ -16,8 +16,8 @@
| meta-llama/Llama-4-* | Llama4ForConditionalGeneration | ❌ |
| microsoft/Phi-3-mini-128k-instruct | Phi3ForCausalLM | 🟨 |
| microsoft/phi-4 | Phi3ForCausalLM | ❌ |
-| google/gemma-3-27b-it | TransformersForMultimodalLM | 🟨 |
-| google/gemma-3-4b-it | TransformersForMultimodalLM | ❌ |
+| google/gemma-3-27b-it | Gemma3ForConditionalGeneration | 🟨 |
+| google/gemma-3-4b-it | Gemma3ForConditionalGeneration | ❌ |
| deepseek-ai/DeepSeek-R1 | DeepseekV3ForCausalLM | ❌ |
| deepseek-ai/DeepSeek-V3 | DeepseekV3ForCausalLM | ❌ |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 | LlamaForCausalLM | ✅ |
@@ -642,6 +642,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| `DeepseekOCRForCausalLM` | DeepSeek-OCR | T + I<sup>+</sup> | `deepseek-ai/DeepSeek-OCR`, etc. | | ✅︎ |
| `Ernie4_5_VLMoeForConditionalGeneration` | Ernie4.5-VL | T + I<sup>+</sup>/ V<sup>+</sup> | `baidu/ERNIE-4.5-VL-28B-A3B-PT`, `baidu/ERNIE-4.5-VL-424B-A47B-PT` | | ✅︎ |
| `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ |
+| `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ |
| `Gemma3nForConditionalGeneration` | Gemma 3n | T + I + A | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | |
| `GLM4VForCausalLM`<sup>^</sup> | GLM-4V | T + I | `zai-org/glm-4v-9b`, `zai-org/cogagent-9b-20241220`, etc. | ✅︎ | ✅︎ |
| `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.1V-9B-Thinking`, etc. | ✅︎ | ✅︎ |

@@ -671,6 +672,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ |
| `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ |
| `Ovis2_5` | Ovis2.5 | T + I<sup>+</sup> + V | `AIDC-AI/Ovis2.5-9B`, etc. | | |
+| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + I<sup>E</sup> | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ |
| `Phi3VForCausalLM` | Phi-3-Vision, Phi-3.5-Vision | T + I<sup>E+</sup> | `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc. | | ✅︎ |
| `Phi4MMForCausalLM` | Phi-4-multimodal | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct`, etc. | ✅︎ | ✅︎ |
| `Phi4MultimodalForCausalLM` | Phi-4-multimodal (HF Transformers) | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct` (with revision `refs/pr/70`), etc. | ✅︎ | ✅︎ |

@@ -695,8 +697,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
|--------------|--------|--------|-------------------|-----------------------------|-----------------------------------------|
| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ |
-| `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ |
-| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + I<sup>E</sup> | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ |

<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
• For example, to use DeepSeek-VL2 series models:
@@ -705,7 +705,21 @@ Some models are supported only via the [Transformers backend](#transformers). Th
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.

!!! warning
-    For `Gemma3ForConditionalGeneration`, `{"do_pan_and_scan": true}` is not supported in Transformers backend yet.
+    Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
+    However, there are differences in how they handle text + image inputs:
+
+    V0 correctly implements the model's attention pattern:
+    - Uses bidirectional attention between the image tokens corresponding to the same image
+    - Uses causal attention for other tokens
+    - Implemented via (naive) PyTorch SDPA with masking tensors
+    - Note: May use significant memory for long prompts with image
+
+    V1 currently uses a simplified attention pattern:
+    - Uses causal attention for all tokens, including image tokens
+    - Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}`
+    - Will be updated in the future to support the correct behavior
+
+    This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
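The V0 attention pattern described in this warning (bidirectional between the image tokens of an image, causal elsewhere) can be sketched with a small additive mask. The helper below is illustrative only, treats all image tokens as belonging to a single image, and is not vLLM's implementation.

```python
import torch

def mixed_gemma3_mask(token_ids: torch.Tensor, image_token_id: int) -> torch.Tensor:
    """Additive attention mask: causal for text tokens, bidirectional
    between image tokens (all image tokens treated as one image here)."""
    seq_len = token_ids.shape[0]
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask = mask.triu(diagonal=1)               # causal: -inf strictly above the diagonal
    img_pos = token_ids == image_token_id
    both_img = img_pos[:, None] & img_pos[None, :]
    mask[both_img] = 0.0                       # image tokens attend to each other freely
    return mask

# Toy sequence: 2 text tokens, 3 image tokens (id 9), 1 text token.
print(mixed_gemma3_mask(torch.tensor([1, 2, 9, 9, 9, 3]), image_token_id=9))
```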
!!! note
    `Gemma3nForConditionalGeneration` is only supported on V1 due to shared KV caching and it depends on `timm>=1.0.17` to make use of its

@@ -757,6 +771,9 @@ Some models are supported only via the [Transformers backend](#transformers). Th
    The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
    For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>

+!!! warning
+    Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
+
!!! note
    For Qwen2.5-Omni and Qwen3-Omni, reading audio from video pre-processing (`--mm-processor-kwargs '{"use_audio_in_video": true}'`) is currently work in progress and not yet supported.
@@ -319,8 +319,7 @@ def run_gemma3(questions: list[str], modality: str) -> ModelRequestData:
        model=model_name,
        max_model_len=2048,
        max_num_seqs=2,
-        # TODO: Support this in transformers backend
-        # mm_processor_kwargs={"do_pan_and_scan": True},
+        mm_processor_kwargs={"do_pan_and_scan": True},
        limit_mm_per_prompt={modality: 1},
    )
@@ -25,7 +25,7 @@ from vllm.distributed.parallel_state import (
    initialize_model_parallel,
)
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables

from ..models.registry import HF_EXAMPLE_MODELS
from ..utils import (
@@ -31,7 +31,7 @@ from vllm.model_executor.layers.quantization.utils.w8a8_utils import (
    GroupShape,
)
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables

from ..utils import has_module_attribute, multi_gpu_test
from .backend import TestBackend
@@ -29,7 +29,7 @@ from vllm.distributed.parallel_state import (
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.quantization.utils.w8a8_utils import Fp8LinearOp
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables

from ..utils import multi_gpu_test
from .backend import TestBackend
@@ -15,7 +15,7 @@ from vllm.distributed.parallel_state import (
    get_tp_group,
    init_distributed_environment,
)
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables


def distributed_run(fn, world_size):
@@ -23,7 +23,7 @@ from vllm.distributed.parallel_state import (
    initialize_model_parallel,
)
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables

torch.manual_seed(42)
random.seed(44)
@@ -18,7 +18,7 @@ from vllm.distributed.parallel_state import (
    graph_capture,
    init_distributed_environment,
)
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables


def distributed_run(fn, world_size):
@@ -10,8 +10,8 @@ import torch.distributed as dist

from vllm.distributed.device_communicators.shm_broadcast import MessageQueue
from vllm.distributed.utils import StatelessProcessGroup
-from vllm.utils import update_environment_variables
+from vllm.utils.network_utils import get_open_port
+from vllm.utils.system_utils import update_environment_variables


def get_arrays(n: int, seed: int = 0) -> list[np.ndarray]:
@@ -23,7 +23,7 @@ from vllm.distributed.parallel_state import (
from vllm.engine.arg_utils import EngineArgs
from vllm.engine.llm_engine import LLMEngine
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables

torch.manual_seed(42)
random.seed(44)
@@ -10,8 +10,8 @@ import torch
import vllm.envs as envs
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
from vllm.distributed.utils import StatelessProcessGroup
-from vllm.utils import update_environment_variables
+from vllm.utils.network_utils import get_open_port
+from vllm.utils.system_utils import update_environment_variables
from vllm.utils.torch_utils import cuda_device_count_stateless

from ..utils import multi_gpu_test
@@ -13,7 +13,7 @@ from vllm.distributed.parallel_state import (
)
from vllm.model_executor.layers.mamba.mamba_mixer2 import Mixer2RMSNormGated
from vllm.platforms import current_platform
-from vllm.utils import update_environment_variables
+from vllm.utils.system_utils import update_environment_variables


@multi_gpu_test(num_gpus=2)
@@ -3,7 +3,7 @@
import numpy as np
import pytest

-MODELS = ["google/gemma-2b", "google/gemma-2-2b"]
+MODELS = ["google/gemma-2b", "google/gemma-2-2b", "google/gemma-3-4b-it"]


@pytest.mark.parametrize("model", MODELS)
@@ -14,8 +14,14 @@ def test_dummy_loader(vllm_runner, monkeypatch, model: str) -> None:
        model,
        load_format="dummy",
    ) as llm:
-        normalizers = llm.apply_model(
-            lambda model: model.model.normalizer.cpu().item()
-        )
-        config = llm.llm.llm_engine.model_config.hf_config
+        if model == "google/gemma-3-4b-it":
+            normalizers = llm.llm.collective_rpc(
+                lambda self: self.model_runner.model.language_model.model.normalizer.cpu().item() # noqa: E501
+            )
+            config = llm.llm.llm_engine.model_config.hf_config.text_config
+        else:
+            normalizers = llm.llm.collective_rpc(
+                lambda self: self.model_runner.model.model.normalizer.cpu().item()
+            )
+            config = llm.llm.llm_engine.model_config.hf_config
        assert np.allclose(normalizers, config.hidden_size**0.5, rtol=2e-3)
@ -112,6 +112,25 @@ VLM_TEST_SETTINGS = {
|
||||
vllm_runner_kwargs={"enable_mm_embeds": True},
|
||||
marks=[pytest.mark.core_model, pytest.mark.cpu_model],
|
||||
),
|
||||
"paligemma": VLMTestInfo(
|
||||
models=["google/paligemma-3b-mix-224"],
|
||||
test_type=VLMTestType.IMAGE,
|
||||
prompt_formatter=identity,
|
||||
img_idx_to_prompt=lambda idx: "",
|
||||
# Paligemma uses its own sample prompts because the default one fails
|
||||
single_image_prompts=IMAGE_ASSETS.prompts(
|
||||
{
|
||||
"stop_sign": "caption es",
|
||||
"cherry_blossom": "What is in the picture?",
|
||||
}
|
||||
),
|
||||
auto_cls=AutoModelForImageTextToText,
|
||||
vllm_output_post_proc=model_utils.paligemma_vllm_to_hf_output,
|
||||
dtype="bfloat16",
|
||||
marks=[
|
||||
pytest.mark.skip(reason="vLLM does not support PrefixLM attention mask")
|
||||
],
|
||||
),
|
||||
"qwen2_5_vl": VLMTestInfo(
|
||||
models=["Qwen/Qwen2.5-VL-3B-Instruct"],
|
||||
test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE, VLMTestType.VIDEO),
|
||||
@ -176,24 +195,14 @@ VLM_TEST_SETTINGS = {
|
||||
# Gemma3 has bidirectional mask on images
|
||||
"gemma3-transformers": VLMTestInfo(
|
||||
models=["google/gemma-3-4b-it"],
|
||||
test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
|
||||
prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n", # noqa: E501
|
||||
single_image_prompts=IMAGE_ASSETS.prompts(
|
||||
{
|
||||
"stop_sign": "<start_of_image>What's the content in the center of the image?", # noqa: E501
|
||||
"cherry_blossom": "<start_of_image>What is the season?",
|
||||
}
|
||||
),
|
||||
multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.", # noqa: E501
|
||||
max_model_len=8192,
|
||||
test_type=VLMTestType.IMAGE,
|
||||
prompt_formatter=lambda vid_prompt: f"<'<bos><start_of_turn>user\n{vid_prompt}<start_of_image><end_of_turn>\n<start_of_turn>model\n", # noqa: E501
|
||||
max_model_len=4096,
|
||||
auto_cls=AutoModelForImageTextToText,
|
||||
# TODO: Support `do_pan_and_scan` in transformers backend
|
||||
# patch_hf_runner=model_utils.gemma3_patch_hf_runner,
|
||||
vllm_output_post_proc=model_utils.gemma3_vllm_to_hf_output,
|
||||
image_size_factors=[(0.25, 0.5, 1.0)],
|
||||
vllm_runner_kwargs={
|
||||
"model_impl": "transformers",
|
||||
# "mm_processor_kwargs": {"do_pan_and_scan": True},
|
||||
},
|
||||
marks=[pytest.mark.core_model],
|
||||
),
|
||||
@ -212,27 +221,6 @@ VLM_TEST_SETTINGS = {
|
||||
},
|
||||
marks=[pytest.mark.core_model],
|
||||
),
|
||||
# PaliGemma has PrefixLM attention
|
||||
"paligemma-transformers": VLMTestInfo(
|
||||
models=["google/paligemma-3b-mix-224"],
|
||||
test_type=VLMTestType.IMAGE,
|
||||
prompt_formatter=identity,
|
||||
img_idx_to_prompt=lambda idx: "",
|
||||
# PaliGemma uses its own sample prompts because the default one fails
|
||||
single_image_prompts=IMAGE_ASSETS.prompts(
|
||||
{
|
||||
"stop_sign": "caption es",
|
||||
"cherry_blossom": "What is in the picture?",
|
||||
}
|
||||
),
|
||||
auto_cls=AutoModelForImageTextToText,
|
||||
vllm_output_post_proc=model_utils.paligemma_vllm_to_hf_output,
|
||||
image_size_factors=[(0.25, 0.5, 1.0)],
|
||||
vllm_runner_kwargs={
|
||||
"model_impl": "transformers",
|
||||
},
|
||||
marks=[pytest.mark.core_model],
|
||||
),
|
||||
# Pixel values from processor are not 4D or 5D arrays
|
||||
"qwen2_5_vl-transformers": VLMTestInfo(
|
||||
models=["Qwen/Qwen2.5-VL-3B-Instruct"],
|
||||
@ -359,6 +347,24 @@ VLM_TEST_SETTINGS = {
|
||||
image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)],
|
||||
marks=[large_gpu_mark(min_gb=32)],
|
||||
),
|
||||
"gemma3": VLMTestInfo(
|
||||
models=["google/gemma-3-4b-it"],
|
||||
test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
|
||||
prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n", # noqa: E501
|
||||
single_image_prompts=IMAGE_ASSETS.prompts(
|
||||
{
|
||||
"stop_sign": "<start_of_image>What's the content in the center of the image?", # noqa: E501
|
||||
"cherry_blossom": "<start_of_image>What is the season?",
|
||||
}
|
||||
),
|
||||
multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.", # noqa: E501
|
||||
max_model_len=4096,
|
||||
max_num_seqs=2,
|
||||
auto_cls=AutoModelForImageTextToText,
|
||||
vllm_runner_kwargs={"mm_processor_kwargs": {"do_pan_and_scan": True}},
|
||||
patch_hf_runner=model_utils.gemma3_patch_hf_runner,
|
||||
num_logprobs=10,
|
||||
),
|
||||
"glm4v": VLMTestInfo(
|
||||
models=["zai-org/glm-4v-9b"],
|
||||
test_type=VLMTestType.IMAGE,
|
||||
|
||||
@ -328,6 +328,16 @@ def gemma3_patch_hf_runner(hf_model: HfRunner) -> HfRunner:
|
||||
|
||||
hf_model.processor = processor
|
||||
|
||||
orig_generate = hf_model.model.generate
|
||||
|
||||
def _generate(self, *args, **kwargs):
|
||||
# FIXME: https://github.com/huggingface/transformers/issues/38333
|
||||
kwargs["disable_compile"] = True
|
||||
|
||||
return orig_generate(*args, **kwargs)
|
||||
|
||||
hf_model.model.generate = types.MethodType(_generate, hf_model.model)
|
||||
|
||||
return hf_model
|
||||
|
||||
|
||||
|
||||
@ -222,6 +222,7 @@ def _test_processing_correctness(
|
||||
_ADD_SPECIAL_TOKENS_OVERRIDES = {
|
||||
"ovis": False,
|
||||
"ovis2_5": False,
|
||||
"paligemma": False,
|
||||
"ultravox": False,
|
||||
"whisper": False,
|
||||
}
|
||||
@ -333,6 +334,7 @@ def _test_processing_correctness_one(
|
||||
"deepseek-ai/deepseek-vl2-tiny",
|
||||
"baidu/ERNIE-4.5-VL-28B-A3B-PT",
|
||||
"adept/fuyu-8b",
|
||||
"google/gemma-3-4b-it",
|
||||
"google/gemma-3n-E2B-it",
|
||||
"zai-org/glm-4v-9b",
|
||||
"zai-org/GLM-4.1V-9B-Thinking",
|
||||
@ -369,6 +371,8 @@ def _test_processing_correctness_one(
|
||||
"AIDC-AI/Ovis1.6-Llama3.2-3B",
|
||||
"AIDC-AI/Ovis2-1B",
|
||||
"AIDC-AI/Ovis2.5-2B",
|
||||
"google/paligemma-3b-mix-224",
|
||||
"google/paligemma2-3b-ft-docci-448",
|
||||
"microsoft/Phi-3.5-vision-instruct",
|
||||
"microsoft/Phi-4-multimodal-instruct",
|
||||
"mistralai/Pixtral-12B-2409",
|
||||
|
||||
@ -48,6 +48,7 @@ ARCH_NEEDS_EXTRAS = [
|
||||
"Idefics3ForConditionalGeneration",
|
||||
"LlavaForConditionalGeneration",
|
||||
"MiniCPMV",
|
||||
"PaliGemmaForConditionalGeneration",
|
||||
]
|
||||
REPO_ID_TO_SKIP = {
|
||||
"nm-testing/pixtral-12b-FP8-dynamic": "duplicated test",
|
||||
|
||||
@ -19,8 +19,8 @@ from vllm.model_executor.models.vision import (
|
||||
run_dp_sharded_vision_model,
|
||||
)
|
||||
from vllm.platforms import current_platform
|
||||
from vllm.utils import update_environment_variables
|
||||
from vllm.utils.network_utils import get_open_port
|
||||
from vllm.utils.system_utils import update_environment_variables
|
||||
|
||||
pytestmark = pytest.mark.cpu_test
|
||||
|
||||
|
||||
tests/utils_/test_system_utils.py (new file, 19 lines)
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+
+import tempfile
+from pathlib import Path
+
+from vllm.utils.system_utils import unique_filepath
+
+
+def test_unique_filepath():
+    temp_dir = tempfile.mkdtemp()
+    path_fn = lambda i: Path(temp_dir) / f"file_{i}.txt"
+    paths = set()
+    for i in range(10):
+        path = unique_filepath(path_fn)
+        path.write_text("test")
+        paths.add(path)
+    assert len(paths) == 10
+    assert len(list(Path(temp_dir).glob("*.txt"))) == 10
@ -19,7 +19,6 @@ from vllm.transformers_utils.detokenizer_utils import convert_ids_list_to_tokens
|
||||
from vllm.utils import (
|
||||
FlexibleArgumentParser,
|
||||
bind_kv_cache,
|
||||
unique_filepath,
|
||||
)
|
||||
from ..utils import create_new_process_for_each_test, flat_product
|
||||
|
||||
@ -466,18 +465,6 @@ def test_load_config_file(tmp_path):
|
||||
os.remove(str(config_file_path))
|
||||
|
||||
|
||||
def test_unique_filepath():
|
||||
temp_dir = tempfile.mkdtemp()
|
||||
path_fn = lambda i: Path(temp_dir) / f"file_{i}.txt"
|
||||
paths = set()
|
||||
for i in range(10):
|
||||
path = unique_filepath(path_fn)
|
||||
path.write_text("test")
|
||||
paths.add(path)
|
||||
assert len(paths) == 10
|
||||
assert len(list(Path(temp_dir).glob("*.txt"))) == 10
|
||||
|
||||
|
||||
def test_flat_product():
|
||||
# Check regular itertools.product behavior
|
||||
result1 = list(flat_product([1, 2, 3], ["a", "b"]))
|
||||
|
||||
@ -8,6 +8,7 @@ import torch
|
||||
|
||||
import vllm.v1.core.kv_cache_utils as kv_cache_utils
|
||||
from vllm.config import ModelConfig, SchedulerConfig, VllmConfig
|
||||
from vllm.lora.request import LoRARequest
|
||||
from vllm.multimodal.inputs import (
|
||||
MultiModalFeatureSpec,
|
||||
MultiModalKwargsItem,
|
||||
@ -449,6 +450,24 @@ def test_generate_block_hash_extra_keys_cache_salt():
|
||||
assert next_mm_idx == 1
|
||||
|
||||
|
||||
def test_generate_block_hash_extra_keys_lora():
|
||||
request = make_request(
|
||||
request_id="0",
|
||||
prompt_token_ids=[_ for _ in range(6)],
|
||||
)
|
||||
|
||||
request.lora_request = LoRARequest(
|
||||
lora_name="test_lora_adapter", lora_int_id=1, lora_path="/path/to/lora"
|
||||
)
|
||||
|
||||
extra_keys, _ = generate_block_hash_extra_keys(request, 0, 3, 0)
|
||||
assert extra_keys == ("test_lora_adapter",)
|
||||
|
||||
request.lora_request = None
|
||||
extra_keys, _ = generate_block_hash_extra_keys(request, 0, 3, 0)
|
||||
assert extra_keys is None
|
||||
|
||||
|
||||
@pytest.mark.parametrize("hash_fn", [sha256, sha256_cbor])
|
||||
def test_hash_block_tokens(hash_fn):
|
||||
parent_block_hash = BlockHash(b"123")
|
||||
|
||||
@ -21,8 +21,8 @@ from vllm.distributed.parallel_state import (
|
||||
from vllm.model_executor.layers.mamba.mamba_mixer2 import MambaMixer2
|
||||
from vllm.platforms import current_platform
|
||||
from vllm.sampling_params import SamplingParams
|
||||
from vllm.utils import update_environment_variables
|
||||
from vllm.utils.mem_constants import GiB_bytes
|
||||
from vllm.utils.system_utils import update_environment_variables
|
||||
from vllm.v1.core.kv_cache_utils import estimate_max_model_len, get_kv_cache_configs
|
||||
from vllm.v1.core.sched.output import CachedRequestData, NewRequestData, SchedulerOutput
|
||||
from vllm.v1.kv_cache_interface import (
|
||||
|
||||
@ -8,7 +8,7 @@ from vllm import envs
|
||||
from vllm.config import VllmConfig, set_current_vllm_config
|
||||
from vllm.logger import init_logger
|
||||
from vllm.platforms import current_platform
|
||||
from vllm.utils import set_env_var
|
||||
from vllm.utils.system_utils import set_env_var
|
||||
|
||||
from .post_cleanup import PostCleanupPass
|
||||
from .vllm_inductor_pass import VllmInductorPass
|
||||
|
||||
@ -114,7 +114,7 @@ class VllmPatternMatcherPass(VllmInductorPass):
|
||||
|
||||
debug_dump_path.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
from vllm.utils import unique_filepath
|
||||
from vllm.utils.system_utils import unique_filepath
|
||||
|
||||
file_path = unique_filepath(
|
||||
lambda i: debug_dump_path / f"patterns.{self.pass_name}.{i}.py"
|
||||
|
||||
@@ -290,7 +290,7 @@ class CompilationConfig:
    constructor, e.g. `CompilationConfig(inductor_passes={"a": func})`."""

    # CudaGraph compilation
-    cudagraph_mode: CUDAGraphMode | None = None
+    cudagraph_mode: CUDAGraphMode | None = CUDAGraphMode.FULL
    """
    The mode of the cudagraph:
@@ -521,6 +521,16 @@ class CompilationConfig:
        count_all = self.custom_ops.count("all")
        assert count_none + count_all <= 1, "Can only specify 'none' or 'all'"

+        if "+rms_norm" not in self.custom_ops and "-rms_norm" not in self.custom_ops:
+            self.custom_ops.append("+rms_norm")
+        if (
+            "+silu_and_mul" not in self.custom_ops
+            and "-silu_and_mul" not in self.custom_ops
+        ):
+            self.custom_ops.append("+silu_and_mul")
+        if "+quant_fp8" not in self.custom_ops and "-quant_fp8" not in self.custom_ops:
+            self.custom_ops.append("+quant_fp8")
+
        # TODO(zou3519/luka): There are 2 issues with auto-functionalization V2:
        # 1. A bug in PyTorch, fixed in 2.7:
        # https://github.com/pytorch/pytorch/issues/147924
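The defaulting added in this hunk only fills in ops that the user has not explicitly enabled or disabled. A standalone restatement (an illustrative helper, not the `CompilationConfig` method itself) behaves like this:

```python
def apply_default_custom_ops(custom_ops: list[str]) -> list[str]:
    # Mirrors the diff above: add "+op" only if neither "+op" nor "-op" was given.
    for op in ("rms_norm", "silu_and_mul", "quant_fp8"):
        if f"+{op}" not in custom_ops and f"-{op}" not in custom_ops:
            custom_ops.append(f"+{op}")
    return custom_ops

print(apply_default_custom_ops([]))             # ['+rms_norm', '+silu_and_mul', '+quant_fp8']
print(apply_default_custom_ops(["-rms_norm"]))  # ['-rms_norm', '+silu_and_mul', '+quant_fp8']
```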
@@ -752,7 +762,12 @@
            # captured. see https://github.com/vllm-project/vllm/pull/20059
            # for details. Make a copy to avoid mutating the class-level
            # list via reference.
-            self.splitting_ops = list(self._attention_ops)
+            self.splitting_ops = (
+                []
+                if self.cudagraph_mode == CUDAGraphMode.FULL
+                else list(self._attention_ops)
+            )

        elif len(self.splitting_ops) == 0:
            logger.warning_once("Using piecewise compilation with empty splitting_ops")
            if self.cudagraph_mode == CUDAGraphMode.PIECEWISE:
@@ -2,6 +2,9 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

+from typing import Literal
+
+from pydantic import Field
from pydantic.dataclasses import dataclass

from vllm.config.utils import config

@@ -17,7 +20,7 @@ class KVEventsConfig:
    Events can be published externally by zmq using the event publisher config.
    """

-    publisher: str = "null"
+    publisher: Literal["null", "zmq"] = Field(default=None)
    """The publisher to use for publishing kv events. Can be "null", "zmq".
    """

@@ -47,3 +50,7 @@ class KVEventsConfig:
    """The topic to use for the event publisher. Consumers can subscribe to
    this topic to receive events.
    """
+
+    def __post_init__(self):
+        if self.publisher is None:
+            self.publisher = "zmq" if self.enable_kv_cache_events else "null"
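With the `__post_init__` added above, an unset `publisher` is now derived from `enable_kv_cache_events`. The resolution can be restated as a standalone sketch (illustrative helper only, not the pydantic dataclass itself):

```python
def resolve_publisher(publisher: str | None, enable_kv_cache_events: bool) -> str:
    # Mirrors the __post_init__ in the diff above.
    if publisher is None:
        publisher = "zmq" if enable_kv_cache_events else "null"
    return publisher

print(resolve_publisher(None, True))    # zmq
print(resolve_publisher(None, False))   # null
print(resolve_publisher("zmq", False))  # an explicit setting is kept as-is
```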
@ -22,7 +22,7 @@ from vllm.logger import init_logger
|
||||
from vllm.model_executor.layers.batch_invariant import (
|
||||
vllm_is_batch_invariant,
|
||||
)
|
||||
from vllm.utils import update_environment_variables
|
||||
from vllm.utils.system_utils import update_environment_variables
|
||||
from vllm.utils.torch_utils import cuda_device_count_stateless
|
||||
|
||||
logger = init_logger(__name__)
|
||||
|
||||
@@ -353,12 +353,12 @@ class EventPublisherFactory:
        cls, config: KVEventsConfig | None, data_parallel_rank: int = 0
    ) -> EventPublisher:
        """Create publisher from a config mapping."""
-        if not config:
+        if config is None or config.publisher == "null":
            return NullEventPublisher()

        config_dict = asdict(config)

-        kind = config_dict.pop("publisher", "null")
+        kind = config_dict.pop("publisher")
        config_dict.pop("enable_kv_cache_events")
        try:
            constructor = cls._registry[kind]
@ -18,12 +18,9 @@ from vllm.entrypoints.openai.cli_args import make_arg_parser, validate_parsed_se
|
||||
from vllm.entrypoints.utils import VLLM_SUBCMD_PARSER_EPILOG
|
||||
from vllm.logger import init_logger
|
||||
from vllm.usage.usage_lib import UsageContext
|
||||
from vllm.utils import (
|
||||
FlexibleArgumentParser,
|
||||
decorate_logs,
|
||||
set_process_title,
|
||||
)
|
||||
from vllm.utils import FlexibleArgumentParser
|
||||
from vllm.utils.network_utils import get_tcp_uri
|
||||
from vllm.utils.system_utils import decorate_logs, set_process_title
|
||||
from vllm.v1.engine.core import EngineCoreProc
|
||||
from vllm.v1.engine.utils import CoreEngineProcManager, launch_core_engines
|
||||
from vllm.v1.executor import Executor
|
||||
|
||||
@ -108,13 +108,9 @@ from vllm.entrypoints.utils import (
|
||||
from vllm.logger import init_logger
|
||||
from vllm.reasoning import ReasoningParserManager
|
||||
from vllm.usage.usage_lib import UsageContext
|
||||
from vllm.utils import (
|
||||
Device,
|
||||
FlexibleArgumentParser,
|
||||
decorate_logs,
|
||||
set_ulimit,
|
||||
)
|
||||
from vllm.utils import Device, FlexibleArgumentParser, set_ulimit
|
||||
from vllm.utils.network_utils import is_valid_ipv6_address
|
||||
from vllm.utils.system_utils import decorate_logs
|
||||
from vllm.v1.engine.exceptions import EngineDeadError
|
||||
from vllm.v1.metrics.prometheus import get_prometheus_registry
|
||||
from vllm.version import __version__ as VLLM_VERSION
|
||||
|
||||
vllm/envs.py (12 lines changed)
@@ -19,7 +19,7 @@ if TYPE_CHECKING:
    VLLM_NCCL_SO_PATH: str | None = None
    LD_LIBRARY_PATH: str | None = None
    VLLM_USE_TRITON_FLASH_ATTN: bool = True
-    VLLM_V1_USE_PREFILL_DECODE_ATTENTION: bool = False
+    VLLM_V1_USE_PREFILL_DECODE_ATTENTION: bool = True
    VLLM_FLASH_ATTN_VERSION: int | None = None
    LOCAL_RANK: int = 0
    CUDA_VISIBLE_DEVICES: str | None = None

@@ -99,13 +99,13 @@ if TYPE_CHECKING:
    VLLM_DISABLED_KERNELS: list[str] = []
    VLLM_DISABLE_PYNCCL: bool = False
    VLLM_USE_V1: bool = True
-    VLLM_ROCM_USE_AITER: bool = False
+    VLLM_ROCM_USE_AITER: bool = True
    VLLM_ROCM_USE_AITER_PAGED_ATTN: bool = False
    VLLM_ROCM_USE_AITER_LINEAR: bool = True
    VLLM_ROCM_USE_AITER_MOE: bool = True
    VLLM_ROCM_USE_AITER_RMSNORM: bool = True
    VLLM_ROCM_USE_AITER_MLA: bool = True
-    VLLM_ROCM_USE_AITER_MHA: bool = True
+    VLLM_ROCM_USE_AITER_MHA: bool = False
    VLLM_ROCM_USE_AITER_FP4_ASM_GEMM: bool = False
    VLLM_ROCM_USE_TRITON_ROPE: bool = False
    VLLM_ROCM_USE_AITER_FP8BMM: bool = True

@@ -485,7 +485,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
    # Use separate prefill and decode kernels for V1 attention instead of
    # the unified triton kernel.
    "VLLM_V1_USE_PREFILL_DECODE_ATTENTION": lambda: (
-        os.getenv("VLLM_V1_USE_PREFILL_DECODE_ATTENTION", "False").lower()
+        os.getenv("VLLM_V1_USE_PREFILL_DECODE_ATTENTION", "True").lower()
        in ("true", "1")
    ),
    # Force vllm to use a specific flash-attention version (2 or 3), only valid

@@ -832,7 +832,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
    # Disable aiter ops unless specifically enabled.
    # Acts as a parent switch to enable the rest of the other operations.
    "VLLM_ROCM_USE_AITER": lambda: (
-        os.getenv("VLLM_ROCM_USE_AITER", "False").lower() in ("true", "1")
+        os.getenv("VLLM_ROCM_USE_AITER", "True").lower() in ("true", "1")
    ),
    # Whether to use aiter paged attention.
    # By default is disabled.

@@ -862,7 +862,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
    # Whether to use aiter mha ops.
    # By default is enabled.
    "VLLM_ROCM_USE_AITER_MHA": lambda: (
-        os.getenv("VLLM_ROCM_USE_AITER_MHA", "True").lower() in ("true", "1")
+        os.getenv("VLLM_ROCM_USE_AITER_MHA", "False").lower() in ("true", "1")
    ),
    # Whether to use aiter fp4 gemm asm.
    # By default is disabled.
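The boolean environment flags above are parsed with the convention shown in the diff, `value.lower() in ("true", "1")`, so the flipped amd_dev defaults can still be switched off per process. A minimal standalone sketch of that convention (the `env_flag` helper name is ours, not vLLM's):

```python
import os

def env_flag(name: str, default: str) -> bool:
    # Same parsing convention as the lambdas in vllm/envs.py above.
    return os.getenv(name, default).lower() in ("true", "1")

os.environ["VLLM_ROCM_USE_AITER"] = "0"   # opt back out of the new default
print(env_flag("VLLM_ROCM_USE_AITER", "True"))                   # False
print(env_flag("VLLM_V1_USE_PREFILL_DECODE_ATTENTION", "True"))  # True (not overridden)
```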
vllm/model_executor/models/gemma3_mm.py (new file, 710 lines)
@ -0,0 +1,710 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||
import math
|
||||
from collections.abc import Iterable, Mapping, Sequence
|
||||
from typing import Annotated, Any, Literal
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
from transformers import BatchFeature, Gemma3Config, Gemma3Processor
|
||||
from transformers.models.gemma3.processing_gemma3 import Gemma3ProcessorKwargs
|
||||
|
||||
import vllm.envs as envs
|
||||
from vllm.config import VllmConfig
|
||||
from vllm.config.multimodal import BaseDummyOptions
|
||||
from vllm.logger import init_logger
|
||||
from vllm.model_executor.layers.layernorm import GemmaRMSNorm
|
||||
from vllm.model_executor.models.module_mapping import MultiModelKeys
|
||||
from vllm.multimodal import MULTIMODAL_REGISTRY
|
||||
from vllm.multimodal.inputs import (
|
||||
MultiModalDataDict,
|
||||
MultiModalFieldConfig,
|
||||
MultiModalKwargsItems,
|
||||
)
|
||||
from vllm.multimodal.parse import ImageProcessorItems, ImageSize, MultiModalDataItems
|
||||
from vllm.multimodal.processing import (
|
||||
BaseMultiModalProcessor,
|
||||
BaseProcessingInfo,
|
||||
MultiModalPromptUpdates,
|
||||
MultiModalPromptUpdatesApplyResult,
|
||||
PlaceholderFeaturesInfo,
|
||||
PromptReplacement,
|
||||
PromptUpdate,
|
||||
PromptUpdateDetails,
|
||||
replace_token_matches,
|
||||
)
|
||||
from vllm.multimodal.profiling import BaseDummyInputsBuilder
|
||||
from vllm.sequence import IntermediateTensors
|
||||
from vllm.utils.tensor_schema import TensorSchema, TensorShape
|
||||
|
||||
from .interfaces import (
|
||||
MultiModalEmbeddings,
|
||||
SupportsLoRA,
|
||||
SupportsMultiModal,
|
||||
SupportsPP,
|
||||
)
|
||||
from .siglip import SiglipVisionModel
|
||||
from .utils import (
|
||||
AutoWeightsLoader,
|
||||
WeightsMapper,
|
||||
init_vllm_registered_model,
|
||||
maybe_prefix,
|
||||
)
|
||||
|
||||
logger = init_logger(__name__)
|
||||
|
||||
|
||||
class Gemma3ImagePixelInputs(TensorSchema):
|
||||
"""
|
||||
Dimensions:
|
||||
- p: Number of patches total (over each image over each prompt in the
|
||||
batch)
|
||||
- c: Number of channels (3)
|
||||
- h: Height of each patch
|
||||
- w: Width of each patch
|
||||
- bn: Batch size * number of images
|
||||
"""
|
||||
|
||||
type: Literal["pixel_values"] = "pixel_values"
|
||||
|
||||
pixel_values: Annotated[torch.Tensor, TensorShape("p", 3, "h", "w")]
|
||||
|
||||
num_patches: Annotated[torch.Tensor, TensorShape("bn")]
|
||||
|
||||
|
||||
Gemma3ImageInputs = Gemma3ImagePixelInputs
|
||||
|
||||
|
||||
class Gemma3ProcessingInfo(BaseProcessingInfo):
|
||||
def get_hf_config(self):
|
||||
return self.ctx.get_hf_config(Gemma3Config)
|
||||
|
||||
def get_hf_processor(self, **kwargs: object):
|
||||
return self.ctx.get_hf_processor(Gemma3Processor, **kwargs)
|
||||
|
||||
def get_supported_mm_limits(self) -> Mapping[str, int | None]:
|
||||
return {"image": None}
|
||||
|
||||
def _resolve_image_kwargs(
|
||||
self,
|
||||
processor: Gemma3Processor,
|
||||
keys: set[str],
|
||||
) -> dict[str, Any]:
|
||||
image_processor = processor.image_processor
|
||||
kwargs = processor._merge_kwargs(
|
||||
Gemma3ProcessorKwargs,
|
||||
tokenizer_init_kwargs=processor.tokenizer.init_kwargs,
|
||||
)
|
||||
|
||||
images_kwargs = kwargs["images_kwargs"]
|
||||
|
||||
def _resolve_kw(key: str):
|
||||
val = getattr(image_processor, key)
|
||||
if val is None:
|
||||
val = images_kwargs[key]
|
||||
|
||||
return val
|
||||
|
||||
return {k: _resolve_kw(k) for k in keys}
|
||||
|
||||
def get_num_crops(
|
||||
self,
|
||||
*,
|
||||
image_width: int,
|
||||
image_height: int,
|
||||
processor: Gemma3Processor | None,
|
||||
) -> int:
|
||||
if processor is None:
|
||||
processor = self.get_hf_processor()
|
||||
|
||||
images_kwargs = self._resolve_image_kwargs(
|
||||
processor,
|
||||
{
|
||||
"do_pan_and_scan",
|
||||
"pan_and_scan_min_crop_size",
|
||||
"pan_and_scan_max_num_crops",
|
||||
"pan_and_scan_min_ratio_to_activate",
|
||||
},
|
||||
)
|
||||
|
||||
do_pan_and_scan = images_kwargs["do_pan_and_scan"]
|
||||
pan_and_scan_min_crop_size = images_kwargs["pan_and_scan_min_crop_size"]
|
||||
pan_and_scan_max_num_crops = images_kwargs["pan_and_scan_max_num_crops"]
|
||||
pan_and_scan_min_ratio_to_activate = images_kwargs[
|
||||
"pan_and_scan_min_ratio_to_activate"
|
||||
]
|
||||
|
||||
if not do_pan_and_scan:
|
||||
return 0
|
||||
|
||||
if envs.VLLM_USE_V1:
|
||||
logger.warning_once(
|
||||
"`do_pan_and_scan=True` has suboptimal results on V1 "
|
||||
"because of the simplified attention pattern being used."
|
||||
)
|
||||
|
||||
# Based on Gemma3ImageProcessor.pan_and_scan
|
||||
if image_width >= image_height:
|
||||
if image_width / image_height < pan_and_scan_min_ratio_to_activate:
|
||||
return 0
|
||||
|
||||
num_crops_w = min(
|
||||
int(math.floor(image_width / pan_and_scan_min_crop_size)),
|
||||
int(math.floor(image_width / image_height + 0.5)),
|
||||
)
|
||||
|
||||
num_crops_w = max(2, num_crops_w)
|
||||
num_crops_w = min(pan_and_scan_max_num_crops, num_crops_w)
|
||||
num_crops_h = 1
|
||||
else:
|
||||
if image_height / image_width < pan_and_scan_min_ratio_to_activate:
|
||||
return 0
|
||||
|
||||
num_crops_h = min(
|
||||
int(math.floor(image_height / pan_and_scan_min_crop_size)),
|
||||
int(math.floor(image_height / image_width + 0.5)),
|
||||
)
|
||||
|
||||
num_crops_h = max(2, num_crops_h)
|
||||
num_crops_h = min(pan_and_scan_max_num_crops, num_crops_h)
|
||||
num_crops_w = 1
|
||||
|
||||
crop_size_w = int(math.ceil(image_width / num_crops_w))
|
||||
crop_size_h = int(math.ceil(image_height / num_crops_h))
|
||||
|
||||
if min(crop_size_w, crop_size_h) < pan_and_scan_min_crop_size:
|
||||
return 0
|
||||
|
||||
return num_crops_w * num_crops_h
|
||||
|
||||
def get_image_repl(
|
||||
self,
|
||||
*,
|
||||
image_width: int,
|
||||
image_height: int,
|
||||
processor: Gemma3Processor | None,
|
||||
) -> PromptUpdateDetails[str]:
|
||||
if processor is None:
|
||||
processor = self.get_hf_processor()
|
||||
|
||||
boi_token = processor.boi_token
|
||||
|
||||
num_crops = self.get_num_crops(
|
||||
image_width=image_width,
|
||||
image_height=image_height,
|
||||
processor=processor,
|
||||
)
|
||||
|
||||
if num_crops == 0:
|
||||
image_text = boi_token
|
||||
else:
|
||||
crops_image_tokens = " ".join(boi_token for _ in range(num_crops))
|
||||
image_text = (
|
||||
f"Here is the original image {boi_token} and here are some "
|
||||
f"crops to help you see better {crops_image_tokens}"
|
||||
)
|
||||
|
||||
repl_full = image_text.replace(boi_token, processor.full_image_sequence)
|
||||
|
||||
tokenizer = processor.tokenizer
|
||||
vocab = tokenizer.get_vocab()
|
||||
image_token_id = vocab[tokenizer.image_token]
|
||||
|
||||
return PromptUpdateDetails.select_token_id(repl_full, image_token_id)
|
||||
|
||||
def get_num_image_tokens(
|
||||
self,
|
||||
*,
|
||||
image_width: int,
|
||||
image_height: int,
|
||||
processor: Gemma3Processor | None,
|
||||
) -> int:
|
||||
if processor is None:
|
||||
processor = self.get_hf_processor()
|
||||
|
||||
num_crops = self.get_num_crops(
|
||||
image_width=image_width,
|
||||
image_height=image_height,
|
||||
processor=processor,
|
||||
)
|
||||
image_seq_len = processor.image_seq_length
|
||||
|
||||
return (num_crops + 1) * image_seq_len
|
||||
|
||||
def get_image_size_with_most_features(self) -> ImageSize:
|
||||
processor = self.get_hf_processor()
|
||||
|
||||
images_kwargs = self._resolve_image_kwargs(
|
||||
processor, {"pan_and_scan_max_num_crops"}
|
||||
)
|
||||
max_num_crops = images_kwargs["pan_and_scan_max_num_crops"]
|
||||
|
||||
# Result in the max possible feature size (h:w = max_num_crops:1)
|
||||
return ImageSize(height=50 * max_num_crops, width=50)
|
||||
|
||||
|
||||
class Gemma3DummyInputsBuilder(BaseDummyInputsBuilder[Gemma3ProcessingInfo]):
|
||||
def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
|
||||
num_images = mm_counts.get("image", 0)
|
||||
|
||||
processor = self.info.get_hf_processor()
|
||||
image_token = processor.boi_token
|
||||
|
||||
return image_token * num_images
|
||||
|
||||
def get_dummy_mm_data(
|
||||
self,
|
||||
seq_len: int,
|
||||
mm_counts: Mapping[str, int],
|
||||
mm_options: Mapping[str, BaseDummyOptions] | None = None,
|
||||
) -> MultiModalDataDict:
|
||||
num_images = mm_counts.get("image", 0)
|
||||
|
||||
target_width, target_height = self.info.get_image_size_with_most_features()
|
||||
|
||||
image_overrides = mm_options.get("image") if mm_options else None
|
||||
|
||||
return {
|
||||
"image": self._get_dummy_images(
|
||||
width=target_width,
|
||||
height=target_height,
|
||||
num_images=num_images,
|
||||
overrides=image_overrides,
|
||||
)
|
||||
}
|
||||
|
||||
|
||||
class Gemma3MultiModalProcessor(BaseMultiModalProcessor[Gemma3ProcessingInfo]):
|
||||
def _call_hf_processor(
|
||||
self,
|
||||
prompt: str,
|
||||
mm_data: Mapping[str, object],
|
||||
mm_kwargs: Mapping[str, object],
|
||||
tok_kwargs: Mapping[str, object],
|
||||
) -> BatchFeature:
|
||||
processed_outputs = super()._call_hf_processor(
|
||||
prompt,
|
||||
mm_data,
|
||||
mm_kwargs,
|
||||
tok_kwargs,
|
||||
)
|
||||
|
||||
# HF processor pops the `num_crops` kwarg, which is needed by vLLM
|
||||
if (images := mm_data.get("images")) is not None:
|
||||
parsed_images = (
|
||||
self._get_data_parser()
|
||||
.parse_mm_data({"image": images})
|
||||
.get_items("image", ImageProcessorItems)
|
||||
)
|
||||
image_sizes = [
|
||||
parsed_images.get_image_size(i) for i in range(len(parsed_images))
|
||||
]
|
||||
hf_processor = self.info.get_hf_processor(**mm_kwargs)
|
||||
|
||||
num_crops = [
|
||||
self.info.get_num_crops(
|
||||
image_width=size.width,
|
||||
image_height=size.height,
|
||||
processor=hf_processor,
|
||||
)
|
||||
for size in image_sizes
|
||||
]
|
||||
processed_outputs["num_patches"] = torch.tensor(num_crops) + 1
|
||||
|
||||
return processed_outputs
|
||||
|
||||
def _get_mm_fields_config(
|
||||
self,
|
||||
hf_inputs: BatchFeature,
|
||||
hf_processor_mm_kwargs: Mapping[str, object],
|
||||
) -> Mapping[str, MultiModalFieldConfig]:
|
||||
num_patches = hf_inputs.get("num_patches", torch.empty(0))
|
||||
|
||||
return dict(
|
||||
pixel_values=MultiModalFieldConfig.flat_from_sizes("image", num_patches),
|
||||
num_patches=MultiModalFieldConfig.batched("image"),
|
||||
)
|
||||
|
||||
def _get_prompt_updates(
|
||||
self,
|
||||
mm_items: MultiModalDataItems,
|
||||
hf_processor_mm_kwargs: Mapping[str, Any],
|
||||
out_mm_kwargs: MultiModalKwargsItems,
|
||||
) -> Sequence[PromptUpdate]:
|
||||
hf_processor = self.info.get_hf_processor(**hf_processor_mm_kwargs)
|
||||
image_token = hf_processor.boi_token
|
||||
|
||||
def get_replacement_gemma3(item_idx: int):
|
||||
images = mm_items.get_items("image", ImageProcessorItems)
|
||||
|
||||
image_size = images.get_image_size(item_idx)
|
||||
return self.info.get_image_repl(
|
||||
image_width=image_size.width,
|
||||
image_height=image_size.height,
|
||||
processor=hf_processor,
|
||||
)
|
||||
|
||||
return [
|
||||
PromptReplacement(
|
||||
modality="image",
|
||||
target=image_token,
|
||||
replacement=get_replacement_gemma3,
|
||||
)
|
||||
]
|
||||
|
||||
def _apply_token_matches(
|
||||
self,
|
||||
prompt: list[int],
|
||||
mm_prompt_updates: MultiModalPromptUpdates,
|
||||
) -> tuple[list[int], MultiModalPromptUpdatesApplyResult]:
|
||||
token_ids, res = super()._apply_token_matches(prompt, mm_prompt_updates)
|
||||
|
||||
# "\n\n\n" and "\n\n\n\n" are single tokens
|
||||
# Since our replacement can insert "\n\n" next to "\n"
|
||||
# tokens, we have to combine them to be consistent with
|
||||
# the output of the tokenizer
|
||||
tokenizer = self.info.get_tokenizer()
|
||||
vocab = tokenizer.get_vocab()
|
||||
newline_1 = vocab["\n"]
|
||||
newline_2 = vocab["\n\n"]
|
||||
newline_3 = vocab["\n\n\n"]
|
||||
newline_4 = vocab["\n\n\n\n"]
|
||||
|
||||
token_ids = replace_token_matches(
|
||||
token_ids,
|
||||
[newline_1, newline_2],
|
||||
[newline_3],
|
||||
)
|
||||
token_ids = replace_token_matches(
|
||||
token_ids,
|
||||
[newline_2, newline_1],
|
||||
[newline_3],
|
||||
)
|
||||
token_ids = replace_token_matches(
|
||||
token_ids,
|
||||
[newline_2, newline_2],
|
||||
[newline_4],
|
||||
)
|
||||
|
||||
return token_ids, res
|
||||
|
||||
def _find_mm_placeholders(
|
||||
self,
|
||||
new_token_ids: list[int],
|
||||
mm_prompt_updates: MultiModalPromptUpdates,
|
||||
) -> Mapping[str, list[PlaceholderFeaturesInfo]]:
|
||||
# We need to detect "\n\n" inside "\n\n\n" and "\n\n\n\n"
|
||||
tokenizer = self.info.get_tokenizer()
|
||||
vocab = tokenizer.get_vocab()
|
||||
newline_1 = vocab["\n"]
|
||||
newline_2 = vocab["\n\n"]
|
||||
newline_3 = vocab["\n\n\n"]
|
||||
newline_4 = vocab["\n\n\n\n"]
|
||||
|
||||
def get_repl_toks(tok: int) -> list[int]:
|
||||
if tok == newline_3:
|
||||
return [newline_1, newline_2]
|
||||
if tok == newline_4:
|
||||
return [newline_2, newline_2]
|
||||
|
||||
return [tok]
|
||||
|
||||
repl_token_ids = list[int]()
|
||||
repl_orig_idxs = list[int]()
|
||||
for orig_idx, orig_tok in enumerate(new_token_ids):
|
||||
repl_toks = get_repl_toks(orig_tok)
|
||||
repl_token_ids.extend(repl_toks)
|
||||
repl_orig_idxs.extend(orig_idx for _ in range(len(repl_toks)))
|
||||
|
||||
repls = super()._find_mm_placeholders(repl_token_ids, mm_prompt_updates)
|
||||
|
||||
return {
|
||||
modality: [
|
||||
PlaceholderFeaturesInfo(
|
||||
modality=p.modality,
|
||||
item_idx=p.item_idx,
|
||||
start_idx=repl_orig_idxs[p.start_idx],
|
||||
tokens=p.tokens,
|
||||
is_embed=p.is_embed,
|
||||
)
|
||||
for p in placeholders
|
||||
]
|
||||
for modality, placeholders in repls.items()
|
||||
}
|
||||
|
||||
|
||||
class Gemma3MultiModalProjector(nn.Module):
|
||||
def __init__(self, config: Gemma3Config):
|
||||
super().__init__()
|
||||
|
||||
self.mm_input_projection_weight = nn.Parameter(
|
||||
torch.zeros(
|
||||
config.vision_config.hidden_size, config.text_config.hidden_size
|
||||
)
|
||||
)
|
||||
|
||||
self.mm_soft_emb_norm = GemmaRMSNorm(
|
||||
config.vision_config.hidden_size, eps=config.vision_config.layer_norm_eps
|
||||
)
|
||||
|
||||
self.patches_per_image = int(
|
||||
config.vision_config.image_size // config.vision_config.patch_size
|
||||
)
|
||||
self.tokens_per_side = int(config.mm_tokens_per_image**0.5)
|
||||
self.kernel_size = self.patches_per_image // self.tokens_per_side
|
||||
self.avg_pool = nn.AvgPool2d(
|
||||
kernel_size=self.kernel_size, stride=self.kernel_size
|
||||
)
|
||||
|
||||
def forward(self, vision_outputs: torch.Tensor):
|
||||
batch_size, _, seq_length = vision_outputs.shape
|
||||
|
||||
reshaped_vision_outputs = vision_outputs.transpose(1, 2)
|
||||
reshaped_vision_outputs = reshaped_vision_outputs.reshape(
|
||||
batch_size, seq_length, self.patches_per_image, self.patches_per_image
|
||||
)
|
||||
reshaped_vision_outputs = reshaped_vision_outputs.contiguous()
|
||||
|
||||
pooled_vision_outputs = self.avg_pool(reshaped_vision_outputs)
|
||||
pooled_vision_outputs = pooled_vision_outputs.flatten(2)
|
||||
pooled_vision_outputs = pooled_vision_outputs.transpose(1, 2)
|
||||
|
||||
normed_vision_outputs = self.mm_soft_emb_norm(pooled_vision_outputs)
|
||||
|
||||
projected_vision_outputs = torch.matmul(
|
||||
normed_vision_outputs, self.mm_input_projection_weight
|
||||
)
|
||||
return projected_vision_outputs.type_as(vision_outputs)
|
||||
|
||||
|
||||
@MULTIMODAL_REGISTRY.register_processor(
|
||||
Gemma3MultiModalProcessor,
|
||||
info=Gemma3ProcessingInfo,
|
||||
dummy_inputs=Gemma3DummyInputsBuilder,
|
||||
)
|
||||
class Gemma3ForConditionalGeneration(
|
||||
nn.Module, SupportsMultiModal, SupportsPP, SupportsLoRA
|
||||
):
|
||||
merge_by_field_config = True
|
||||
|
||||
packed_modules_mapping = {
|
||||
"qkv_proj": [
|
||||
"q_proj",
|
||||
"k_proj",
|
||||
"v_proj",
|
||||
],
|
||||
"gate_up_proj": [
|
||||
"gate_proj",
|
||||
"up_proj",
|
||||
],
|
||||
}
|
||||
|
||||
hf_to_vllm_mapper = WeightsMapper(
|
||||
orig_to_new_prefix={
|
||||
# mapping for new names in checkpoint saved after transformers v4.52
|
||||
"model.language_model.": "language_model.model.",
|
||||
"model.vision_tower.": "vision_tower.",
|
||||
"model.multi_modal_projector.": "multi_modal_projector.",
|
||||
"lm_head.": "language_model.lm_head.",
|
||||
}
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def get_placeholder_str(cls, modality: str, i: int) -> str | None:
|
||||
if modality.startswith("image"):
|
||||
return "<start_of_image>"
|
||||
|
||||
raise ValueError("Only image modality is supported")
|
||||
|
||||
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
|
||||
super().__init__()
|
||||
config = vllm_config.model_config.hf_config
|
||||
quant_config = vllm_config.quant_config
|
||||
multimodal_config = vllm_config.model_config.multimodal_config
|
||||
self.config = config
|
||||
self.quant_config = quant_config
|
||||
self.multimodal_config = multimodal_config
|
||||
|
||||
self.vision_tower = SiglipVisionModel(
|
||||
config.vision_config,
|
||||
quant_config,
|
||||
prefix=maybe_prefix(prefix, "vision_tower"),
|
||||
)
|
||||
self.multi_modal_projector = Gemma3MultiModalProjector(config)
|
||||
|
||||
self.language_model = init_vllm_registered_model(
|
||||
vllm_config=vllm_config,
|
||||
hf_config=config.text_config,
|
||||
prefix=maybe_prefix(prefix, "language_model"),
|
||||
architectures=["Gemma3ForCausalLM"],
|
||||
)
|
||||
logit_scale = getattr(config, "logit_scale", 1.0)
|
||||
|
||||
if hasattr(self.language_model, "logits_processor"):
|
||||
# The logits processor can be unset if we're using
|
||||
# automatic conversion to pooling model.
|
||||
self.language_model.logits_processor.scale *= logit_scale
|
||||
|
||||
self.make_empty_intermediate_tensors = (
|
||||
self.language_model.make_empty_intermediate_tensors
|
||||
)
|
||||
|
||||
@property
|
||||
def dtype(self):
|
||||
return next(self.parameters()).dtype
|
||||
|
||||
def _parse_and_validate_image_input(
|
||||
self, **kwargs: object
|
||||
) -> Gemma3ImageInputs | None:
|
||||
pixel_values = kwargs.pop("pixel_values", None)
|
||||
num_patches = kwargs.pop("num_patches", None)
|
||||
image_embeds = kwargs.pop("image_embeds", None)
|
||||
assert image_embeds is None, "Gemma3 does not support image_embeds."
|
||||
if pixel_values is None:
|
||||
return None
|
||||
|
||||
image_size = self.config.vision_config.image_size
|
||||
|
||||
return Gemma3ImagePixelInputs(
|
||||
pixel_values=pixel_values,
|
||||
num_patches=num_patches,
|
||||
resolve_bindings={"h": image_size, "w": image_size},
|
||||
)
|
||||
|
||||
def _image_pixels_to_features(
|
||||
self,
|
||||
vision_tower: SiglipVisionModel,
|
||||
pixel_values: torch.Tensor,
|
||||
) -> torch.Tensor:
|
||||
return vision_tower(pixel_values)
|
||||
|
||||
def _process_image_input(
|
||||
self,
|
||||
image_input: Gemma3ImageInputs,
|
||||
) -> list[torch.Tensor]:
|
||||
assert self.vision_tower is not None
|
||||
|
||||
pixel_values = image_input["pixel_values"]
|
||||
        num_patches = image_input["num_patches"]

        image_features = self._image_pixels_to_features(
            self.vision_tower,
            pixel_values,
        )
        image_embeds = self.multi_modal_projector(image_features)

        return [e.flatten(0, 1) for e in image_embeds.split(num_patches.tolist())]

    def get_language_model(self) -> torch.nn.Module:
        return self.language_model

    def get_multimodal_embeddings(self, **kwargs: object) -> MultiModalEmbeddings:
        image_input = self._parse_and_validate_image_input(**kwargs)
        if image_input is None:
            return []

        return self._process_image_input(image_input)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs: object,
    ) -> IntermediateTensors:
        if intermediate_tensors is not None:
            inputs_embeds = None

        hidden_states = self.language_model.model(
            input_ids,
            positions,
            intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **kwargs,
        )

        return hidden_states

    def prepare_attn_masks(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        mask_dtype: torch.dtype,
        **kwargs,
    ):
        kwargs["has_images"] = True
        # NOTE(woosuk): Here, we distinguish the sequences by the position id 0.
        # This is a HACK. Fix this.
        start_indices = (positions == 0).cpu().nonzero()
        num_seqs = len(start_indices)
        seq_lens = []
        for i in range(num_seqs):
            start_idx = start_indices[i].item()
            if i < num_seqs - 1:
                end_idx = start_indices[i + 1].item()
            else:
                end_idx = len(input_ids)
            seq_lens.append(end_idx - start_idx)
        kwargs["seq_lens"] = seq_lens

        global_attn_masks = []
        local_attn_masks = []
        start_idx = 0
        for seq_len in seq_lens:
            end_idx = start_idx + seq_len
            input_token_ids = input_ids[start_idx:end_idx]
            start_idx = end_idx
            # Create a global causal mask.
            global_attn_mask = torch.empty(
                1,
                1,
                seq_len,
                seq_len,
                dtype=mask_dtype,
                device=input_ids.device,
            )
            global_attn_mask.fill_(float("-inf"))
            # Fill the lower triangle with 0.
            global_attn_mask = global_attn_mask.triu(diagonal=1)

            # Consider the bidirectional attention between image tokens.
            img_mask = torch.zeros_like(global_attn_mask)
            img_pos = input_token_ids == self.config.image_token_index
            img_mask[:, :, :, img_pos] += 1
            img_mask[:, :, img_pos, :] += 1
            global_attn_mask = torch.where(img_mask == 2, 0, global_attn_mask)
            global_attn_masks.append(global_attn_mask)

            sliding_window = self.config.text_config.sliding_window
            if sliding_window is not None:
                # Create a local causal mask with sliding window (1024).
                local_attn_mask = torch.ones_like(global_attn_mask)
                local_attn_mask = torch.tril(local_attn_mask, diagonal=-sliding_window)
                local_attn_mask = torch.where(
                    local_attn_mask == 0, global_attn_mask, float("-inf")
                )
                local_attn_masks.append(local_attn_mask)
        kwargs["global_attn_masks"] = global_attn_masks
        kwargs["local_attn_masks"] = local_attn_masks
        return kwargs

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor | None:
        return self.language_model.compute_logits(hidden_states)

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

    def get_mm_mapping(self) -> MultiModelKeys:
        """
        Get the module prefix in multimodal models
        """
        return MultiModelKeys.from_string_field(
            language_model="language_model",
            connector="multi_modal_projector",
            tower_model="vision_tower",
        )
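Illustrative sketch (not part of the diff): what `prepare_attn_masks` above computes for a toy single sequence, assuming a hypothetical image token id of 99. Image tokens attend to each other bidirectionally while text tokens stay causal.

```python
import torch

# Toy sequence: <bos>, <image>, <image>, text; 99 is an assumed image token id.
input_token_ids = torch.tensor([1, 99, 99, 5])
seq_len = input_token_ids.numel()

# Causal mask: strictly-upper triangle is -inf, lower triangle and diagonal are 0.
global_attn_mask = torch.empty(1, 1, seq_len, seq_len)
global_attn_mask.fill_(float("-inf"))
global_attn_mask = global_attn_mask.triu(diagonal=1)

# Unmask pairs where both query and key are image tokens (bidirectional attention).
img_pos = input_token_ids == 99
img_mask = torch.zeros_like(global_attn_mask)
img_mask[:, :, :, img_pos] += 1
img_mask[:, :, img_pos, :] += 1
global_attn_mask = torch.where(img_mask == 2, 0, global_attn_mask)

print(global_attn_mask[0, 0])  # entry [1, 2] is now 0 instead of -inf
```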
vllm/model_executor/models/paligemma.py (new file, 412 lines)
@ -0,0 +1,412 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from collections.abc import Iterable, Mapping, Sequence
from typing import Annotated, Literal, TypeAlias

import torch
from torch import nn
from transformers import BatchFeature, PaliGemmaConfig

from vllm.config import VllmConfig
from vllm.config.multimodal import BaseDummyOptions
from vllm.logger import init_logger
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.inputs import (
    MultiModalDataDict,
    MultiModalFieldConfig,
    MultiModalInputs,
    MultiModalKwargsItems,
    MultiModalUUIDDict,
)
from vllm.multimodal.parse import (
    ImageEmbeddingItems,
    ImageProcessorItems,
    MultiModalDataItems,
)
from vllm.multimodal.processing import (
    BaseMultiModalProcessor,
    BaseProcessingInfo,
    PromptIndexTargets,
    PromptInsertion,
    PromptUpdate,
    PromptUpdateDetails,
)
from vllm.multimodal.profiling import BaseDummyInputsBuilder
from vllm.sequence import IntermediateTensors
from vllm.utils.tensor_schema import TensorSchema, TensorShape

from .interfaces import MultiModalEmbeddings, SupportsMultiModal, SupportsPP
from .siglip import SiglipVisionModel
from .utils import (
    AutoWeightsLoader,
    WeightsMapper,
    flatten_bn,
    init_vllm_registered_model,
    maybe_prefix,
)
from .vision import get_vision_encoder_info

logger = init_logger(__name__)


class PaliGemmaImagePixelInputs(TensorSchema):
    """
    Dimensions:
        - bn: Batch size * number of images
        - c: Number of channels (3)
        - h: Height
        - w: Width
    """

    type: Literal["pixel_values"] = "pixel_values"
    data: Annotated[torch.Tensor, TensorShape("bn", 3, "h", "w")]


class PaliGemmaImageEmbeddingInputs(TensorSchema):
    """
    Dimensions:
        - bn: Batch size * number of images
        - ifs: Image feature size
        - hs: Hidden size (must match language model backbone)
    """

    type: Literal["image_embeds"] = "image_embeds"
    data: Annotated[torch.Tensor, TensorShape("bn", "ifs", "hs")]


PaliGemmaImageInputs: TypeAlias = (
    PaliGemmaImagePixelInputs | PaliGemmaImageEmbeddingInputs
)


class PaliGemmaMultiModalProjector(nn.Module):
    def __init__(self, vision_hidden_size: int, projection_dim: int):
        super().__init__()

        self.linear = nn.Linear(vision_hidden_size, projection_dim, bias=True)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        hidden_states = self.linear(image_features)
        return hidden_states


class PaliGemmaProcessingInfo(BaseProcessingInfo):
    def get_hf_config(self):
        return self.ctx.get_hf_config(PaliGemmaConfig)

    def get_vision_encoder_info(self):
        return get_vision_encoder_info(self.get_hf_config())

    def get_supported_mm_limits(self) -> Mapping[str, int | None]:
        return {"image": 1}

    def get_num_image_tokens(
        self,
        *,
        image_width: int,
        image_height: int,
    ) -> int:
        vision_encoder_info = self.get_vision_encoder_info()

        return vision_encoder_info.get_num_image_tokens(
            image_width=image_width,
            image_height=image_height,
        )


class PaliGemmaDummyInputsBuilder(BaseDummyInputsBuilder[PaliGemmaProcessingInfo]):
    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        return ""

    def get_dummy_mm_data(
        self,
        seq_len: int,
        mm_counts: Mapping[str, int],
        mm_options: Mapping[str, BaseDummyOptions] | None = None,
    ) -> MultiModalDataDict:
        hf_config = self.info.get_hf_config()
        vision_config = hf_config.vision_config
        max_image_size = vision_config.image_size

        num_images = mm_counts.get("image", 0)

        image_overrides = mm_options.get("image") if mm_options else None

        return {
            "image": self._get_dummy_images(
                width=max_image_size,
                height=max_image_size,
                num_images=num_images,
                overrides=image_overrides,
            )
        }


class PaliGemmaMultiModalProcessor(BaseMultiModalProcessor[PaliGemmaProcessingInfo]):
    def _call_hf_processor(
        self,
        prompt: str,
        mm_data: Mapping[str, object],
        mm_kwargs: Mapping[str, object],
        tok_kwargs: Mapping[str, object],
    ) -> BatchFeature:
        tokenizer = self.info.get_tokenizer()
        if not mm_data:
            prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
            return BatchFeature(dict(input_ids=[prompt_ids]), tensor_type="pt")

        return super()._call_hf_processor(
            prompt=prompt,
            mm_data=mm_data,
            mm_kwargs=mm_kwargs,
            tok_kwargs=tok_kwargs,
        )

    def _get_mm_fields_config(
        self,
        hf_inputs: BatchFeature,
        hf_processor_mm_kwargs: Mapping[str, object],
    ) -> Mapping[str, MultiModalFieldConfig]:
        return dict(pixel_values=MultiModalFieldConfig.batched("image"))

    def _get_prompt_updates(
        self,
        mm_items: MultiModalDataItems,
        hf_processor_mm_kwargs: Mapping[str, object],
        out_mm_kwargs: MultiModalKwargsItems,
    ) -> Sequence[PromptUpdate]:
        hf_config = self.info.get_hf_config()
        image_token_id = hf_config.image_token_index

        tokenizer = self.info.get_tokenizer()

        bos_token_id = tokenizer.bos_token_id
        assert isinstance(bos_token_id, int)

        def get_insertion(item_idx: int):
            images = mm_items.get_items(
                "image", (ImageEmbeddingItems, ImageProcessorItems)
            )

            if isinstance(images, ImageEmbeddingItems):
                num_image_tokens = images.get_feature_size(item_idx)
            else:
                image_size = images.get_image_size(item_idx)
                num_image_tokens = self.info.get_num_image_tokens(
                    image_width=image_size.width,
                    image_height=image_size.height,
                )

            image_tokens = [image_token_id] * num_image_tokens

            return PromptUpdateDetails.select_token_id(
                image_tokens + [bos_token_id],
                embed_token_id=image_token_id,
            )

        # Paligemma 1 and 2 have different tokenizer.add_bos_token
        # Insert <image>*n + <bos> after <bos> for Paligemma 1
        # Insert <image>*n + <bos> for Paligemma 2
        return [
            PromptInsertion(
                modality="image",
                target=PromptIndexTargets.prefix(
                    [bos_token_id] if tokenizer.add_bos_token else []
                ),
                insertion=get_insertion,
            )
        ]

    def apply(
        self,
        prompt: str | list[int],
        mm_data: MultiModalDataDict,
        hf_processor_mm_kwargs: Mapping[str, object],
        tokenization_kwargs: Mapping[str, object] | None = None,
        mm_uuids: MultiModalUUIDDict | None = None,
    ) -> MultiModalInputs:
        mm_inputs = super().apply(
            prompt,
            mm_data,
            hf_processor_mm_kwargs,
            tokenization_kwargs,
            mm_uuids=mm_uuids,
        )
        prompt_token_ids = mm_inputs["prompt_token_ids"]

        tokenizer = self.info.get_tokenizer()
        newline_prompt = "\n"
        newline_token_id = tokenizer.encode(newline_prompt)[-1]  # 108
        # Force to add newline at the end of prompt for paligemma's format
        # This step can NOT be replacemented by current PromptUpdate methods
        if len(prompt_token_ids) and prompt_token_ids[-1] != newline_token_id:
            prompt_token_ids.append(newline_token_id)
            mm_inputs["prompt_token_ids"] = prompt_token_ids

        return mm_inputs


@MULTIMODAL_REGISTRY.register_processor(
    PaliGemmaMultiModalProcessor,
    info=PaliGemmaProcessingInfo,
    dummy_inputs=PaliGemmaDummyInputsBuilder,
)
class PaliGemmaForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP):
    packed_modules_mapping = {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ],
        "gate_up_proj": [
            "gate_proj",
            "up_proj",
        ],
    }

    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_prefix={
            # mapping for new names in checkpoint saved after transformers v4.52
            "model.language_model.": "language_model.model.",
            "model.vision_tower.": "vision_tower.",
            "model.multi_modal_projector.": "multi_modal_projector.",
            "lm_head.": "language_model.lm_head.",
        }
    )

    @classmethod
    def get_placeholder_str(cls, modality: str, i: int) -> str | None:
        if modality.startswith("image"):
            return None

        raise ValueError("Only image modality is supported")

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        quant_config = vllm_config.quant_config
        multimodal_config = vllm_config.model_config.multimodal_config
        self.config = config
        self.multimodal_config = multimodal_config

        self.vision_tower = SiglipVisionModel(
            config.vision_config,
            quant_config,
            prefix=maybe_prefix(prefix, "vision_tower"),
        )
        self.multi_modal_projector = PaliGemmaMultiModalProjector(
            vision_hidden_size=config.vision_config.hidden_size,
            projection_dim=config.vision_config.projection_dim,
        )

        self.quant_config = quant_config

        if config.text_config.model_type == "gemma":
            config.text_config.architectures = ["GemmaForCausalLM"]
        else:
            config.text_config.architectures = ["Gemma2ForCausalLM"]
        self.language_model = init_vllm_registered_model(
            vllm_config=vllm_config,
            hf_config=config.text_config,
            prefix=maybe_prefix(prefix, "language_model"),
        )
        logit_scale = getattr(config, "logit_scale", 1.0)
        self.language_model.logits_processor.scale *= logit_scale

        self.make_empty_intermediate_tensors = (
            self.language_model.make_empty_intermediate_tensors
        )

    def _parse_and_validate_image_input(
        self, **kwargs: object
    ) -> PaliGemmaImageInputs | None:
        pixel_values = kwargs.pop("pixel_values", None)
        image_embeds = kwargs.pop("image_embeds", None)

        if pixel_values is None and image_embeds is None:
            return None

        if pixel_values is not None:
            pixel_values = flatten_bn(pixel_values, concat=True)

            h = w = self.config.vision_config.image_size
            return PaliGemmaImagePixelInputs(
                type="pixel_values",
                data=pixel_values,
                resolve_bindings={"h": h, "w": w},
            )

        if image_embeds is not None:
            image_embeds = flatten_bn(image_embeds, concat=True)

            return PaliGemmaImageEmbeddingInputs(
                type="image_embeds",
                data=image_embeds,
            )

        raise AssertionError("This line should be unreachable.")

    def _image_pixels_to_features(
        self,
        vision_tower: SiglipVisionModel,
        pixel_values: torch.Tensor,
    ) -> torch.Tensor:
        target_dtype = vision_tower.get_input_embeddings().weight.dtype
        image_features = vision_tower(pixel_values.to(dtype=target_dtype))

        return image_features

    def _process_image_input(
        self,
        image_input: PaliGemmaImageInputs,
    ) -> torch.Tensor:
        if image_input["type"] == "image_embeds":
            return image_input["data"]

        assert self.vision_tower is not None
        pixel_values = image_input["data"]
        image_features = self._image_pixels_to_features(
            self.vision_tower,
            pixel_values,
        )

        return self.multi_modal_projector(image_features)

    def get_language_model(self) -> torch.nn.Module:
        return self.language_model

    def get_multimodal_embeddings(self, **kwargs: object) -> MultiModalEmbeddings:
        image_input = self._parse_and_validate_image_input(**kwargs)
        if image_input is None:
            return []
        vision_embeddings = self._process_image_input(image_input)
        # https://github.com/huggingface/transformers/blob/main/src/transformers/models/paligemma/modeling_paligemma.py#L294 # noqa
        vision_embeddings = vision_embeddings * (self.config.hidden_size**-0.5)
        return vision_embeddings

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        intermediate_tensors: IntermediateTensors | None = None,
        inputs_embeds: torch.Tensor | None = None,
        **kwargs: object,
    ) -> IntermediateTensors:
        if intermediate_tensors is not None:
            inputs_embeds = None

        hidden_states = self.language_model.model(
            input_ids, positions, intermediate_tensors, inputs_embeds=inputs_embeds
        )

        return hidden_states

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor | None:
        return self.language_model.compute_logits(hidden_states)

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
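A minimal offline-inference sketch (not part of the diff) of how the PaliGemma model registered above would be exercised; the checkpoint name, image path, and prompt are assumptions. The processor inserts the image tokens, `<bos>`, and the trailing newline automatically.

```python
from PIL import Image

from vllm import LLM, SamplingParams

llm = LLM(model="google/paligemma-3b-mix-224")  # assumed checkpoint

outputs = llm.generate(
    {
        "prompt": "caption en",
        "multi_modal_data": {"image": Image.open("demo.jpg")},  # assumed local image
    },
    SamplingParams(max_tokens=32, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```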
@ -265,6 +265,7 @@ _MULTIMODAL_MODELS = {
        "Ernie4_5_VLMoeForConditionalGeneration",
    ),
    "FuyuForCausalLM": ("fuyu", "FuyuForCausalLM"),
    "Gemma3ForConditionalGeneration": ("gemma3_mm", "Gemma3ForConditionalGeneration"),  # noqa: E501
    "Gemma3nForConditionalGeneration": (
        "gemma3n_mm",
        "Gemma3nForConditionalGeneration",
@ -334,6 +335,10 @@ _MULTIMODAL_MODELS = {
    "NVLM_D": ("nvlm_d", "NVLM_D_Model"),
    "Ovis": ("ovis", "Ovis"),
    "Ovis2_5": ("ovis2_5", "Ovis2_5"),
    "PaliGemmaForConditionalGeneration": (
        "paligemma",
        "PaliGemmaForConditionalGeneration",
    ),
    "Phi3VForCausalLM": ("phi3v", "Phi3VForCausalLM"),
    "Phi4MMForCausalLM": ("phi4mm", "Phi4MMForCausalLM"),
    "Phi4MultimodalForCausalLM": ("phi4_multimodal", "Phi4MultimodalForCausalLM"),  # noqa: E501
@ -406,14 +411,6 @@ _TRANSFORMERS_SUPPORTED_MODELS = {
        "transformers",
        "TransformersMultiModalForCausalLM",
    ),
    "Gemma3ForConditionalGeneration": (
        "transformers",
        "TransformersMultiModalForCausalLM",
    ),
    "PaliGemmaForConditionalGeneration": (
        "transformers",
        "TransformersMultiModalForCausalLM",
    ),
}

_TRANSFORMERS_BACKEND_MODELS = {

@ -59,6 +59,9 @@ _ROCM_PARTIALLY_SUPPORTED_MODELS: dict[str, str] = {
    "Qwen2ForCausalLM": _ROCM_SWA_REASON,
    "MistralForCausalLM": _ROCM_SWA_REASON,
    "MixtralForCausalLM": _ROCM_SWA_REASON,
    "PaliGemmaForConditionalGeneration": (
        "ROCm flash attention does not yet fully support 32-bit precision on PaliGemma"
    ),
    "Phi3VForCausalLM": (
        "ROCm Triton flash attention may run into compilation errors due to "
        "excessive use of shared memory. If this happens, disable Triton FA "
@ -5,6 +5,7 @@ import contextlib
import datetime
import enum
import getpass
import importlib.util
import inspect
import json
import multiprocessing
@ -33,13 +34,11 @@ from collections.abc import (
)
from concurrent.futures.process import ProcessPoolExecutor
from functools import cache, partial, wraps
from pathlib import Path
from typing import TYPE_CHECKING, Any, TextIO, TypeVar
from typing import TYPE_CHECKING, Any, TypeVar

import cloudpickle
import psutil
import regex as re
import setproctitle
import torch
import yaml

@ -144,18 +143,6 @@ def random_uuid() -> str:
    return str(uuid.uuid4().hex)


def update_environment_variables(envs: dict[str, str]):
    for k, v in envs.items():
        if k in os.environ and os.environ[k] != v:
            logger.warning(
                "Overwriting environment variable %s from '%s' to '%s'",
                k,
                os.environ[k],
                v,
            )
        os.environ[k] = v


def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(a // -b)
@ -1061,70 +1048,44 @@ def check_use_alibi(model_config: ModelConfig) -> bool:
    )


def set_process_title(
    name: str, suffix: str = "", prefix: str = envs.VLLM_PROCESS_NAME_PREFIX
) -> None:
@cache
def _has_module(module_name: str) -> bool:
    """Return True if *module_name* can be found in the current environment.

    The result is cached so that subsequent queries for the same module incur
    no additional overhead.
    """
    Set the current process title to a specific name with an
    optional suffix.

    Args:
        name: The title to assign to the current process.
        suffix: An optional suffix to append to the base name.
        prefix: A prefix to prepend to the front separated by `::`.
    """
    if suffix:
        name = f"{name}_{suffix}"
    setproctitle.setproctitle(f"{prefix}::{name}")
    return importlib.util.find_spec(module_name) is not None


def _add_prefix(file: TextIO, worker_name: str, pid: int) -> None:
    """Prepend each output line with process-specific prefix"""
def has_pplx() -> bool:
    """Whether the optional `pplx_kernels` package is available."""

    prefix = f"{CYAN}({worker_name} pid={pid}){RESET} "
    file_write = file.write

    def write_with_prefix(s: str):
        if not s:
            return
        if file.start_new_line: # type: ignore[attr-defined]
            file_write(prefix)
        idx = 0
        while (next_idx := s.find("\n", idx)) != -1:
            next_idx += 1
            file_write(s[idx:next_idx])
            if next_idx == len(s):
                file.start_new_line = True # type: ignore[attr-defined]
                return
            file_write(prefix)
            idx = next_idx
        file_write(s[idx:])
        file.start_new_line = False # type: ignore[attr-defined]

    file.start_new_line = True # type: ignore[attr-defined]
    file.write = write_with_prefix # type: ignore[method-assign]
    return _has_module("pplx_kernels")


def decorate_logs(process_name: str | None = None) -> None:
    """
    Adds a process-specific prefix to each line of output written to stdout and
    stderr.
def has_deep_ep() -> bool:
    """Whether the optional `deep_ep` package is available."""

    This function is intended to be called before initializing the api_server,
    engine_core, or worker classes, so that all subsequent output from the
    process is prefixed with the process name and PID. This helps distinguish
    log output from different processes in multi-process environments.
    return _has_module("deep_ep")

    Args:
        process_name: Optional; the name of the process to use in the prefix.
            If not provided, the current process name from the multiprocessing
            context is used.
    """
    if process_name is None:
        process_name = get_mp_context().current_process().name
    pid = os.getpid()
    _add_prefix(sys.stdout, process_name, pid)
    _add_prefix(sys.stderr, process_name, pid)

def has_deep_gemm() -> bool:
    """Whether the optional `deep_gemm` package is available."""

    return _has_module("deep_gemm")


def has_triton_kernels() -> bool:
    """Whether the optional `triton_kernels` package is available."""

    return _has_module("triton_kernels")


def has_tilelang() -> bool:
    """Whether the optional `tilelang` package is available."""

    return _has_module("tilelang")


def length_from_prompt_token_ids_or_embeds(
@ -1149,36 +1110,3 @@ def length_from_prompt_token_ids_or_embeds(
        f" prompt_embeds={prompt_embeds_len}"
    )
    return prompt_token_len


@contextlib.contextmanager
def set_env_var(key, value):
    old = os.environ.get(key)
    os.environ[key] = value
    try:
        yield
    finally:
        if old is None:
            del os.environ[key]
        else:
            os.environ[key] = old


def unique_filepath(fn: Callable[[int], Path]) -> Path:
    """
    unique_filepath returns a unique path by trying
    to include an integer in increasing order.

    fn should be a callable that returns a path that
    includes the passed int at a fixed location.

    Note: This function has a TOCTOU race condition.
    Caller should use atomic operations (e.g., open with 'x' mode)
    when creating the file to ensure thread safety.
    """
    i = 0
    while True:
        p = fn(i)
        if not p.exists():
            return p
        i += 1
vllm/utils/system_utils.py (new file, 123 lines)
@ -0,0 +1,123 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from __future__ import annotations

import contextlib
import os
import sys
from collections.abc import Callable, Iterator
from pathlib import Path
from typing import TextIO

try:
    import setproctitle
except ImportError:
    setproctitle = None  # type: ignore[assignment]

import vllm.envs as envs
from vllm.logger import init_logger

logger = init_logger(__name__)

CYAN = "\033[1;36m"
RESET = "\033[0;0m"


# Environment variable utilities


def update_environment_variables(envs_dict: dict[str, str]):
    """Update multiple environment variables with logging."""
    for k, v in envs_dict.items():
        if k in os.environ and os.environ[k] != v:
            logger.warning(
                "Overwriting environment variable %s from '%s' to '%s'",
                k,
                os.environ[k],
                v,
            )
        os.environ[k] = v


@contextlib.contextmanager
def set_env_var(key: str, value: str) -> Iterator[None]:
    """Temporarily set an environment variable."""
    old = os.environ.get(key)
    os.environ[key] = value
    try:
        yield
    finally:
        if old is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = old


# File path utilities


def unique_filepath(fn: Callable[[int], Path]) -> Path:
    """Generate a unique file path by trying incrementing integers.

    Note: This function has a TOCTOU race condition.
    Caller should use atomic operations (e.g., open with 'x' mode)
    when creating the file to ensure thread safety.
    """
    i = 0
    while True:
        p = fn(i)
        if not p.exists():
            return p
        i += 1


# Process management utilities


def set_process_title(
    name: str, suffix: str = "", prefix: str = envs.VLLM_PROCESS_NAME_PREFIX
) -> None:
    """Set the current process title with optional suffix."""
    if setproctitle is None:
        return
    if suffix:
        name = f"{name}_{suffix}"
    setproctitle.setproctitle(f"{prefix}::{name}")


def _add_prefix(file: TextIO, worker_name: str, pid: int) -> None:
    """Add colored prefix to file output for log decoration."""
    prefix = f"{CYAN}({worker_name} pid={pid}){RESET} "
    file_write = file.write

    def write_with_prefix(s: str):
        if not s:
            return
        if file.start_new_line:  # type: ignore[attr-defined]
            file_write(prefix)
        idx = 0
        while (next_idx := s.find("\n", idx)) != -1:
            next_idx += 1
            file_write(s[idx:next_idx])
            if next_idx == len(s):
                file.start_new_line = True  # type: ignore[attr-defined]
                return
            file_write(prefix)
            idx = next_idx
        file_write(s[idx:])
        file.start_new_line = False  # type: ignore[attr-defined]

    file.start_new_line = True  # type: ignore[attr-defined]
    file.write = write_with_prefix  # type: ignore[method-assign]


def decorate_logs(process_name: str | None = None) -> None:
    """Decorate stdout/stderr with process name and PID prefix."""
    from vllm.utils import get_mp_context

    if process_name is None:
        process_name = get_mp_context().current_process().name
    pid = os.getpid()
    _add_prefix(sys.stdout, process_name, pid)
    _add_prefix(sys.stderr, process_name, pid)
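Illustrative usage of the relocated helpers (not part of the diff); the process title, suffix, and environment variable shown are assumptions:

```python
from vllm.utils.system_utils import decorate_logs, set_env_var, set_process_title

set_process_title("EngineCore", suffix="dp0")  # title becomes "<prefix>::EngineCore_dp0"
decorate_logs("EngineCore")  # stdout/stderr lines gain a "(EngineCore pid=...)" prefix

with set_env_var("VLLM_LOGGING_LEVEL", "DEBUG"):  # restored (or removed) on exit
    pass
```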
@ -658,7 +658,10 @@ class FlexAttentionMetadataBuilder(AttentionMetadataBuilder[FlexAttentionMetadat
            total_cache_tokens=total_cache_tokens,
            decode_offset=offset_tensor,
            num_blocks_per_seq=num_blocks_per_seq,
            direct_build=self.direct_build,
            # FIXME(Isotr0py): direct build has issue to build bidirectional
            # attention block mask for encoder-only models, disable it temporarily.
            # see: https://github.com/vllm-project/vllm/pull/27329#issuecomment-3431484053
            direct_build=(self.direct_build and common_attn_metadata.causal),
            q_block_size=self.q_block_size,
            kv_block_size=self.kv_block_size,
        )

@ -89,10 +89,9 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
        self.use_full_cuda_graph = (
            self.compilation_config.cudagraph_mode.has_full_cudagraphs()
        )
        self.max_cudagraph_size = self.compilation_config.max_capture_size

        if self.use_full_cuda_graph and self.fa_aot_schedule:
            self.max_cudagraph_size = self.compilation_config.max_capture_size

            if self.max_cudagraph_size > 992:
                # This condition derives from FA3's internal heuristic.
                # TODO(woosuk): Support larger cudagraph sizes.
@ -114,7 +113,14 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
            self.max_num_splits = 1

    def _schedule_decode(
        self, num_reqs, cu_query_lens, max_query_len, seqlens, max_seq_len, causal
        self,
        num_reqs,
        cu_query_lens,
        max_query_len,
        seqlens,
        max_seq_len,
        causal,
        max_num_splits,
    ):
        if self.fa_aot_schedule:
            return get_scheduler_metadata(
@ -130,7 +136,7 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
                page_size=self.page_size,
                cu_seqlens_q=cu_query_lens,
                causal=causal,
                num_splits=self.max_num_splits,
                num_splits=max_num_splits,
            )
        return None

@ -148,6 +154,15 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
        max_query_len = query_lens_cpu.max().item()
        max_seq_len = seq_lens_device.max().item()

        # For Flash Attention MLA + full cudagraph
        max_num_splits = 0
        if self.use_full_cuda_graph and num_decode_tokens <= self.max_cudagraph_size:
            # NOTE(woosuk): Setting num_splits > 1 may increase the memory
            # usage, because the intermediate buffers of size [num_splits,
            # num_heads, num_tokens, head_size] are allocated. Therefore,
            # we only set num_splits when using cuda graphs.
            max_num_splits = self.max_num_splits

        scheduler_metadata = self._schedule_decode(
            num_reqs=seq_lens_cpu.numel(),
            cu_query_lens=query_start_loc_device,
@ -155,10 +170,9 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
            seqlens=seq_lens_device,
            max_seq_len=max_seq_len,
            causal=True,
            max_num_splits=max_num_splits,
        )

        # For FA3 + full cudagraph
        max_num_splits = 0
        if self.use_full_cuda_graph and scheduler_metadata is not None:
            n = scheduler_metadata.shape[0]
            # Ensure the persistent buffer is large enough
@ -174,13 +188,6 @@ class FlashAttnMLAMetadataBuilder(MLACommonMetadataBuilder[FlashAttnMLAMetadata]
                self.scheduler_metadata[n:] = 0
            scheduler_metadata = self.scheduler_metadata[:n]

            if num_decode_tokens <= self.max_cudagraph_size:
                # NOTE(woosuk): Setting num_splits > 1 may increase the memory
                # usage, because the intermediate buffers of size [num_splits,
                # num_heads, num_tokens, head_size] are allocated. Therefore,
                # we only set num_splits when using cuda graphs.
                max_num_splits = self.max_num_splits

        if vllm_is_batch_invariant():
            max_num_splits = 1


@ -373,7 +373,7 @@ def need_extra_keys(request: Request) -> bool:
    """

    # Multimodal requests need to include the MM hash.
    # LoRA requests need to include the LoRA ID.
    # LoRA requests need to include the LoRA name.
    # Request with provided cache salt need to include the salt.
    return (
        bool(request.mm_features)
@ -446,26 +446,26 @@ def _gen_mm_extra_hash_keys(
    return extra_keys, curr_mm_idx


def _gen_lora_extra_hash_keys(request: Request) -> list[int]:
def _gen_lora_extra_hash_keys(request: Request) -> list[str]:
    """Generate extra keys related to LoRA for block hash computation.

    Args:
        request: The request object.

    Returns:
        Return LoRA id of the request if it is a LoRA request. Return empty
        Return LoRA name of the request if it is a LoRA request. Return empty
        list otherwise.
    """
    if not request.lora_request:
        return []
    return [request.lora_request.lora_int_id]
    return [request.lora_request.lora_name]


def generate_block_hash_extra_keys(
    request: Request, start_token_idx: int, end_token_idx: int, start_mm_idx: int
) -> tuple[tuple[Any, ...] | None, int]:
    """Generate extra keys for the block hash. The extra keys can come from
    the multi-modal inputs and request specific metadata (e.g., LoRA ID).
    the multi-modal inputs and request specific metadata (e.g., LoRA name).

    Args:
        request: The request object.
@ -480,7 +480,7 @@ def generate_block_hash_extra_keys(
    mm_extra_keys, new_start_mm_idx = _gen_mm_extra_hash_keys(
        request, start_token_idx, end_token_idx, start_mm_idx
    )
    lora_extra_keys: list[int] = _gen_lora_extra_hash_keys(request)
    lora_extra_keys: list[str] = _gen_lora_extra_hash_keys(request)
    cache_salt_keys: list[str] = (
        [request.cache_salt] if (start_token_idx == 0 and request.cache_salt) else []
    )
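The switch from `lora_int_id` to `lora_name` changes what feeds the prefix-cache block hash. A rough sketch (not part of the diff, with a simplified hash and an assumed adapter name) of how the extra keys are folded into each block hash:

```python
# Simplified illustration: the LoRA identifier is part of every block hash,
# so two requests only share cached KV blocks when their extra keys match.
parent_hash = 0
block_token_ids = (101, 102, 103, 104)
extra_keys = ("sql-adapter",)  # request.lora_request.lora_name (previously lora_int_id)
block_hash = hash((parent_hash, block_token_ids, extra_keys))
print(block_hash)
```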

@ -10,8 +10,9 @@ import zmq

from vllm.config import ParallelConfig
from vllm.logger import init_logger
from vllm.utils import get_mp_context, set_process_title
from vllm.utils import get_mp_context
from vllm.utils.network_utils import make_zmq_socket
from vllm.utils.system_utils import set_process_title
from vllm.v1.engine import EngineCoreOutputs, EngineCoreRequestType
from vllm.v1.serial_utils import MsgpackDecoder
from vllm.v1.utils import get_engine_client_zmq_addr, shutdown

@ -28,14 +28,11 @@ from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.cache import engine_receiver_cache_from_config
from vllm.tasks import POOLING_TASKS, SupportedTask
from vllm.transformers_utils.config import maybe_register_config_serialize_by_value
from vllm.utils import (
    decorate_logs,
    set_process_title,
)
from vllm.utils.gc_utils import maybe_attach_gc_debug_callback
from vllm.utils.hashing import get_hash_fn_by_name
from vllm.utils.import_utils import resolve_obj_by_qualname
from vllm.utils.network_utils import make_zmq_socket
from vllm.utils.system_utils import decorate_logs, set_process_title
from vllm.v1.core.kv_cache_utils import (
    BlockHash,
    generate_scheduler_kv_cache_config,

@ -35,17 +35,13 @@ from vllm.distributed.parallel_state import (
)
from vllm.envs import enable_envs_cache
from vllm.logger import init_logger
from vllm.utils import (
    _maybe_force_spawn,
    decorate_logs,
    get_mp_context,
    set_process_title,
)
from vllm.utils import _maybe_force_spawn, get_mp_context
from vllm.utils.network_utils import (
    get_distributed_init_method,
    get_loopback_ip,
    get_open_port,
)
from vllm.utils.system_utils import decorate_logs, set_process_title
from vllm.v1.core.sched.output import SchedulerOutput
from vllm.v1.executor.abstract import Executor, FailureCallback
from vllm.v1.outputs import AsyncModelRunnerOutput, DraftTokenIds, ModelRunnerOutput

@ -16,10 +16,10 @@ from vllm.multimodal.cache import worker_receiver_cache_from_config
from vllm.utils import (
    enable_trace_function_call_for_thread,
    run_method,
    update_environment_variables,
    warn_for_unimplemented_methods,
)
from vllm.utils.import_utils import resolve_obj_by_qualname
from vllm.utils.system_utils import update_environment_variables
from vllm.v1.kv_cache_interface import KVCacheSpec

if TYPE_CHECKING: