[Deprecation][2/N] Replace --task with --runner and --convert (#21470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
|
||||
vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
|
||||
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
|
||||
```
|
||||
|
||||
@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
|
||||
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
|
||||
```
|
||||
|
||||
Then, you can use the OpenAI client as follows:
|
||||
|
||||
@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
|
||||
First, launch the OpenAI-compatible server:
|
||||
|
||||
```bash
|
||||
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
|
||||
vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
|
||||
--max-model-len 4096 --enable-prompt-embeds
|
||||
```
|
||||
|
||||
|
||||
@ -2,12 +2,19 @@
|
||||
|
||||
vLLM provides first-class support for generative models, which covers most of LLMs.
|
||||
|
||||
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
|
||||
In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
|
||||
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
|
||||
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
|
||||
|
||||
For generative models, the only supported `--task` option is `"generate"`.
|
||||
Usually, this is automatically inferred so you don't have to specify it.
|
||||
## Configuration
|
||||
|
||||
### Model Runner (`--runner`)
|
||||
|
||||
Run a model in generation mode via the option `--runner generate`.
|
||||
|
||||
!!! tip
|
||||
There is no need to set this option in the vast majority of cases as vLLM can automatically
|
||||
detect the model runner to use via `--runner auto`.
|
||||
|
||||
## Offline Inference
|
||||
|
||||
|
||||
@ -1,9 +1,9 @@
|
||||
# Pooling Models
|
||||
|
||||
vLLM also supports pooling models, including embedding, reranking and reward models.
|
||||
vLLM also supports pooling models, such as embedding, classification and reward models.
|
||||
|
||||
In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
|
||||
These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
|
||||
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
|
||||
before returning them.
|
||||
|
||||
!!! note
|
||||
@ -11,18 +11,39 @@ before returning them.
|
||||
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
|
||||
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
|
||||
|
||||
If the model doesn't implement this interface, you can set `--task` which tells vLLM
|
||||
to convert the model into a pooling model.
|
||||
## Configuration
|
||||
|
||||
| `--task` | Model type | Supported pooling tasks |
|
||||
|------------|----------------------|-------------------------------|
|
||||
| `embed` | Embedding model | `encode`, `embed` |
|
||||
| `classify` | Classification model | `encode`, `classify`, `score` |
|
||||
| `reward` | Reward model | `encode` |
|
||||
### Model Runner
|
||||
|
||||
## Pooling Tasks
|
||||
Run a model in pooling mode via the option `--runner pooling`.
|
||||
|
||||
In vLLM, we define the following pooling tasks and corresponding APIs:
|
||||
!!! tip
|
||||
There is no need to set this option in the vast majority of cases as vLLM can automatically
|
||||
detect the model runner to use via `--runner auto`.
|
||||
|
||||
### Model Conversion
|
||||
|
||||
vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
|
||||
|
||||
If `--runner pooling` has been set (manually or automatically) but the model does not implement the
|
||||
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
|
||||
vLLM will attempt to automatically convert the model according to the architecture names
|
||||
shown in the table below.
|
||||
|
||||
| Architecture | `--convert` | Supported pooling tasks |
|
||||
|-------------------------------------------------|-------------|-------------------------------|
|
||||
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
|
||||
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
|
||||
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
|
||||
|
||||
!!! tip
|
||||
You can explicitly set `--convert <type>` to specify how to convert the model.
|
||||
|
||||
### Pooling Tasks
|
||||
|
||||
Each pooling model in vLLM supports one or more of these tasks according to
|
||||
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
|
||||
enabling the corresponding APIs:
|
||||
|
||||
| Task | APIs |
|
||||
|------------|--------------------|
|
||||
@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
|
||||
| `classify` | `classify` |
|
||||
| `score` | `score` |
|
||||
|
||||
\*The `score` API falls back to `embed` task if the model does not support `score` task.
|
||||
\* The `score` API falls back to `embed` task if the model does not support `score` task.
|
||||
|
||||
Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
|
||||
### Pooler Configuration
|
||||
|
||||
By default, the pooler assigned to each task has the following attributes:
|
||||
#### Predefined models
|
||||
|
||||
If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
|
||||
you can override some of its attributes via the `--override-pooler-config` option.
|
||||
|
||||
#### Converted models
|
||||
|
||||
If the model has been converted via `--convert` (see above),
|
||||
the pooler assigned to each task has the following attributes by default:
|
||||
|
||||
| Task | Pooling Type | Normalization | Softmax |
|
||||
|------------|----------------|---------------|---------|
|
||||
@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
|
||||
| `embed` | `LAST` | ✅︎ | ❌ |
|
||||
| `classify` | `LAST` | ❌ | ✅︎ |
|
||||
|
||||
These defaults may be overridden by the model's implementation in vLLM.
|
||||
|
||||
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
|
||||
we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
|
||||
which takes priority over the model's defaults.
|
||||
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
|
||||
|
||||
You can further customize this via the `--override-pooler-config` option,
|
||||
which takes priority over both the model's and Sentence Transformers's defaults.
|
||||
|
||||
!!! note
|
||||
|
||||
The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
|
||||
that is not based on [PoolerConfig][vllm.config.PoolerConfig].
|
||||
|
||||
## Offline Inference
|
||||
|
||||
The [LLM][vllm.LLM] class provides various methods for offline inference.
|
||||
@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
|
||||
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
|
||||
(output,) = llm.encode("Hello, my name is")
|
||||
|
||||
data = output.outputs.data
|
||||
@ -85,7 +106,7 @@ It is primarily designed for embedding models.
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
|
||||
llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
|
||||
(output,) = llm.embed("Hello, my name is")
|
||||
|
||||
embeds = output.outputs.embedding
|
||||
@ -102,7 +123,7 @@ It is primarily designed for classification models.
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
|
||||
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
|
||||
(output,) = llm.classify("Hello, my name is")
|
||||
|
||||
probs = output.outputs.probs
|
||||
@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
|
||||
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
|
||||
(output,) = llm.score("What is the capital of France?",
|
||||
"The capital of Brazil is Brasilia.")
|
||||
|
||||
@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
|
||||
from vllm import LLM, PoolingParams
|
||||
|
||||
llm = LLM(model="jinaai/jina-embeddings-v3",
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
trust_remote_code=True)
|
||||
outputs = llm.embed(["Follow the white rabbit."],
|
||||
pooling_params=PoolingParams(dimensions=32))
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
# Supported Models
|
||||
|
||||
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
|
||||
If a model supports more than one task, you can set the task via the `--task` argument.
|
||||
|
||||
For each task, we list the model architectures that have been implemented in vLLM.
|
||||
Alongside each architecture, we include some popular models that use it.
|
||||
@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
llm = LLM(model=..., task="generate") # Name or path of your model
|
||||
llm = LLM(model=...) # Name or path of your model
|
||||
llm.apply_model(lambda model: print(type(model)))
|
||||
```
|
||||
|
||||
@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
# For generative models (task=generate) only
|
||||
llm = LLM(model=..., task="generate") # Name or path of your model
|
||||
# For generative models (runner=generate) only
|
||||
llm = LLM(model=..., runner="generate") # Name or path of your model
|
||||
output = llm.generate("Hello, my name is")
|
||||
print(output)
|
||||
|
||||
# For pooling models (task={embed,classify,reward,score}) only
|
||||
llm = LLM(model=..., task="embed") # Name or path of your model
|
||||
# For pooling models (runner=pooling) only
|
||||
llm = LLM(model=..., runner="pooling") # Name or path of your model
|
||||
output = llm.encode("Hello, my name is")
|
||||
print(output)
|
||||
```
|
||||
@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
|
||||
llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)
|
||||
|
||||
# For generative models (task=generate) only
|
||||
# For generative models (runner=generate) only
|
||||
output = llm.generate("Hello, my name is")
|
||||
print(output)
|
||||
|
||||
# For pooling models (task={embed,classify,reward,score}) only
|
||||
# For pooling models (runner=pooling) only
|
||||
output = llm.encode("Hello, my name is")
|
||||
print(output)
|
||||
```
|
||||
@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat
|
||||
|
||||
#### Text Generation
|
||||
|
||||
Specified using `--task generate`.
|
||||
|
||||
<style>
|
||||
th {
|
||||
white-space: nowrap;
|
||||
@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling
|
||||
|
||||
!!! important
|
||||
Since some model architectures support both generative and pooling tasks,
|
||||
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
|
||||
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
|
||||
|
||||
#### Text Embedding
|
||||
|
||||
Specified using `--task embed`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
|
||||
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
|
||||
| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
|
||||
| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
|
||||
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
|
||||
| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
|
||||
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
|
||||
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
|
||||
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
|
||||
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
|
||||
| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
|
||||
| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
|
||||
| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
|
||||
| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
|
||||
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
|
||||
|
||||
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
!!! note
|
||||
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
|
||||
@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding
|
||||
|
||||
#### Reward Modeling
|
||||
|
||||
Specified using `--task reward`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
|
||||
|
||||
<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
If your model is not in the above list, we will try to automatically convert the model using
|
||||
[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
|
||||
@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the
|
||||
|
||||
#### Classification
|
||||
|
||||
Specified using `--task classify`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
|
||||
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
|
||||
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
|
||||
|
||||
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
If your model is not in the above list, we will try to automatically convert the model using
|
||||
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
|
||||
|
||||
#### Sentence Pair Scoring
|
||||
|
||||
Specified using `--task score`.
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
|
||||
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
|
||||
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
|
||||
|
||||
| Architecture | Models | Example HF Models | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|---------------------|
|
||||
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
|
||||
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
|
||||
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
|
||||
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
|
||||
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
|
||||
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
|
||||
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
!!! note
|
||||
Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
|
||||
@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat
|
||||
|
||||
#### Text Generation
|
||||
|
||||
Specified using `--task generate`.
|
||||
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
|
||||
@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
|
||||
|
||||
#### Transcription
|
||||
|
||||
Specified using `--task transcription`.
|
||||
|
||||
Speech2Text models trained specifically for Automatic Speech Recognition.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling
|
||||
|
||||
!!! important
|
||||
Since some model architectures support both generative and pooling tasks,
|
||||
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
|
||||
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
|
||||
|
||||
#### Text Embedding
|
||||
|
||||
Specified using `--task embed`.
|
||||
|
||||
Any text generation model can be converted into an embedding model by passing `--task embed`.
|
||||
|
||||
!!! note
|
||||
To get the best results, you should use pooling models that are specifically trained as such.
|
||||
|
||||
@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM.
|
||||
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
|
||||
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
|
||||
| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
|
||||
| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
|
||||
| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
|
||||
|
||||
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
---
|
||||
|
||||
#### Scoring
|
||||
|
||||
Specified using `--task score`.
|
||||
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
|
||||
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
|
||||
|
||||
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
|
||||
\* Feature support is the same as that of the original model.
|
||||
|
||||
## Model Support Policy
|
||||
|
||||
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
|
||||
|
||||
@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
|
||||
We currently support the following OpenAI APIs:
|
||||
|
||||
- [Completions API][completions-api] (`/v1/completions`)
|
||||
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
|
||||
- Only applicable to [text generation models](../models/generative_models.md).
|
||||
- *Note: `suffix` parameter is not supported.*
|
||||
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
|
||||
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
|
||||
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
|
||||
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
|
||||
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
|
||||
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
|
||||
- Only applicable to [embedding models](../models/pooling_models.md).
|
||||
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
|
||||
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
|
||||
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
|
||||
- [Translation API][translations-api] (`/v1/audio/translations`)
|
||||
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
|
||||
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
|
||||
|
||||
In addition, we have the following custom APIs:
|
||||
|
||||
@ -64,14 +64,14 @@ In addition, we have the following custom APIs:
|
||||
- [Pooling API][pooling-api] (`/pooling`)
|
||||
- Applicable to all [pooling models](../models/pooling_models.md).
|
||||
- [Classification API][classification-api] (`/classify`)
|
||||
- Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
|
||||
- Only applicable to [classification models](../models/pooling_models.md).
|
||||
- [Score API][score-api] (`/score`)
|
||||
- Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
|
||||
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
|
||||
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
|
||||
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
|
||||
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
|
||||
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
|
||||
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
|
||||
- Only applicable to [cross-encoder models](../models/pooling_models.md).
|
||||
|
||||
[](){ #chat-template }
|
||||
|
||||
@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
||||
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
|
||||
--trust-remote-code \
|
||||
--max-model-len 4096 \
|
||||
--chat-template examples/template_vlm2vec.jinja
|
||||
```
|
||||
|
||||
!!! important
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
|
||||
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
|
||||
to run this model in embedding mode instead of text generation mode.
|
||||
|
||||
The custom chat template is completely different from the original one for this model,
|
||||
@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
|
||||
To serve the model:
|
||||
|
||||
```bash
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
|
||||
--trust-remote-code \
|
||||
--max-model-len 8192 \
|
||||
--chat-template examples/template_dse_qwen2_vl.jinja
|
||||
```
|
||||
|
||||
!!! important
|
||||
Like with VLM2Vec, we have to explicitly pass `--task embed`.
|
||||
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
|
||||
|
||||
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
|
||||
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
|
||||
|
||||
Reference in New Issue
Block a user