[Misc] Split up pooling tasks (#10820)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Cyrus Leung
2024-12-11 17:28:00 +08:00
committed by GitHub
parent 40766ca1b8
commit 8f10d5e393
27 changed files with 527 additions and 168 deletions


@ -94,6 +94,8 @@ Documentation
:caption: Models
models/supported_models
models/generative_models
models/pooling_models
models/adding_model
models/enabling_multimodal_inputs


@ -0,0 +1,146 @@
.. _generative_models:

Generative Models
=================

vLLM provides first-class support for generative models, which covers most LLMs.

In vLLM, generative models implement the :class:`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through :class:`~vllm.model_executor.layers.Sampler` to obtain the final text.
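For intuition, the logits-to-token step can be sketched in plain Python. This is a toy illustration, not vLLM's actual :class:`~vllm.model_executor.layers.Sampler`; the vocabulary and logits below are made up:

```python
import math

def softmax(logits):
    # Shift by the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and next-token logits; a real model emits one logit per vocab entry.
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy pick, i.e. temperature=0
print(next_token)
```

With a nonzero temperature, the sampler would instead draw from ``probs`` at random rather than always taking the argmax.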
Offline Inference
-----------------

The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.

For generative models, the only supported :code:`task` option is :code:`"generate"`.
Usually, this is automatically inferred, so you don't have to specify it.
``LLM.generate``
^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM.generate` method is available to all generative models in vLLM.
It is similar to `its counterpart in HF Transformers <https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate>`__,
except that tokenization and detokenization are also performed automatically.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate("Hello, my name is")

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

You can optionally control the language generation by passing :class:`~vllm.SamplingParams`.
For example, you can use greedy sampling by setting :code:`temperature=0`:

.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0)
    outputs = llm.generate("Hello, my name is", params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
``LLM.beam_search``
^^^^^^^^^^^^^^^^^^^
The :class:`~vllm.LLM.beam_search` method implements `beam search <https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding>`__ on top of :class:`~vllm.LLM.generate`.
For example, to search using 5 beams and output at most 50 tokens:

.. code-block:: python

    from vllm import LLM
    from vllm.sampling_params import BeamSearchParams

    llm = LLM(model="facebook/opt-125m")
    params = BeamSearchParams(beam_width=5, max_tokens=50)
    outputs = llm.beam_search(["Hello, my name is"], params)

    for output in outputs:
        generated_text = output.sequences[0].text
        print(f"Generated text: {generated_text!r}")

``LLM.chat``
^^^^^^^^^^^^
The :class:`~vllm.LLM.chat` method implements chat functionality on top of :class:`~vllm.LLM.generate`.
In particular, it accepts input similar to the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
and automatically applies the model's `chat template <https://huggingface.co/docs/transformers/en/chat_templating>`__ to format the prompt.

.. important::

    In general, only instruction-tuned models have a chat template.
    Base models may perform poorly as they are not trained to respond to chat conversations.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! How can I assist you today?"
        },
        {
            "role": "user",
            "content": "Write an essay about the importance of higher education.",
        },
    ]
    outputs = llm.chat(conversation)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found in `examples/offline_inference_chat.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py>`_.
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:

.. code-block:: python

    from vllm.entrypoints.chat_utils import load_chat_template

    # You can find a list of existing chat templates under `examples/`
    custom_template = load_chat_template(chat_template="<path_to_template>")
    print("Loaded chat template:", custom_template)

    outputs = llm.chat(conversation, chat_template=custom_template)
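Conceptually, applying a chat template just renders the message list into a single prompt string for the model to continue. The sketch below is a toy stand-in, not a real template (which is a Jinja2 string shipped with the model); the ``<|role|>`` markers are made up:

```python
def render_chat(messages):
    # Render an OpenAI-style message list into one prompt string.
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append("<|assistant|>\n")  # cue for the model to start its reply
    return "\n".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```

Models differ in the markers their templates emit, which is why the template must match the model.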
Online Inference
----------------

Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.

Completions API
^^^^^^^^^^^^^^^

Our Completions API is similar to ``LLM.generate`` but only accepts text.
It is compatible with the `OpenAI Completions API <https://platform.openai.com/docs/api-reference/completions>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
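As a sketch of the raw request shape, assuming a server was started with ``vllm serve facebook/opt-125m`` on the default port 8000 (model name and port are illustrative):

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "Hello, my name is",
        "max_tokens": 16,
        "temperature": 0
    }'
```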
Chat API
^^^^^^^^

Our Chat API is similar to ``LLM.chat``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
It is compatible with the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_chat_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py>`_.


@ -0,0 +1,99 @@
.. _pooling_models:

Pooling Models
==============

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a :class:`~vllm.model_executor.layers.Pooler` to aggregate the final hidden states of the input
before returning them.

.. note::

    We currently support pooling models primarily as a matter of convenience.
    As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features are not applicable to
    pooling models as they only work on the generation or decode stage, so performance may not improve as much.
Offline Inference
-----------------

The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.

For pooling models, we support the following :code:`task` options:

- Embedding (:code:`"embed"` / :code:`"embedding"`)
- Classification (:code:`"classify"`)
- Sentence Pair Scoring (:code:`"score"`)
- Reward Modeling (:code:`"reward"`)

The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used:

- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.

When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models,
we attempt to override the default pooler based on the model's Sentence Transformers configuration file (:code:`modules.json`).

You can customize the model's pooling method via the :code:`override_pooler_config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

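The default poolers described above can be sketched in plain Python. This is a toy illustration with made-up hidden states; the function names are illustrative and are not vLLM's actual :class:`~vllm.model_executor.layers.Pooler` API:

```python
import math

# Toy final hidden states for a 4-token prompt (hidden size 3).
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
    [3.0, 4.0, 0.0],
]

def embed_pool(states):
    # "embed": last token's hidden state, L2-normalized.
    last = states[-1]
    norm = math.sqrt(sum(x * x for x in last))
    return [x / norm for x in last]

def classify_pool(states):
    # "classify" / "score": last token's hidden state, softmaxed.
    last = states[-1]
    m = max(last)
    exps = [math.exp(x - m) for x in last]
    total = sum(exps)
    return [e / total for e in exps]

def reward_pool(states):
    # "reward": all hidden states, returned as-is.
    return states

embedding = embed_pool(hidden_states)
print(embedding)
```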
``LLM.encode``
^^^^^^^^^^^^^^

The :class:`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the aggregated hidden states directly.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
    outputs = llm.encode("Hello, my name is")

    for output in outputs:
        embeddings = output.outputs.embedding
        print(f"Prompt: {output.prompt!r}, Embeddings (size={len(embeddings)}): {embeddings!r}")

A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_.
``LLM.score``
^^^^^^^^^^^^^

The :class:`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__.
These types of models serve as rerankers between candidate query-document pairs in RAG systems.

.. note::

    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_.

You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/models/embedding/language/test_scoring.py>`_ as reference.
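For intuition only: a bi-encoder embeds the query and each document separately and ranks by a similarity such as cosine, sketched below with made-up vectors; a cross-encoder like those served by ``LLM.score`` instead feeds each query-document pair through the model jointly to produce its score:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # points the same way as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank candidate documents by similarity to the query, highest first,
# as a reranker would rank them by score.
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
print(ranked)
```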
Online Inference
----------------

Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.

Embeddings API
^^^^^^^^^^^^^^

Our Embeddings API is similar to ``LLM.encode``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.

The text-only API is compatible with the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_.

The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
that incorporates the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__,
so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it.
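As a sketch of the text-only request shape, assuming a server was started with ``vllm serve intfloat/e5-mistral-7b-instruct --task embed`` on the default port 8000 (model name and port are illustrative):

```shell
curl http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "Hello, my name is"
    }'
```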
Score API
^^^^^^^^^
Our Score API is similar to ``LLM.score``.
Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it.


@ -3,11 +3,21 @@
Supported Models
================

vLLM supports generative and pooling models across various tasks.
If a model supports more than one task, you can set the task via the :code:`--task` argument.

For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.

Loading a Model
^^^^^^^^^^^^^^^

HuggingFace Hub
+++++++++++++++

By default, vLLM loads models from the `HuggingFace (HF) Hub <https://huggingface.co/models>`_.

To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
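The check itself amounts to reading one JSON field. A quick sketch, using a toy inline ``config.json`` and a made-up stand-in for the architecture tables below:

```python
import json

# Toy stand-in for a downloaded config.json; a real one comes from the HF repository.
config_text = '{"architectures": ["LlamaForCausalLM"], "hidden_size": 4096}'

# Illustrative stand-in for the supported-architecture tables below.
SUPPORTED_ARCHITECTURES = {"LlamaForCausalLM", "OPTForCausalLM"}

config = json.loads(config_text)
is_supported = any(arch in SUPPORTED_ARCHITECTURES for arch in config.get("architectures", []))
print(is_supported)
```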
.. tip::
@ -17,38 +27,57 @@ If the :code:`"architectures"` field contains a model architecture listed below,
from vllm import LLM

# For generative models (task=generate) only
llm = LLM(model=..., task="generate")  # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)

# For pooling models (task={embed,classify,reward}) only
llm = LLM(model=..., task="embed")  # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)

If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.

Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
for instructions on how to implement your model in vLLM.
Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.

ModelScope
++++++++++

To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:

.. code-block:: shell

    $ export VLLM_USE_MODELSCOPE=True

And use with :code:`trust_remote_code=True`.

.. code-block:: python

    from vllm import LLM

    llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)  # Name or path of your model

    # For generative models (task=generate) only
    output = llm.generate("Hello, my name is")
    print(output)

    # For pooling models (task={embed,classify,reward}) only
    output = llm.encode("Hello, my name is")
    print(output)

List of Text-only Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------

.. list-table::
:widths: 25 25 50 5 5
@ -328,8 +357,24 @@ Text Generation
.. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::

    Since some model architectures support both generative and pooling tasks,
    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::

    To get the best results, you should use pooling models that are specifically trained as such.
    The following table lists those that are tested in vLLM.

.. list-table::
:widths: 25 25 50 5 5
@ -371,13 +416,6 @@ Text Embedding
-
-
.. tip::

    You can override the model's pooling method by passing :code:`--override-pooler-config`.

.. note::

    :code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
    You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
@ -389,8 +427,8 @@ Text Embedding
On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
despite being described otherwise on its model card.
Reward Modeling (``--task reward``)
-----------------------------------
.. list-table::
:widths: 25 25 50 5 5
@ -416,11 +454,8 @@ Reward Modeling
For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
.. note::

    As an interim measure, these models are supported in both offline and online inference via Embeddings API.

Classification (``--task classify``)
------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@ -437,11 +472,8 @@ Classification
- ✅︎
- ✅︎
.. note::

    As an interim measure, these models are supported in both offline and online inference via Embeddings API.

Sentence Pair Scoring (``--task score``)
----------------------------------------
.. list-table::
:widths: 25 25 50 5 5
@ -468,13 +500,10 @@ Sentence Pair Scoring
-
-
.. note::

    These models are supported in both offline and online inference via Score API.

.. _supported_mm_models:

List of Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following modalities are supported depending on the model:
@ -491,8 +520,15 @@ On the other hand, modalities separated by :code:`/` are mutually exclusive.
- e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.

Generative Models
+++++++++++++++++

See :ref:`this page <generative_models>` for more information on how to use generative models.

Text Generation (``--task generate``)
-------------------------------------

.. list-table::
:widths: 25 25 15 20 5 5 5
@ -696,8 +732,24 @@ Text Generation
The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
Pooling Models
++++++++++++++

See :ref:`this page <pooling_models>` for more information on how to use pooling models.

.. important::

    Since some model architectures support both generative and pooling tasks,
    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.

Text Embedding (``--task embed``)
---------------------------------

Any text generation model can be converted into an embedding model by passing :code:`--task embed`.

.. note::

    To get the best results, you should use pooling models that are specifically trained as such.
    The following table lists those that are tested in vLLM.

.. list-table::
:widths: 25 25 15 25 5 5
@ -728,12 +780,7 @@ Multimodal Embedding
-
- ✅︎
.. tip::

    You can override the model's pooling method by passing :code:`--override-pooler-config`.

----

Model Support Policy
====================


@ -39,13 +39,13 @@ Feature x Feature
- :abbr:`prmpt adptr (Prompt Adapter)`
- :ref:`SD <spec_decode>`
- CUDA graph
- :abbr:`pooling (Pooling Models)`
- :abbr:`enc-dec (Encoder-Decoder Models)`
- :abbr:`logP (Logprobs)`
- :abbr:`prmpt logP (Prompt Logprobs)`
- :abbr:`async output (Async Output Processing)`
- multi-step
- :abbr:`mm (Multimodal Inputs)`
- best-of
- beam-search
- :abbr:`guided dec (Guided Decoding)`
@ -151,7 +151,7 @@ Feature x Feature
-
-
-
* - :abbr:`pooling (Pooling Models)`
- ✗
- ✗
- ✗
@ -253,7 +253,7 @@ Feature x Feature
-
-
-
* - :abbr:`mm (Multimodal Inputs)`
- ✅
- `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
- `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
@ -386,7 +386,7 @@ Feature x Hardware
- ✅
- ✗
- ✅
* - :abbr:`pooling (Pooling Models)`
- ✅
- ✅
- ✅
@ -402,7 +402,7 @@ Feature x Hardware
- ✅
- ✅
- ✗
* - :abbr:`mm (Multimodal Inputs)`
- ✅
- ✅
- ✅