[Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@@ -94,6 +94,8 @@ Documentation
    :caption: Models
 
    models/supported_models
+   models/generative_models
+   models/pooling_models
    models/adding_model
    models/enabling_multimodal_inputs
146  docs/source/models/generative_models.rst  (new file)

@@ -0,0 +1,146 @@
.. _generative_models:

Generative Models
=================

vLLM provides first-class support for generative models, which cover most large language models (LLMs).

In vLLM, generative models implement the :class:`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through :class:`~vllm.model_executor.layers.Sampler` to obtain the final text.
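The following toy sketch (standalone code, not vLLM's actual classes; the tensor shapes and greedy selection are illustrative assumptions) shows this flow, with :code:`argmax` standing in for the sampler:

.. code-block:: python

    import torch
    import torch.nn as nn

    # Toy stand-in for a generative model's head: project final hidden states
    # to per-token logits, then pick (here: greedily) the next token.
    class ToyLMHead(nn.Module):
        def __init__(self, hidden_size: int, vocab_size: int):
            super().__init__()
            self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

        def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
            return self.lm_head(hidden_states)

    hidden = torch.randn(1, 8, 64)  # (batch, seq_len, hidden_size)
    logits = ToyLMHead(64, 1000).compute_logits(hidden)
    next_token = torch.argmax(logits[:, -1, :], dim=-1)  # greedy "sampling"
    print(next_token.shape)  # torch.Size([1])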
Offline Inference
-----------------

The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.

For generative models, the only supported :code:`task` option is :code:`"generate"`.
Usually, this is automatically inferred so you don't have to specify it.

``LLM.generate``
^^^^^^^^^^^^^^^^

The :class:`~vllm.LLM.generate` method is available to all generative models in vLLM.
It is similar to `its counterpart in HF Transformers <https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate>`__,
except that tokenization and detokenization are also performed automatically.
.. code-block:: python

    from vllm import LLM

    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate("Hello, my name is")

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
You can optionally control the language generation by passing :class:`~vllm.SamplingParams`.
For example, you can use greedy sampling by setting :code:`temperature=0`:
.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0)
    outputs = llm.generate("Hello, my name is", params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
``LLM.beam_search``
^^^^^^^^^^^^^^^^^^^

The :class:`~vllm.LLM.beam_search` method implements `beam search <https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding>`__ on top of :class:`~vllm.LLM.generate`.
For example, to search using 5 beams and output at most 50 tokens:
.. code-block:: python

    from vllm import LLM
    from vllm.sampling_params import BeamSearchParams

    llm = LLM(model="facebook/opt-125m")
    params = BeamSearchParams(beam_width=5, max_tokens=50)
    outputs = llm.beam_search([{"prompt": "Hello, my name is"}], params)

    for output in outputs:
        # Each output holds the ranked beam sequences; take the best one.
        generated_text = output.sequences[0].text
        print(f"Generated text: {generated_text!r}")
``LLM.chat``
^^^^^^^^^^^^

The :class:`~vllm.LLM.chat` method implements chat functionality on top of :class:`~vllm.LLM.generate`.
In particular, it accepts input similar to the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
and automatically applies the model's `chat template <https://huggingface.co/docs/transformers/en/chat_templating>`__ to format the prompt.

.. important::

    In general, only instruction-tuned models have a chat template.
    Base models may perform poorly as they are not trained to respond to chat conversations.
.. code-block:: python

    from vllm import LLM

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! How can I assist you today?"
        },
        {
            "role": "user",
            "content": "Write an essay about the importance of higher education.",
        },
    ]
    outputs = llm.chat(conversation)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A code example can be found in `examples/offline_inference_chat.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py>`_.
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:

.. code-block:: python

    from vllm.entrypoints.chat_utils import load_chat_template

    # You can find a list of existing chat templates under `examples/`
    custom_template = load_chat_template(chat_template="<path_to_template>")
    print("Loaded chat template:", custom_template)

    # Continuing from the previous example's `llm` and `conversation`
    outputs = llm.chat(conversation, chat_template=custom_template)
Online Inference
----------------

Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.

Completions API
^^^^^^^^^^^^^^^
Our Completions API is similar to ``LLM.generate`` but only accepts text.
It is compatible with the `OpenAI Completions API <https://platform.openai.com/docs/api-reference/completions>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
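For instance, a minimal query using the official :code:`openai` Python client might look like the sketch below; the base URL, port, and placeholder API key assume a default local deployment:

.. code-block:: python

    from openai import OpenAI

    # Assumes a vLLM server is already running locally, e.g. started with:
    #   vllm serve facebook/opt-125m
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="facebook/opt-125m",
        prompt="Hello, my name is",
        max_tokens=32,
    )
    print(completion.choices[0].text)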
Chat API
^^^^^^^^

Our Chat API is similar to ``LLM.chat``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.
It is compatible with the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_chat_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py>`_.
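A corresponding sketch with the :code:`openai` client, again assuming a default local deployment serving an instruction-tuned model:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello"},
        ],
    )
    print(response.choices[0].message.content)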
99  docs/source/models/pooling_models.rst  (new file)

@@ -0,0 +1,99 @@
.. _pooling_models:

Pooling Models
==============

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface.
These models use a :class:`~vllm.model_executor.layers.Pooler` to aggregate the final hidden states of the input
before returning them.
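As a standalone illustration (not vLLM's actual :class:`~vllm.model_executor.layers.Pooler`), last-token pooling with normalization, the default for the embedding task described below, might look like:

.. code-block:: python

    import torch

    # Sketch of last-token pooling + normalization; vLLM's real Pooler also
    # handles batching, other pooling types, and configuration.
    def last_token_pool(hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_size); keep only the final position.
        pooled = hidden_states[-1]
        return pooled / pooled.norm()

    states = torch.randn(8, 64)
    print(last_token_pool(states).shape)  # torch.Size([64])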
.. note::

    We currently support pooling models primarily as a matter of convenience.
    As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features are not applicable to
    pooling models as they only work on the generation or decode stage, so performance may not improve as much.

Offline Inference
-----------------

The :class:`~vllm.LLM` class provides various methods for offline inference.
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model.
For pooling models, we support the following :code:`task` options:

- Embedding (:code:`"embed"` / :code:`"embedding"`)
- Classification (:code:`"classify"`)
- Sentence Pair Scoring (:code:`"score"`)
- Reward Modeling (:code:`"reward"`)

The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used:

- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.
When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models,
we attempt to override the default pooler based on their Sentence Transformers configuration file (:code:`modules.json`).

You can customize the model's pooling method via the :code:`override_pooler_config` option,
which takes priority over both the model's and the Sentence Transformers defaults.
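For example, here is a sketch of manually setting mean pooling, assuming :class:`PoolerConfig` is importable from :code:`vllm.config` in your version; the checkpoint is the one called out in the supported models notes as needing this override:

.. code-block:: python

    from vllm import LLM
    from vllm.config import PoolerConfig  # import location assumed; check your version

    # This checkpoint has an improperly defined Sentence Transformers config,
    # so mean pooling is set explicitly, overriding the defaults above.
    llm = LLM(
        model="ssmits/Qwen2-7B-Instruct-embed-base",
        task="embed",
        override_pooler_config=PoolerConfig(pooling_type="MEAN"),
    )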
``LLM.encode``
^^^^^^^^^^^^^^

The :class:`~vllm.LLM.encode` method is available to all pooling models in vLLM.
It returns the aggregated hidden states directly.
.. code-block:: python

    from vllm import LLM

    llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
    outputs = llm.encode("Hello, my name is")

    for output in outputs:
        embeddings = output.outputs.embedding
        print(f"Embeddings (size={len(embeddings)}): {embeddings!r}")

A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_.
``LLM.score``
^^^^^^^^^^^^^

The :class:`~vllm.LLM.score` method outputs similarity scores between sentence pairs.
It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__.
These types of models serve as rerankers between candidate query-document pairs in RAG systems.

.. note::

    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
    To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_.

You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/models/embedding/language/test_scoring.py>`_ as reference.
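A minimal sketch of scoring one query-document pair; the cross-encoder checkpoint is an illustrative choice, and the :code:`output.outputs.score` attribute is assumed from the pooling output types, so verify it against your version:

.. code-block:: python

    from vllm import LLM

    # Example cross-encoder checkpoint; any supported scoring model works.
    llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
    (output,) = llm.score("What is the capital of France?",
                          "The capital of France is Paris.")
    print(f"Score: {output.outputs.score}")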
Online Inference
----------------

Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference.
Please click on the above link for more details on how to launch the server.

Embeddings API
^^^^^^^^^^^^^^
Our Embeddings API is similar to ``LLM.encode``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`.

The text-only API is compatible with the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
so that you can use the OpenAI client to interact with it.
A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_.
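For the text-only case, a sketch using the :code:`openai` client (server address, port, and model name assume a default local deployment of the embedding model shown earlier):

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.embeddings.create(
        model="intfloat/e5-mistral-7b-instruct",
        input=["Hello, my name is"],
    )
    print(len(response.data[0].embedding))  # embedding dimensionality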
The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__
that incorporates the `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__,
so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it.

Score API
^^^^^^^^^

Our Score API is similar to ``LLM.score``.
Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it.
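As a rough sketch, a raw HTTP request could look like the following; the :code:`/score` endpoint path and the :code:`text_1`/:code:`text_2` field names are assumptions based on the server docs linked above, so verify them against your version:

.. code-block:: python

    import requests

    # Endpoint path and payload fields assumed from the linked server docs.
    response = requests.post(
        "http://localhost:8000/score",
        json={
            "model": "BAAI/bge-reranker-v2-m3",
            "text_1": "What is the capital of France?",
            "text_2": "The capital of France is Paris.",
        },
    )
    print(response.json())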
docs/source/models/supported_models.rst

@@ -3,11 +3,21 @@
 Supported Models
 ================
 
-vLLM supports a variety of generative and embedding models from `HuggingFace (HF) Transformers <https://huggingface.co/models>`_.
-This page lists the model architectures that are currently supported by vLLM.
+vLLM supports generative and pooling models across various tasks.
+If a model supports more than one task, you can set the task via the :code:`--task` argument.
+
+For each task, we list the model architectures that have been implemented in vLLM.
 Alongside each architecture, we include some popular models that use it.
 
-For other models, you can check the :code:`config.json` file inside the model repository.
+Loading a Model
+^^^^^^^^^^^^^^^
+
+HuggingFace Hub
++++++++++++++++
+
+By default, vLLM loads models from `HuggingFace (HF) Hub <https://huggingface.co/models>`_.
+
+To determine whether a given model is supported, you can check the :code:`config.json` file inside the HF repository.
 If the :code:`"architectures"` field contains a model architecture listed below, then it should be supported in theory.
 
 .. tip::
@@ -17,38 +27,57 @@ If the :code:`"architectures"` field contains a model architecture listed below,
 
     from vllm import LLM
 
-    llm = LLM(model=...)  # Name or path of your model
+    # For generative models (task=generate) only
+    llm = LLM(model=..., task="generate")  # Name or path of your model
     output = llm.generate("Hello, my name is")
     print(output)
 
-If vLLM successfully generates text, it indicates that your model is supported.
+    # For pooling models (task={embed,classify,reward}) only
+    llm = LLM(model=..., task="embed")  # Name or path of your model
+    output = llm.encode("Hello, my name is")
+    print(output)
+
+If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
 
 Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
 for instructions on how to implement your model in vLLM.
 Alternatively, you can `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ to request vLLM support.
 
-.. note::
-    To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:
+ModelScope
+++++++++++
 
-    .. code-block:: shell
+To use models from `ModelScope <https://www.modelscope.cn>`_ instead of HuggingFace Hub, set an environment variable:
 
-        $ export VLLM_USE_MODELSCOPE=True
+.. code-block:: shell
 
-    And use with :code:`trust_remote_code=True`.
+    $ export VLLM_USE_MODELSCOPE=True
 
-    .. code-block:: python
+And use with :code:`trust_remote_code=True`.
 
-        from vllm import LLM
+.. code-block:: python
 
-        llm = LLM(model=..., revision=..., trust_remote_code=True)  # Name or path of your model
-        output = llm.generate("Hello, my name is")
-        print(output)
+    from vllm import LLM
 
-Text-only Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^
+    llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
 
-Text Generation
----------------
+    # For generative models (task=generate) only
+    output = llm.generate("Hello, my name is")
+    print(output)
+
+    # For pooling models (task={embed,classify,reward}) only
+    output = llm.encode("Hello, my name is")
+    print(output)
+
+List of Text-only Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Generative Models
++++++++++++++++++
+
+See :ref:`this page <generative_models>` for more information on how to use generative models.
+
+Text Generation (``--task generate``)
+-------------------------------------
 
 .. list-table::
   :widths: 25 25 50 5 5
@@ -328,8 +357,24 @@ Text Generation
 .. note::
     Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
 
-Text Embedding
---------------
+Pooling Models
+++++++++++++++
+
+See :ref:`this page <pooling_models>` for more information on how to use pooling models.
+
+.. important::
+    Since some model architectures support both generative and pooling tasks,
+    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+
+Text Embedding (``--task embed``)
+---------------------------------
+
+Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
+
+.. note::
+    To get the best results, you should use pooling models that are specifically trained as such.
+
+    The following table lists those that are tested in vLLM.
 
 .. list-table::
   :widths: 25 25 50 5 5
@@ -371,13 +416,6 @@ Text Embedding
     -
     -
 
-.. important::
-    Some model architectures support both generation and embedding tasks.
-    In this case, you have to pass :code:`--task embedding` to run the model in embedding mode.
-
 .. tip::
     You can override the model's pooling method by passing :code:`--override-pooler-config`.
 
 .. note::
     :code:`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
     You should manually set mean pooling by passing :code:`--override-pooler-config '{"pooling_type": "MEAN"}'`.
@@ -389,8 +427,8 @@ Text Embedding
 On the other hand, its 1.5B variant (:code:`Alibaba-NLP/gte-Qwen2-1.5B-instruct`) uses causal attention
 despite being described otherwise on its model card.
 
-Reward Modeling
----------------
+Reward Modeling (``--task reward``)
+-----------------------------------
 
 .. list-table::
   :widths: 25 25 50 5 5
@@ -416,11 +454,8 @@ Reward Modeling
 For process-supervised reward models such as :code:`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
 e.g.: :code:`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
 
-.. note::
-    As an interim measure, these models are supported in both offline and online inference via Embeddings API.
-
-Classification
----------------
+Classification (``--task classify``)
+------------------------------------
 
 .. list-table::
   :widths: 25 25 50 5 5
@@ -437,11 +472,8 @@ Classification
     - ✅︎
     - ✅︎
 
-.. note::
-    As an interim measure, these models are supported in both offline and online inference via Embeddings API.
-
-Sentence Pair Scoring
----------------------
+Sentence Pair Scoring (``--task score``)
+----------------------------------------
 
 .. list-table::
   :widths: 25 25 50 5 5
@@ -468,13 +500,10 @@ Sentence Pair Scoring
     -
     -
 
-.. note::
-    These models are supported in both offline and online inference via Score API.
-
 .. _supported_mm_models:
 
-Multimodal Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+List of Multimodal Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 The following modalities are supported depending on the model:
@@ -491,8 +520,15 @@ On the other hand, modalities separated by :code:`/` are mutually exclusive.
 
 - e.g.: :code:`T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
 
-Text Generation
----------------
-
 See :ref:`this page <multimodal_inputs>` on how to pass multi-modal inputs to the model.
 
+Generative Models
++++++++++++++++++
+
+See :ref:`this page <generative_models>` for more information on how to use generative models.
+
+Text Generation (``--task generate``)
+-------------------------------------
+
 .. list-table::
   :widths: 25 25 15 20 5 5 5
@@ -696,8 +732,24 @@ Text Generation
   The official :code:`openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
   For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
 
-Multimodal Embedding
---------------------
+Pooling Models
+++++++++++++++
+
+See :ref:`this page <pooling_models>` for more information on how to use pooling models.
+
+.. important::
+    Since some model architectures support both generative and pooling tasks,
+    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+
+Text Embedding (``--task embed``)
+---------------------------------
+
+Any text generation model can be converted into an embedding model by passing :code:`--task embed`.
+
+.. note::
+    To get the best results, you should use pooling models that are specifically trained as such.
+
+    The following table lists those that are tested in vLLM.
 
 .. list-table::
   :widths: 25 25 15 25 5 5
@@ -728,12 +780,7 @@ Multimodal Embedding
     -
     - ✅︎
 
-.. important::
-    Some model architectures support both generation and embedding tasks.
-    In this case, you have to pass :code:`--task embedding` to run the model in embedding mode.
-
 .. tip::
     You can override the model's pooling method by passing :code:`--override-pooler-config`.
 
 ----
 
 Model Support Policy
 =====================
@@ -39,13 +39,13 @@ Feature x Feature
     - :abbr:`prmpt adptr (Prompt Adapter)`
     - :ref:`SD <spec_decode>`
     - CUDA graph
-    - :abbr:`emd (Embedding Models)`
+    - :abbr:`pooling (Pooling Models)`
     - :abbr:`enc-dec (Encoder-Decoder Models)`
     - :abbr:`logP (Logprobs)`
     - :abbr:`prmpt logP (Prompt Logprobs)`
     - :abbr:`async output (Async Output Processing)`
     - multi-step
-    - :abbr:`mm (Multimodal)`
+    - :abbr:`mm (Multimodal Inputs)`
     - best-of
     - beam-search
     - :abbr:`guided dec (Guided Decoding)`
@@ -151,7 +151,7 @@ Feature x Feature
     -
     -
     -
-  * - :abbr:`emd (Embedding Models)`
+  * - :abbr:`pooling (Pooling Models)`
     - ✗
     - ✗
     - ✗
@@ -253,7 +253,7 @@ Feature x Feature
     -
     -
     -
-  * - :abbr:`mm (Multimodal)`
+  * - :abbr:`mm (Multimodal Inputs)`
     - ✅
     - `✗ <https://github.com/vllm-project/vllm/pull/8348>`__
     - `✗ <https://github.com/vllm-project/vllm/pull/7199>`__
@@ -386,7 +386,7 @@ Feature x Hardware
     - ✅
     - ✗
     - ✅
-  * - :abbr:`emd (Embedding Models)`
+  * - :abbr:`pooling (Pooling Models)`
     - ✅
     - ✅
     - ✅
@@ -402,7 +402,7 @@ Feature x Hardware
     - ✅
     - ✅
     - ✗
-  * - :abbr:`mm (Multimodal)`
+  * - :abbr:`mm (Multimodal Inputs)`
     - ✅
     - ✅
     - ✅