Remove unnecessary explicit title anchors and use relative links instead (#20620)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@ -48,4 +48,4 @@ For more information, check out the following:
|
||||
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
|
||||
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
|
||||
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
|
||||
- [vLLM Meetups][meetups]
|
||||
- [vLLM Meetups](community/meetups.md)
|
||||
|
||||
@ -64,7 +64,7 @@ vLLM provides experimental support for multi-modal models through the [vllm.mult
|
||||
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
|
||||
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
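As a minimal sketch of what this looks like with the offline `LLM` API (the model name, prompt template, and image path below are only illustrative examples, not fixed requirements):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Any vision-language model supported by vLLM can be used here;
# LLaVA-1.5 is just an example, and its chat template expects an <image> token.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # placeholder path to a local image
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},  # keyed by modality name
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```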
|
||||
|
||||
Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal].
|
||||
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
|
||||
|
||||
- [vllm.multimodal.MULTIMODAL_REGISTRY][]
|
||||
|
||||
|
||||
@ -1,6 +1,5 @@
|
||||
---
|
||||
title: Contact Us
|
||||
---
|
||||
[](){ #contactus }
|
||||
|
||||
--8<-- "README.md:contact-us"
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Meetups
|
||||
---
|
||||
[](){ #meetups }
|
||||
|
||||
We host regular meetups in the San Francisco Bay Area every two months, where we share project updates from the vLLM team and invite guest speakers from industry to share their experience and insights. You can find the materials from our previous meetups below:
|
||||
|
||||
|
||||
@ -33,7 +33,7 @@ Quantized models take less memory at the cost of lower precision.
|
||||
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
|
||||
and used directly without extra configuration.
|
||||
|
||||
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
|
||||
Dynamic quantization is also supported via the `quantization` option -- see [here](../features/quantization/README.md) for more details.
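As a rough sketch, dynamic quantization is requested when constructing the engine; `fp8` is used purely as an example value and assumes hardware that supports it:

```python
from vllm import LLM

# Quantize the model weights on the fly at load time.
# The set of valid `quantization` values depends on your hardware and vLLM build.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```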
|
||||
|
||||
## Context length and batch size
|
||||
|
||||
|
||||
@ -1,12 +1,11 @@
|
||||
---
|
||||
title: Engine Arguments
|
||||
---
|
||||
[](){ #engine-args }
|
||||
|
||||
Engine arguments control the behavior of the vLLM engine.
|
||||
|
||||
- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
|
||||
- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`.
|
||||
- For [offline inference](../serving/offline_inference.md), they are part of the arguments to [LLM][vllm.LLM] class.
|
||||
- For [online serving](../serving/openai_compatible_server.md), they are part of the arguments to `vllm serve`.
|
||||
|
||||
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
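For offline inference, engine arguments are simply keyword arguments to `LLM`; the values below are arbitrary examples, and the same options are exposed as `--` flags on `vllm serve`:

```python
from vllm import LLM

# Each keyword argument maps to an engine argument of the same name.
llm = LLM(
    model="facebook/opt-125m",      # example model
    max_model_len=2048,             # cap the context length
    gpu_memory_utilization=0.8,     # fraction of GPU memory to reserve
    tensor_parallel_size=1,         # number of GPUs for tensor parallelism
)
```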
|
||||
|
||||
|
||||
@ -20,4 +20,4 @@ model = LLM(
|
||||
)
|
||||
```
|
||||
|
||||
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
|
||||
Our [list of supported models](../models/supported_models.md) shows the model architectures that are recognized by vLLM.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Server Arguments
|
||||
---
|
||||
[](){ #serve-args }
|
||||
|
||||
The `vllm serve` command is used to launch the OpenAI-compatible server.
|
||||
|
||||
@ -13,7 +12,7 @@ To see the available CLI arguments, run `vllm serve --help`!
|
||||
## Configuration file
|
||||
|
||||
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
|
||||
The argument names must be the long form of those outlined [above][serve-args].
|
||||
The argument names must be the long form of those outlined [above](serve_args.md).
|
||||
|
||||
For example:
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Benchmark Suites
|
||||
---
|
||||
[](){ #benchmarks }
|
||||
|
||||
vLLM contains two sets of benchmarks:
|
||||
|
||||
|
||||
@ -1,7 +1,7 @@
|
||||
# Dockerfile
|
||||
|
||||
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
|
||||
More information about deploying with Docker can be found [here][deployment-docker].
|
||||
More information about deploying with Docker can be found [here](../../deployment/docker.md).
|
||||
|
||||
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
|
||||
|
||||
|
||||
@ -1,12 +1,11 @@
|
||||
---
|
||||
title: Summary
|
||||
---
|
||||
[](){ #new-model }
|
||||
|
||||
!!! important
|
||||
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
|
||||
|
||||
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features][compatibility-matrix] to optimize their performance.
|
||||
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/compatibility_matrix.md) to optimize their performance.
|
||||
|
||||
The complexity of integrating a model into vLLM depends heavily on the model's architecture.
|
||||
The process is much simpler if the model shares a similar architecture with an existing model in vLLM.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Basic Model
|
||||
---
|
||||
[](){ #new-model-basic }
|
||||
|
||||
This guide walks you through the steps to implement a basic vLLM model.
|
||||
|
||||
@ -108,7 +107,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
|
||||
|
||||
## 5. Register your model
|
||||
|
||||
See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.
|
||||
See [this page](registration.md) for instructions on how to register your new model to be used by vLLM.
|
||||
|
||||
## Frequently Asked Questions
|
||||
|
||||
|
||||
@ -1,13 +1,12 @@
|
||||
---
|
||||
title: Multi-Modal Support
|
||||
---
|
||||
[](){ #supports-multimodal }
|
||||
|
||||
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
|
||||
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
|
||||
|
||||
## 1. Update the base vLLM model
|
||||
|
||||
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
|
||||
It is assumed that you have already implemented the model in vLLM according to [these steps](basic.md).
|
||||
Further update the model as follows:
|
||||
|
||||
- Implement [get_placeholder_str][vllm.model_executor.models.interfaces.SupportsMultiModal.get_placeholder_str] to define the placeholder string which is used to represent the multi-modal item in the text prompt. This should be consistent with the chat template of the model.
|
||||
@ -483,7 +482,7 @@ Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.proce
|
||||
to fill in the missing details about HF processing.
|
||||
|
||||
!!! info
|
||||
[Multi-Modal Data Processing][mm-processing]
|
||||
[Multi-Modal Data Processing](../../design/mm_processing.md)
|
||||
|
||||
### Multi-modal fields
|
||||
|
||||
@ -846,7 +845,7 @@ Examples:
|
||||
|
||||
### Handling prompt updates unrelated to multi-modal data
|
||||
|
||||
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing].
|
||||
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](../../design/mm_processing.md).
|
||||
|
||||
Examples:
|
||||
|
||||
|
||||
@ -1,10 +1,9 @@
|
||||
---
|
||||
title: Registering a Model
|
||||
---
|
||||
[](){ #new-model-registration }
|
||||
|
||||
vLLM relies on a model registry to determine how to run each model.
|
||||
A list of pre-registered architectures can be found [here][supported-models].
|
||||
A list of pre-registered architectures can be found [here](../../models/supported_models.md).
|
||||
|
||||
If your model is not on this list, you must register it to vLLM.
|
||||
This page provides detailed instructions on how to do so.
|
||||
@ -14,16 +13,16 @@ This page provides detailed instructions on how to do so.
|
||||
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
|
||||
This gives you the ability to modify the codebase and test your model.
|
||||
|
||||
After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory.
|
||||
After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
|
||||
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
|
||||
Finally, update our [list of supported models][supported-models] to promote your model!
|
||||
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
|
||||
|
||||
!!! important
|
||||
The list of models in each section should be maintained in alphabetical order.
|
||||
|
||||
## Out-of-tree models
|
||||
|
||||
You can load an external model [using a plugin][plugin-system] without modifying the vLLM codebase.
|
||||
You can load an external model [using a plugin](../../design/plugin_system.md) without modifying the vLLM codebase.
|
||||
|
||||
To register the model, use the following code:
|
||||
|
||||
@ -51,4 +50,4 @@ def register():
|
||||
|
||||
!!! important
|
||||
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
|
||||
Read more about that [here][supports-multimodal].
|
||||
Read more about that [here](multimodal.md).
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Unit Testing
|
||||
---
|
||||
[](){ #new-model-tests }
|
||||
|
||||
This page explains how to write unit tests to verify the implementation of your model.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Using Docker
|
||||
---
|
||||
[](){ #deployment-docker }
|
||||
|
||||
[](){ #deployment-docker-pre-built-image }
|
||||
|
||||
@ -32,7 +31,7 @@ podman run --gpus all \
|
||||
--model mistralai/Mistral-7B-v0.1
|
||||
```
|
||||
|
||||
You can add any other [engine-args][engine-args] you need after the image tag (`vllm/vllm-openai:latest`).
|
||||
You can add any other [engine-args](../configuration/engine_args.md) you need after the image tag (`vllm/vllm-openai:latest`).
|
||||
|
||||
!!! note
|
||||
You can either use the `ipc=host` flag or `--shm-size` flag to allow the
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Anything LLM
|
||||
---
|
||||
[](){ #deployment-anything-llm }
|
||||
|
||||
[Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: AutoGen
|
||||
---
|
||||
[](){ #deployment-autogen }
|
||||
|
||||
[AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: BentoML
|
||||
---
|
||||
[](){ #deployment-bentoml }
|
||||
|
||||
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Cerebrium
|
||||
---
|
||||
[](){ #deployment-cerebrium }
|
||||
|
||||
<p align="center">
|
||||
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Chatbox
|
||||
---
|
||||
[](){ #deployment-chatbox }
|
||||
|
||||
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, and Linux.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Dify
|
||||
---
|
||||
[](){ #deployment-dify }
|
||||
|
||||
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: dstack
|
||||
---
|
||||
[](){ #deployment-dstack }
|
||||
|
||||
<p align="center">
|
||||
<img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Haystack
|
||||
---
|
||||
[](){ #deployment-haystack }
|
||||
|
||||
# Haystack
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Helm
|
||||
---
|
||||
[](){ #deployment-helm }
|
||||
|
||||
A Helm chart to deploy vLLM on Kubernetes.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: LiteLLM
|
||||
---
|
||||
[](){ #deployment-litellm }
|
||||
|
||||
[LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format (Bedrock, Hugging Face, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.).
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Lobe Chat
|
||||
---
|
||||
[](){ #deployment-lobe-chat }
|
||||
|
||||
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: LWS
|
||||
---
|
||||
[](){ #deployment-lws }
|
||||
|
||||
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
|
||||
A major use case is for multi-host/multi-node distributed inference.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Modal
|
||||
---
|
||||
[](){ #deployment-modal }
|
||||
|
||||
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Open WebUI
|
||||
---
|
||||
[](){ #deployment-open-webui }
|
||||
|
||||
1. Install [Docker](https://docs.docker.com/engine/install/)
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Retrieval-Augmented Generation
|
||||
---
|
||||
[](){ #deployment-retrieval-augmented-generation }
|
||||
|
||||
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: SkyPilot
|
||||
---
|
||||
[](){ #deployment-skypilot }
|
||||
|
||||
<p align="center">
|
||||
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Streamlit
|
||||
---
|
||||
[](){ #deployment-streamlit }
|
||||
|
||||
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
|
||||
|
||||
|
||||
@ -1,6 +1,5 @@
|
||||
---
|
||||
title: NVIDIA Triton
|
||||
---
|
||||
[](){ #deployment-triton }
|
||||
|
||||
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: KServe
|
||||
---
|
||||
[](){ #deployment-kserve }
|
||||
|
||||
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: KubeAI
|
||||
---
|
||||
[](){ #deployment-kubeai }
|
||||
|
||||
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Llama Stack
|
||||
---
|
||||
[](){ #deployment-llamastack }
|
||||
|
||||
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: llmaz
|
||||
---
|
||||
[](){ #deployment-llmaz }
|
||||
|
||||
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Production stack
|
||||
---
|
||||
[](){ #deployment-production-stack }
|
||||
|
||||
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Using Kubernetes
|
||||
---
|
||||
[](){ #deployment-k8s }
|
||||
|
||||
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Using Nginx
|
||||
---
|
||||
[](){ #nginxloadbalancer }
|
||||
|
||||
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Architecture Overview
|
||||
---
|
||||
[](){ #arch-overview }
|
||||
|
||||
This document provides an overview of the vLLM architecture.
|
||||
|
||||
@ -74,7 +73,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
|
||||
|
||||
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
|
||||
|
||||
More details on the API server can be found in the [OpenAI-Compatible Server][serving-openai-compatible-server] document.
|
||||
More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
|
||||
|
||||
## LLM Engine
|
||||
|
||||
@ -132,7 +131,7 @@ input tensors and capturing cudagraphs.
|
||||
## Model
|
||||
|
||||
Every model runner object has one model object, which is the actual
|
||||
`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
|
||||
`torch.nn.Module` instance. See [huggingface_integration](huggingface_integration.md) for how various
|
||||
configurations affect the class we ultimately get.
|
||||
|
||||
## Class Hierarchy
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Automatic Prefix Caching
|
||||
---
|
||||
[](){ #design-automatic-prefix-caching }
|
||||
|
||||
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Integration with HuggingFace
|
||||
---
|
||||
[](){ #huggingface-integration }
|
||||
|
||||
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: vLLM Paged Attention
|
||||
---
|
||||
[](){ #design-paged-attention }
|
||||
|
||||
Currently, vLLM utilizes its own implementation of a multi-head query
|
||||
attention kernel (`csrc/attention/attention_kernels.cu`).
|
||||
|
||||
@ -1,9 +1,8 @@
|
||||
---
|
||||
title: Multi-Modal Data Processing
|
||||
---
|
||||
[](){ #mm-processing }
|
||||
|
||||
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
|
||||
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
|
||||
|
||||
Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
|
||||
|
||||
|
||||
@ -1,13 +1,12 @@
|
||||
---
|
||||
title: vLLM's Plugin System
|
||||
---
|
||||
[](){ #plugin-system }
|
||||
|
||||
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
|
||||
|
||||
## How Plugins Work in vLLM
|
||||
|
||||
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
|
||||
Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
|
||||
|
||||
## How vLLM Discovers Plugins
|
||||
|
||||
|
||||
@ -1,14 +1,13 @@
|
||||
---
|
||||
title: Automatic Prefix Caching
|
||||
---
|
||||
[](){ #automatic-prefix-caching }
|
||||
|
||||
## Introduction
|
||||
|
||||
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
|
||||
|
||||
!!! note
|
||||
Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
|
||||
Technical details on how vLLM implements APC can be found [here](../design/automatic_prefix_caching.md).
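Concretely, a minimal sketch of what enabling APC looks like with the offline API (the model name is just an example; the next section covers this in more detail):

```python
from vllm import LLM

# Turn on automatic prefix caching so queries sharing a prefix reuse its KV cache.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
```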
|
||||
|
||||
## Enabling APC in vLLM
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Compatibility Matrix
|
||||
---
|
||||
[](){ #compatibility-matrix }
|
||||
|
||||
The tables below show mutually exclusive features and the support on some hardware.
|
||||
|
||||
@ -37,13 +36,13 @@ th:not(:first-child) {
|
||||
}
|
||||
</style>
|
||||
|
||||
| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|
||||
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|
||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
||||
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
|
||||
| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
|
||||
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
|
||||
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
|
||||
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
|
||||
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
|
||||
| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
|
||||
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
|
||||
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
|
||||
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
|
||||
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [❌](gh-issue:7366) | ❌ | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
|
||||
@ -62,10 +61,10 @@ th:not(:first-child) {
|
||||
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU |
|
||||
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----|
|
||||
| [CP][chunked-prefill] | [❌](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [APC][automatic-prefix-caching] | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [APC](automatic_prefix_caching.md) | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:8475) | ✅ | ❌ |
|
||||
| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
|
||||
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ |
|
||||
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Disaggregated Prefilling (experimental)
|
||||
---
|
||||
[](){ #disagg-prefill }
|
||||
|
||||
This page introduces the disaggregated prefilling feature in vLLM.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: LoRA Adapters
|
||||
---
|
||||
[](){ #lora-adapter }
|
||||
|
||||
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Multimodal Inputs
|
||||
---
|
||||
[](){ #multimodal-inputs }
|
||||
|
||||
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Quantization
|
||||
---
|
||||
[](){ #quantization-index }
|
||||
|
||||
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: AutoAWQ
|
||||
---
|
||||
[](){ #auto-awq }
|
||||
|
||||
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
|
||||
Quantization reduces the model's precision from BF16/FP16 to INT4, which effectively reduces the total model memory footprint.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: BitBLAS
|
||||
---
|
||||
[](){ #bitblas }
|
||||
|
||||
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: BitsAndBytes
|
||||
---
|
||||
[](){ #bits-and-bytes }
|
||||
|
||||
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
|
||||
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: FP8 W8A8
|
||||
---
|
||||
[](){ #fp8 }
|
||||
|
||||
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
|
||||
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: GGUF
|
||||
---
|
||||
[](){ #gguf }
|
||||
|
||||
!!! warning
|
||||
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce the memory footprint. If you encounter any issues, please report them to the vLLM team.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: GPTQModel
|
||||
---
|
||||
[](){ #gptqmodel }
|
||||
|
||||
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: INT4 W4A16
|
||||
---
|
||||
[](){ #int4 }
|
||||
|
||||
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: INT8 W8A8
|
||||
---
|
||||
[](){ #int8 }
|
||||
|
||||
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
|
||||
This quantization method is particularly useful for reducing model size while maintaining good performance.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Quantized KV Cache
|
||||
---
|
||||
[](){ #quantized-kvcache }
|
||||
|
||||
## FP8 KV Cache
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: AMD Quark
|
||||
---
|
||||
[](){ #quark }
|
||||
|
||||
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
|
||||
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Supported Hardware
|
||||
---
|
||||
[](){ #quantization-supported-hardware }
|
||||
|
||||
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Reasoning Outputs
|
||||
---
|
||||
[](){ #reasoning-outputs }
|
||||
|
||||
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Speculative Decoding
|
||||
---
|
||||
[](){ #spec-decode }
|
||||
|
||||
!!! warning
|
||||
Please note that speculative decoding in vLLM is not yet optimized and does
|
||||
@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas:
|
||||
3. **vLLM Logprob Stability**
|
||||
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
|
||||
same request across runs. For more details, see the FAQ section
|
||||
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
|
||||
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
|
||||
|
||||
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
|
||||
can occur due to following factors:
|
||||
@ -278,7 +277,7 @@ can occur due to following factors:
|
||||
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
|
||||
due to non-deterministic behavior in batched operations or numerical instability.
|
||||
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
|
||||
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
|
||||
|
||||
## Resources for vLLM contributors
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Structured Outputs
|
||||
---
|
||||
[](){ #structured-outputs }
|
||||
|
||||
vLLM supports the generation of structured outputs using
|
||||
[xgrammar](https://github.com/mlc-ai/xgrammar) or
|
||||
@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters:
|
||||
- `guided_grammar`: the output will follow the context free grammar.
|
||||
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
|
||||
|
||||
You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
|
||||
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
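As a brief sketch of how these parameters are passed from a client, the snippet below uses the official OpenAI Python client against a locally running vLLM server and supplies `guided_choice` through `extra_body`; the base URL and model name are placeholders for your own deployment:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible vLLM server is already running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder: use the model you are serving
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```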
|
||||
|
||||
Structured outputs are supported by default in the OpenAI-Compatible Server. You
|
||||
may choose to specify the backend to use by setting the
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Installation
|
||||
---
|
||||
[](){ #installation-index }
|
||||
|
||||
vLLM supports the following hardware platforms:
|
||||
|
||||
|
||||
@ -109,8 +109,8 @@ docker run \
|
||||
|
||||
### Supported features
|
||||
|
||||
- [Offline inference][offline-inference]
|
||||
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
|
||||
- [Offline inference](../../serving/offline_inference.md)
|
||||
- Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
|
||||
- HPU autodetection - no need to manually select device within vLLM
|
||||
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
|
||||
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Quickstart
|
||||
---
|
||||
[](){ #quickstart }
|
||||
|
||||
This guide will help you quickly get started with vLLM to perform:
|
||||
|
||||
@ -43,7 +42,7 @@ uv pip install vllm --torch-backend=auto
|
||||
```
|
||||
|
||||
!!! note
|
||||
For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM.
|
||||
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
|
||||
|
||||
[](){ #quickstart-offline }
|
||||
|
||||
@ -77,7 +76,7 @@ prompts = [
|
||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||
```
|
||||
|
||||
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
|
||||
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
|
||||
|
||||
```python
|
||||
llm = LLM(model="facebook/opt-125m")
|
||||
|
||||
@ -1,19 +1,19 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||
import itertools
|
||||
import logging
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Literal
|
||||
|
||||
import regex as re
|
||||
|
||||
logger = logging.getLogger("mkdocs")
|
||||
|
||||
ROOT_DIR = Path(__file__).parent.parent.parent.parent
|
||||
ROOT_DIR_RELATIVE = '../../../../..'
|
||||
EXAMPLE_DIR = ROOT_DIR / "examples"
|
||||
EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
|
||||
print(ROOT_DIR.resolve())
|
||||
print(EXAMPLE_DIR.resolve())
|
||||
print(EXAMPLE_DOC_DIR.resolve())
|
||||
|
||||
|
||||
def fix_case(text: str) -> str:
|
||||
@ -135,6 +135,11 @@ class Example:
|
||||
|
||||
|
||||
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
|
||||
logger.info("Generating example documentation")
|
||||
logger.debug("Root directory: %s", ROOT_DIR.resolve())
|
||||
logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
|
||||
logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
|
||||
|
||||
# Create the EXAMPLE_DOC_DIR if it doesn't exist
|
||||
if not EXAMPLE_DOC_DIR.exists():
|
||||
EXAMPLE_DOC_DIR.mkdir(parents=True)
|
||||
@ -156,7 +161,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
|
||||
for example in sorted(examples, key=lambda e: e.path.stem):
|
||||
example_name = f"{example.path.stem}.md"
|
||||
doc_path = EXAMPLE_DOC_DIR / example.category / example_name
|
||||
print(doc_path)
|
||||
logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
|
||||
if not doc_path.parent.exists():
|
||||
doc_path.parent.mkdir(parents=True)
|
||||
with open(doc_path, "w+") as f:
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Loading models with Run:ai Model Streamer
|
||||
---
|
||||
[](){ #runai-model-streamer }
|
||||
|
||||
Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
|
||||
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Loading models with CoreWeave's Tensorizer
|
||||
---
|
||||
[](){ #tensorizer }
|
||||
|
||||
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
|
||||
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Generative Models
|
||||
---
|
||||
[](){ #generative-models }
|
||||
|
||||
vLLM provides first-class support for generative models, which cover the majority of LLMs.
|
||||
|
||||
@ -134,7 +133,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
|
||||
|
||||
## Online Serving
|
||||
|
||||
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
|
||||
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
|
||||
|
||||
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
|
||||
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
|
||||
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: TPU
|
||||
---
|
||||
[](){ #tpu-supported-models }
|
||||
|
||||
# TPU Supported Models
|
||||
## Text-only Language Models
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Pooling Models
|
||||
---
|
||||
[](){ #pooling-models }
|
||||
|
||||
vLLM also supports pooling models, including embedding, reranking and reward models.
|
||||
|
||||
@ -11,7 +10,7 @@ before returning them.
|
||||
|
||||
!!! note
|
||||
We currently support pooling models primarily as a matter of convenience.
|
||||
As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to
|
||||
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
|
||||
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
|
||||
|
||||
For pooling models, we support the following `--task` options.
|
||||
@ -113,10 +112,10 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor
|
||||
|
||||
## Online Serving
|
||||
|
||||
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
|
||||
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
|
||||
|
||||
- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
|
||||
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs][multimodal-inputs] for embedding models.
|
||||
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
|
||||
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
|
||||
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Supported Models
|
||||
---
|
||||
[](){ #supported-models }
|
||||
|
||||
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
|
||||
If a model supports more than one task, you can set the task via the `--task` argument.
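For the offline API, a minimal sketch of selecting a task looks like the following (the model name is just one example of an embedding-capable model; online serving uses the equivalent `--task` flag):

```python
from vllm import LLM

# Run an embedding-capable model with the "embed" task.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")

(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))  # embedding dimensionality
```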
|
||||
@ -34,7 +33,7 @@ llm.apply_model(lambda model: print(type(model)))
|
||||
If it is `TransformersForCausalLM` then it means it's based on Transformers!
|
||||
|
||||
!!! tip
|
||||
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server].
|
||||
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference](../serving/offline_inference.md) or `--model-impl transformers` for the [openai-compatible-server](../serving/openai_compatible_server.md).
|
||||
|
||||
!!! note
|
||||
vLLM may not fully optimise the Transformers implementation, so you may see degraded performance when comparing a native model to a Transformers model in vLLM.
|
||||
@ -53,8 +52,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:
|
||||
|
||||
If the compatible model is:
|
||||
|
||||
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
|
||||
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server].
|
||||
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
|
||||
- in a local directory, simply pass the directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
|
||||
|
||||
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
|
||||
|
||||
@ -171,7 +170,7 @@ The [Transformers backend][transformers-backend] enables you to run models direc
|
||||
|
||||
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
|
||||
|
||||
Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM.
|
||||
Otherwise, please refer to [Adding a New Model](../contributing/model/README.md) for instructions on how to implement your model in vLLM.
|
||||
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
|
||||
|
||||
#### Download a model
|
||||
@ -308,13 +307,13 @@ print(output)
|
||||
|
||||
### Generative Models
|
||||
|
||||
See [this page][generative-models] for more information on how to use generative models.
|
||||
See [this page](generative_models.md) for more information on how to use generative models.
|
||||
|
||||
#### Text Generation
|
||||
|
||||
Specified using `--task generate`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
|
||||
@ -412,7 +411,7 @@ See [this page](./pooling_models.md) for more information on how to use pooling
|
||||
|
||||
Specified using `--task embed`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
|
||||
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
|
||||
@ -448,7 +447,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
|
||||
|
||||
Specified using `--task reward`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
|
||||
@ -466,7 +465,7 @@ If your model is not in the above list, we will try to automatically convert the
|
||||
|
||||
Specified using `--task classify`.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
|
||||
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
|
||||
@ -527,7 +526,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
|
||||
|
||||
- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
|
||||
|
||||
See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model.
|
||||
See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
|
||||
|
||||
!!! important
|
||||
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
|
||||
@ -557,13 +556,13 @@ See [this page][multimodal-inputs] on how to pass multi-modal inputs to the mode
|
||||
|
||||
### Generative Models
|
||||
|
||||
See [this page][generative-models] for more information on how to use generative models.
|
||||
See [this page](generative_models.md) for more information on how to use generative models.
|
||||
|
||||
#### Text Generation
|
||||
|
||||
Specified using `--task generate`.
|
||||
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
|
||||
| `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | | ✅︎ | ✅︎ |
|
||||
@ -685,7 +684,7 @@ Specified using `--task transcription`.
|
||||
|
||||
Speech2Text models trained specifically for Automatic Speech Recognition.
|
||||
|
||||
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `WhisperForConditionalGeneration` | Whisper | `openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc. | | | |
|
||||
|
||||
@ -708,7 +707,7 @@ Any text generation model can be converted into an embedding model by passing `-
|
||||
|
||||
The following table lists those that are tested in vLLM.
|
||||
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|
||||
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|
||||
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
|
||||
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
|
||||
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Distributed Inference and Serving
|
||||
---
|
||||
[](){ #distributed-serving }
|
||||
|
||||
## How to decide the distributed inference strategy?
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: LangChain
|
||||
---
|
||||
[](){ #serving-langchain }
|
||||
|
||||
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: LlamaIndex
|
||||
---
|
||||
[](){ #serving-llamaindex }
|
||||
|
||||
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
|
||||
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Offline Inference
|
||||
---
|
||||
[](){ #offline-inference }
|
||||
|
||||
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
|
||||
|
||||
@ -18,8 +17,8 @@ llm = LLM(model="facebook/opt-125m")
|
||||
After initializing the `LLM` instance, use the available APIs to perform model inference.
|
||||
The available APIs depend on the model type:
|
||||
|
||||
- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
|
||||
- [Pooling models][pooling-models] output their hidden states directly.
|
||||
- [Generative models](../models/generative_models.md) output logprobs which are sampled from to obtain the final output text.
|
||||
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
|
||||
|
||||
!!! info
|
||||
[API Reference][offline-inference-api]
|
||||
|
||||
@ -1,11 +1,10 @@
|
||||
---
|
||||
title: OpenAI-Compatible Server
|
||||
---
|
||||
[](){ #serving-openai-compatible-server }
|
||||
|
||||
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
|
||||
|
||||
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
|
||||
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`](../configuration/serve_args.md) command. (You can also use our [Docker](../deployment/docker.md) image.)
|
||||
|
||||
```bash
|
||||
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
|
||||
@ -208,7 +207,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
|
||||
|
||||
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
|
||||
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
|
||||
see our [Multimodal Inputs][multimodal-inputs] guide for more information.
|
||||
see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information.
|
||||
- *Note: `image_url.detail` parameter is not supported.*
|
||||
|
||||
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Frequently Asked Questions
|
||||
---
|
||||
[](){ #faq }
|
||||
|
||||
> Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
|
||||
@ -12,7 +11,7 @@ A: Assuming that you're referring to using OpenAI compatible server to serve mul
|
||||
> Q: Which model to use for offline inference embedding?
|
||||
|
||||
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
|
||||
more are listed [here][supported-models].
|
||||
more are listed [here](../models/supported_models.md).
|
||||
|
||||
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
|
||||
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
|
||||
|
||||
@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
|
||||
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
|
||||
OpenAI compatible API server.
|
||||
|
||||
You can start the server using Python, or using [Docker][deployment-docker]:
|
||||
You can start the server using Python, or using [Docker](../deployment/docker.md):
|
||||
|
||||
```bash
|
||||
vllm serve unsloth/Llama-3.2-1B-Instruct
|
||||
|
||||
@ -1,7 +1,6 @@
|
||||
---
|
||||
title: Troubleshooting
|
||||
---
|
||||
[](){ #troubleshooting }
|
||||
|
||||
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||
|
||||
@ -267,7 +266,7 @@ or:
|
||||
ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
|
||||
```
|
||||
|
||||
But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
|
||||
But if you are sure that the model is in the [list of supported models](../models/supported_models.md), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.
|
||||
|
||||
## Failed to infer device type
|
||||
|
||||
|
||||
@ -90,7 +90,7 @@ vLLM V1 currently excludes model architectures with the `SupportsV0Only` protoco
|
||||
|
||||
!!! tip
|
||||
|
||||
This corresponds to the V1 column in our [list of supported models][supported-models].
|
||||
This corresponds to the V1 column in our [list of supported models](../models/supported_models.md).
|
||||
|
||||
See below for the status of models that are not yet supported or have more features planned in V1.
|
||||
|
||||
|
||||