Remove unnecessary explicit title anchors and use relative links instead (#20620)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor
2025-07-08 10:49:13 +01:00
committed by GitHub
parent b91cb3fa5c
commit b4bab81660
86 changed files with 75 additions and 147 deletions


@@ -1,7 +1,6 @@
 ---
 title: Architecture Overview
 ---
-[](){ #arch-overview }
 This document provides an overview of the vLLM architecture.
@@ -74,7 +73,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
 That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
-More details on the API server can be found in the [OpenAI-Compatible Server][serving-openai-compatible-server] document.
+More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
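A minimal sketch of exercising that server, assuming it is running locally on the default port 8000 with no API key configured (the model name and prompt below are placeholders):

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="<model>",  # the same model name passed on the command line above
    prompt="San Francisco is a",
    max_tokens=16,
)
print(completion.choices[0].text)
```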
## LLM Engine
@@ -132,7 +131,7 @@ input tensors and capturing cudagraphs.
 ## Model
 Every model runner object has one model object, which is the actual
-`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
+`torch.nn.Module` instance. See [huggingface_integration](huggingface_integration.md) for how various
 configurations affect the class we ultimately get.
 ## Class Hierarchy


@@ -1,7 +1,6 @@
 ---
 title: Automatic Prefix Caching
 ---
-[](){ #design-automatic-prefix-caching }
 The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
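To make the block-table idea concrete, here is a toy Python sketch of the bookkeeping (illustrative only, not vLLM's actual data structures): a request's logical KV blocks are mapped to whatever physical blocks happen to be free, and a new block is allocated only when the current one fills up.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class BlockTable:
    """Toy mapping from a request's logical KV blocks to physical block IDs."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks          # pool of free physical blocks
        self.logical_to_physical: list[int] = []

    def append_token(self, num_tokens_so_far: int) -> int:
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())
        # The token's keys/values are written into this block; a request's
        # blocks need not be contiguous in physical memory.
        return self.logical_to_physical[-1]

table = BlockTable(free_blocks=[7, 3, 42, 0])
for i in range(40):                  # 40 tokens -> 3 blocks of 16
    table.append_token(i)
print(table.logical_to_physical)     # [0, 42, 3]
```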


@@ -1,7 +1,6 @@
 ---
 title: Integration with HuggingFace
 ---
-[](){ #huggingface-integration }
 This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
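As a rough sketch of the first HuggingFace-facing step (the model ID below is just an example, and vLLM's internal call sites differ), the model's configuration is resolved from the Hub and its `architectures` field determines which model implementation is used:

```python
from transformers import AutoConfig

# Downloads config.json from the Hub (or reads it from the local cache).
config = AutoConfig.from_pretrained("facebook/opt-125m")

# e.g. ['OPTForCausalLM'] -- matched against vLLM's model registry to pick
# the implementation class.
print(config.architectures)
```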


@@ -1,7 +1,6 @@
 ---
 title: vLLM Paged Attention
 ---
-[](){ #design-paged-attention }
 Currently, vLLM utilizes its own implementation of a multi-head query
 attention kernel (`csrc/attention/attention_kernels.cu`).


@@ -1,9 +1,8 @@
 ---
 title: Multi-Modal Data Processing
 ---
-[](){ #mm-processing }
-To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
+To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
 Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:


@@ -1,13 +1,12 @@
 ---
 title: vLLM's Plugin System
 ---
-[](){ #plugin-system }
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
 ## How Plugins Work in vLLM
-Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
+Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview](arch_overview.md)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
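A minimal sketch of such a plugin, assuming the `vllm.general_plugins` entry-point group (package and model names below are hypothetical): the installable package exposes a `register` function through an entry point, and `load_general_plugins` calls it in every process, so the hook should be idempotent.

```python
# my_vllm_plugin/__init__.py -- hypothetical plugin package.
# Declared in pyproject.toml (names are illustrative):
#
#   [project.entry-points."vllm.general_plugins"]
#   register_my_model = "my_vllm_plugin:register"

def register():
    """Entry point invoked by load_general_plugins in every vLLM process."""
    from vllm import ModelRegistry

    # Keep the hook idempotent: it may run once per spawned process.
    if "MyOPTForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyOPTForCausalLM",
            "my_vllm_plugin.modeling:MyOPTForCausalLM",
        )
```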
## How vLLM Discovers Plugins