Remove unnecessary explicit title anchors and use relative links instead (#20620)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Author: Harry Mellor
Date: 2025-07-08 10:49:13 +01:00 (committed by GitHub)
Parent: b91cb3fa5c
Commit: b4bab81660
86 changed files with 75 additions and 147 deletions
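
The change is mechanical and repeats across the documentation: each file's explicit `[](){ #anchor }` title anchor is deleted, and `[text][anchor]` cross-references that pointed at those anchors are rewritten as standard relative Markdown links (for example, `[here][design-automatic-prefix-caching]` becomes `[here](../design/automatic_prefix_caching.md)`). A representative subset of the 86 file diffs follows.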

View File

@@ -1,14 +1,13 @@
---
title: Automatic Prefix Caching
---
-[](){ #automatic-prefix-caching }
## Introduction
Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
!!! note
-    Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
+    Technical details on how vLLM implements APC can be found [here](../design/automatic_prefix_caching.md).
## Enabling APC in vLLM
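
For context (not part of the diff): APC is switched on with the `enable_prefix_caching` engine argument. A minimal sketch, with an illustrative model name:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching turns APC on; the model name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

prefix = "You are a concise assistant. Context: ...long shared document...\n"
params = SamplingParams(temperature=0.0, max_tokens=64)

# The first request computes and caches the KV for the shared prefix;
# the second reuses it and only computes its own suffix.
llm.generate([prefix + "Summarize the context."], params)
llm.generate([prefix + "List three key points."], params)
```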

View File

@@ -1,7 +1,6 @@
---
title: Compatibility Matrix
---
-[](){ #compatibility-matrix }
The tables below show mutually exclusive features and their support on some hardware.
@@ -37,13 +36,13 @@ th:not(:first-child) {
}
</style>
-| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
+| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
-| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
-| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
+| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
+| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
-| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
+| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
@@ -62,10 +61,10 @@ th:not(:first-child) {
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [APC][automatic-prefix-caching] | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8475) | ✅ | ❌ |
-| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
+| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |

View File

@@ -1,7 +1,6 @@
---
title: Disaggregated Prefilling (experimental)
---
-[](){ #disagg-prefill }
This page introduces you to the disaggregated prefilling feature in vLLM.

View File

@@ -1,7 +1,6 @@
---
title: LoRA Adapters
---
-[](){ #lora-adapter }
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
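
For context (not part of the diff), a minimal sketch of using a LoRA adapter with the offline API; the adapter repo is the small SQL adapter used in vLLM's own examples:

```python
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Download a small SQL LoRA adapter trained on top of Llama-2-7B.
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
outputs = llm.generate(
    "Write a SQL query that counts users who signed up today.",
    SamplingParams(max_tokens=64),
    # LoRARequest(name, unique integer id, local adapter path)
    lora_request=LoRARequest("sql_adapter", 1, lora_path),
)
```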

View File

@@ -1,7 +1,6 @@
---
title: Multimodal Inputs
---
-[](){ #multimodal-inputs }
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
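
For context (not part of the diff), the offline API accepts a dict with `multi_modal_data` alongside the prompt. A minimal sketch with a LLaVA-style model and an illustrative local image path:

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")  # illustrative local file

# "<image>" marks where image features are inserted in LLaVA-style prompts.
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this picture? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```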

View File

@@ -1,7 +1,6 @@
---
title: Quantization
---
-[](){ #quantization-index }
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

View File

@@ -1,7 +1,6 @@
---
title: AutoAWQ
---
-[](){ #auto-awq }
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4, which effectively reduces the total model memory footprint.
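
For context (not part of the diff), a minimal AutoAWQ quantization sketch following the upstream README; the model and output paths are illustrative:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative
quant_path = "mistral-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be served with vLLM via `LLM(model=quant_path, quantization="awq")`.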

View File

@@ -1,7 +1,6 @@
---
title: BitBLAS
---
-[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.

View File

@@ -1,7 +1,6 @@
---
title: BitsAndBytes
---
-[](){ #bits-and-bytes }
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
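
For context (not part of the diff), in-flight bitsandbytes quantization of an unquantized checkpoint is a one-liner; note that some older vLLM versions also required `load_format="bitsandbytes"`:

```python
from vllm import LLM

# Quantize the weights on the fly while loading the checkpoint.
llm = LLM(model="huggyllama/llama-7b", quantization="bitsandbytes")
```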

View File

@@ -1,7 +1,6 @@
---
title: FP8 W8A8
---
-[](){ #fp8 }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
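
For context (not part of the diff), dynamic FP8 quantization of an unquantized checkpoint needs only the `quantization` flag — a minimal sketch with an illustrative model name:

```python
from vllm import LLM

# Weights are quantized to FP8 at load time; activations use dynamic scales.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")
```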

View File

@@ -1,7 +1,6 @@
---
title: GGUF
---
-[](){ #gguf }
!!! warning
    Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
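
For context (not part of the diff), a single-file GGUF checkpoint can be passed directly as the model; vLLM's docs recommend also passing the base model's tokenizer, since converting the tokenizer embedded in the GGUF file can be slow. Filenames here are illustrative:

```python
from vllm import LLM

llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # local GGUF file
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # base model tokenizer
)
```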

View File

@@ -1,7 +1,6 @@
---
title: GPTQModel
---
-[](){ #gptqmodel }
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.

View File

@@ -1,7 +1,6 @@
---
title: INT4 W4A16
---
-[](){ #int4 }
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).

View File

@@ -1,7 +1,6 @@
---
title: INT8 W8A8
---
-[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.

View File

@@ -1,7 +1,6 @@
---
title: Quantized KV Cache
---
-[](){ #quantized-kvcache }
## FP8 KV Cache
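
For context (not part of the diff), the FP8 KV cache is enabled via the `kv_cache_dtype` engine argument — a minimal sketch with an illustrative model name:

```python
from vllm import LLM

# Store the KV cache in 8-bit floating point, roughly halving its memory
# footprint versus FP16 and allowing longer contexts or larger batches.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
```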

View File

@@ -1,7 +1,6 @@
---
title: AMD Quark
---
-[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),

View File

@@ -1,7 +1,6 @@
---
title: Supported Hardware
---
-[](){ #quantization-supported-hardware }
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

View File

@@ -1,7 +1,6 @@
---
title: Reasoning Outputs
---
-[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
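
For context (not part of the diff), once the server is launched with a reasoning parser (e.g. `vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1`; older releases also required `--enable-reasoning`), the chain of thought comes back in a separate field — a minimal client sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)
message = resp.choices[0].message
print(message.reasoning_content)  # reasoning steps (vLLM extension field)
print(message.content)            # the final answer
```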

View File

@@ -1,7 +1,6 @@
---
title: Speculative Decoding
---
-[](){ #spec-decode }
!!! warning
    Please note that speculative decoding in vLLM is not yet optimized and does
@@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
-titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
+titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
@@ -278,7 +277,7 @@ can occur due to the following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
## Resources for vLLM contributors
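
For context (not part of the diff), a minimal sketch of enabling draft-model speculative decoding; the exact argument spelling has shifted across vLLM versions (older releases took separate `speculative_model`/`num_speculative_tokens` arguments):

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-6.7b",
    # Propose 5 tokens per step with a small draft model, then verify
    # them in a single forward pass of the target model.
    speculative_config={
        "model": "facebook/opt-125m",
        "num_speculative_tokens": 5,
    },
)
```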

View File

@@ -1,7 +1,6 @@
---
title: Structured Outputs
---
-[](){ #structured-outputs }
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context-free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
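
For context (not part of the diff), these parameters are passed through the OpenAI client's `extra_body` — a minimal `guided_choice` sketch against a locally running server, with an illustrative model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    # vLLM-specific parameter: constrain the output to one of these strings.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```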
Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the