Stop using title frontmatter and fix doc that can only be reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor
2025-07-08 11:27:40 +01:00
committed by GitHub
parent b4bab81660
commit b942c094e3
81 changed files with 82 additions and 238 deletions

@@ -55,6 +55,7 @@ nav:
 - contributing/model/registration.md
 - contributing/model/tests.md
 - contributing/model/multimodal.md
+- CI: contributing/ci
 - Design Documents:
 - V0: design
 - V1: design/v1
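The navigation tree is defined in the MkDocs configuration; a docs directory that is not listed under `nav:` is still built but never linked from the sidebar, which is why the CI pages could previously only be reached through search. A minimal sketch of how the added entry slots into the tree is shown below — the config filename, indentation, and the "Contributing" parent section are illustrative assumptions, not the exact file contents, and directory-valued entries like `contributing/ci` rely on whatever nav plugin the docs build already uses:

```yaml
# Illustrative fragment of the MkDocs nav config (not the exact file).
nav:
  - Contributing:
      - contributing/model/registration.md
      - contributing/model/tests.md
      - contributing/model/multimodal.md
      # New entry: gives the CI docs a sidebar link instead of leaving
      # them reachable only via search.
      - CI: contributing/ci
  - Design Documents:
      - V0: design
      - V1: design/v1
```

With the entry present, a "CI" item appears in the rendered navigation, which is what the second half of the commit title refers to.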

@@ -1,5 +1,3 @@
----
-title: Contact Us
----
+# Contact Us
 --8<-- "README.md:contact-us"

@@ -1,6 +1,4 @@
----
-title: Meetups
----
+# Meetups
 We host regular meetups in the San Francisco Bay Area every 2 months. We share project updates from the vLLM team and have guest speakers from the industry share their experience and insights. Please find the materials of our previous meetups below:

@@ -1,6 +1,4 @@
----
-title: Engine Arguments
----
+# Engine Arguments
 Engine arguments control the behavior of the vLLM engine.

@@ -1,6 +1,4 @@
----
-title: Server Arguments
----
+# Server Arguments
 The `vllm serve` command is used to launch the OpenAI-compatible server.

@@ -1,6 +1,4 @@
----
-title: Benchmark Suites
----
+# Benchmark Suites
 vLLM contains two sets of benchmarks:

@@ -1,6 +1,4 @@
----
-title: Update PyTorch version on vLLM OSS CI/CD
----
+# Update PyTorch version on vLLM OSS CI/CD
 vLLM's current policy is to always use the latest PyTorch stable
 release in CI/CD. It is standard practice to submit a PR to update the

@@ -1,6 +1,4 @@
----
-title: Summary
----
+# Summary
 !!! important
     Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!

@@ -1,6 +1,4 @@
----
-title: Basic Model
----
+# Basic Model
 This guide walks you through the steps to implement a basic vLLM model.

@@ -1,6 +1,4 @@
----
-title: Multi-Modal Support
----
+# Multi-Modal Support
 This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).

@@ -1,6 +1,4 @@
----
-title: Registering a Model
----
+# Registering a Model
 vLLM relies on a model registry to determine how to run each model.
 A list of pre-registered architectures can be found [here](../../models/supported_models.md).

@@ -1,6 +1,4 @@
----
-title: Unit Testing
----
+# Unit Testing
 This page explains how to write unit tests to verify the implementation of your model.

@@ -1,6 +1,4 @@
----
-title: Using Docker
----
+# Using Docker
 [](){ #deployment-docker-pre-built-image }

@@ -1,6 +1,5 @@
----
-title: Anyscale
----
+# Anyscale
 [](){ #deployment-anyscale }
 [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.

@@ -1,6 +1,4 @@
----
-title: Anything LLM
----
+# Anything LLM
 [Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.

@@ -1,6 +1,4 @@
----
-title: AutoGen
----
+# AutoGen
 [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.

@@ -1,6 +1,4 @@
----
-title: BentoML
----
+# BentoML
 [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

@@ -1,6 +1,4 @@
----
-title: Cerebrium
----
+# Cerebrium
 <p align="center">
 <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>

@@ -1,6 +1,4 @@
----
-title: Chatbox
----
+# Chatbox
 [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.

@@ -1,6 +1,4 @@
----
-title: Dify
----
+# Dify
 [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.

@@ -1,6 +1,4 @@
----
-title: dstack
----
+# dstack
 <p align="center">
 <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>

@@ -1,6 +1,4 @@
----
-title: Haystack
----
+# Haystack

@@ -1,6 +1,4 @@
----
-title: Helm
----
+# Helm
 A Helm chart to deploy vLLM for Kubernetes

@@ -1,6 +1,4 @@
----
-title: LiteLLM
----
+# LiteLLM
 [LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq, etc.]

@@ -1,6 +1,4 @@
----
-title: Lobe Chat
----
+# Lobe Chat
 [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.

@@ -1,6 +1,4 @@
----
-title: LWS
----
+# LWS
 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is for multi-host/multi-node distributed inference.

@@ -1,6 +1,4 @@
----
-title: Modal
----
+# Modal
 vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.

@@ -1,6 +1,4 @@
----
-title: Open WebUI
----
+# Open WebUI
 1. Install [Docker](https://docs.docker.com/engine/install/)

@@ -1,6 +1,4 @@
----
-title: Retrieval-Augmented Generation
----
+# Retrieval-Augmented Generation
 [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.

@@ -1,6 +1,4 @@
----
-title: SkyPilot
----
+# SkyPilot
 <p align="center">
 <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>

@@ -1,6 +1,4 @@
----
-title: Streamlit
----
+# Streamlit
 [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.

@@ -1,5 +1,3 @@
----
-title: NVIDIA Triton
----
+# NVIDIA Triton
 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.

@@ -1,6 +1,4 @@
----
-title: KServe
----
+# KServe
 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

@@ -1,6 +1,4 @@
----
-title: KubeAI
----
+# KubeAI
 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

@@ -1,6 +1,4 @@
----
-title: Llama Stack
----
+# Llama Stack
 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

@@ -1,6 +1,4 @@
----
-title: llmaz
----
+# llmaz
 [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed at production use. It uses vLLM as the default model serving backend.

@@ -1,6 +1,4 @@
----
-title: Production stack
----
+# Production stack
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

@@ -1,6 +1,4 @@
----
-title: Using Kubernetes
----
+# Using Kubernetes
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

@@ -1,6 +1,4 @@
----
-title: Using Nginx
----
+# Using Nginx
 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

@@ -1,6 +1,4 @@
----
-title: Architecture Overview
----
+# Architecture Overview
 This document provides an overview of the vLLM architecture.

@@ -1,6 +1,4 @@
----
-title: Automatic Prefix Caching
----
+# Automatic Prefix Caching
 The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

@@ -1,6 +1,4 @@
----
-title: Integration with HuggingFace
----
+# Integration with HuggingFace
 This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.

@@ -1,6 +1,4 @@
----
-title: vLLM Paged Attention
----
+# vLLM Paged Attention
 Currently, vLLM utilizes its own implementation of a multi-head query
 attention kernel (`csrc/attention/attention_kernels.cu`).

@@ -1,6 +1,4 @@
----
-title: Multi-Modal Data Processing
----
+# Multi-Modal Data Processing
 To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of the HF processor.

@@ -1,6 +1,4 @@
----
-title: vLLM's Plugin System
----
+# vLLM's Plugin System
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.

@@ -1,6 +1,4 @@
----
-title: Automatic Prefix Caching
----
+# Automatic Prefix Caching
 ## Introduction

@@ -1,6 +1,4 @@
----
-title: Compatibility Matrix
----
+# Compatibility Matrix
 The tables below show mutually exclusive features and the support on some hardware.

@@ -1,6 +1,4 @@
----
-title: Disaggregated Prefilling (experimental)
----
+# Disaggregated Prefilling (experimental)
 This page introduces you to the disaggregated prefilling feature in vLLM.

@@ -1,6 +1,4 @@
----
-title: LoRA Adapters
----
+# LoRA Adapters
 This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.

@@ -1,6 +1,4 @@
----
-title: Multimodal Inputs
----
+# Multimodal Inputs
 This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.

@@ -1,6 +1,4 @@
----
-title: Quantization
----
+# Quantization
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

@@ -1,6 +1,4 @@
----
-title: AutoAWQ
----
+# AutoAWQ
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.

@@ -1,6 +1,4 @@
----
-title: BitBLAS
----
+# BitBLAS
 vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.

@@ -1,6 +1,4 @@
----
-title: BitsAndBytes
----
+# BitsAndBytes
 vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
 BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.

@@ -1,6 +1,4 @@
----
-title: FP8 W8A8
----
+# FP8 W8A8
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.

@@ -1,6 +1,4 @@
----
-title: GGUF
----
+# GGUF
 !!! warning
     Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.

@@ -1,6 +1,4 @@
----
-title: GPTQModel
----
+# GPTQModel
 To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.

@@ -1,6 +1,4 @@
----
-title: INT4 W4A16
----
+# INT4 W4A16
 vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).

@@ -1,6 +1,4 @@
----
-title: INT8 W8A8
----
+# INT8 W8A8
 vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
 This quantization method is particularly useful for reducing model size while maintaining good performance.

@@ -1,6 +1,4 @@
----
-title: Quantized KV Cache
----
+# Quantized KV Cache
 ## FP8 KV Cache

@@ -1,6 +1,4 @@
----
-title: AMD Quark
----
+# AMD Quark
 Quantization can effectively reduce memory and bandwidth usage, accelerate computation, and improve
 throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),

@@ -1,6 +1,4 @@
----
-title: Supported Hardware
----
+# Supported Hardware
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

@@ -1,6 +1,4 @@
----
-title: Reasoning Outputs
----
+# Reasoning Outputs
 vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.

@@ -1,6 +1,4 @@
----
-title: Speculative Decoding
----
+# Speculative Decoding
 !!! warning
     Please note that speculative decoding in vLLM is not yet optimized and does

@@ -1,6 +1,4 @@
----
-title: Structured Outputs
----
+# Structured Outputs
 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or

@@ -1,6 +1,4 @@
----
-title: Installation
----
+# Installation
 vLLM supports the following hardware platforms:

@@ -1,6 +1,4 @@
----
-title: Quickstart
----
+# Quickstart
 This guide will help you quickly get started with vLLM to perform:

@@ -1,6 +1,4 @@
----
-title: Loading models with Run:ai Model Streamer
----
+# Loading models with Run:ai Model Streamer
 Run:ai Model Streamer is a library for reading tensors concurrently while streaming them to GPU memory.
 Further reading can be found in the [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).

@@ -1,6 +1,4 @@
----
-title: Loading models with CoreWeave's Tensorizer
----
+# Loading models with CoreWeave's Tensorizer
 vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
 vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized

@@ -1,6 +1,4 @@
----
-title: Generative Models
----
+# Generative Models
 vLLM provides first-class support for generative models, which cover most LLMs.

@@ -1,6 +1,4 @@
----
-title: TPU
----
+# TPU
 # TPU Supported Models
 ## Text-only Language Models

@@ -1,6 +1,4 @@
----
-title: Pooling Models
----
+# Pooling Models
 vLLM also supports pooling models, including embedding, reranking and reward models.

@@ -1,6 +1,4 @@
----
-title: Supported Models
----
+# Supported Models
 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
 If a model supports more than one task, you can set the task via the `--task` argument.

@@ -1,6 +1,4 @@
----
-title: Distributed Inference and Serving
----
+# Distributed Inference and Serving
 ## How to decide the distributed inference strategy?

@@ -1,6 +1,4 @@
----
-title: LangChain
----
+# LangChain
 vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).

@@ -1,6 +1,4 @@
----
-title: LlamaIndex
----
+# LlamaIndex
 vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).

@@ -1,6 +1,4 @@
----
-title: Offline Inference
----
+# Offline Inference
 Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
@@ -23,7 +21,7 @@ The available APIs depend on the model type:
 !!! info
     [API Reference][offline-inference-api]
-### Ray Data LLM API
+## Ray Data LLM API
 Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
 This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

@@ -1,6 +1,4 @@
----
-title: OpenAI-Compatible Server
----
+# OpenAI-Compatible Server
 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

@@ -1,6 +1,4 @@
----
-title: Frequently Asked Questions
----
+# Frequently Asked Questions
 > Q: How can I serve multiple models on a single port using the OpenAI API?

@@ -1,6 +1,4 @@
----
-title: Troubleshooting
----
+# Troubleshooting
 This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.