[Docs] Replace all explicit anchors with real links (#27087)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -153,7 +153,7 @@ VLLM_TARGET_DEVICE="tpu" python -m pip install -e .

### Pre-built images

-See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
+See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.

### Build image from source

@@ -15,7 +15,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.

In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.

-Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below][build-from-source] for more details.
+Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.

# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]

@@ -44,8 +44,6 @@ export CUDA_VERSION=118 # or 126
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
```

-[](){ #install-the-latest-code }

#### Install the latest code

LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on an x86 platform with CUDA 12 for every commit since `v0.5.3`.

@@ -128,11 +126,11 @@ export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vll
uv pip install --editable .
```

-You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
+You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).

!!! note
    There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
-    It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [install-the-latest-code][install-the-latest-code] for instructions on how to install a specified wheel.
+    It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [Install the latest code](#install-the-latest-code) for instructions on how to install a specified wheel.

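A minimal sketch for checking which build is actually installed before comparing it against your checkout; the version-string format described in the comments is an assumption (setuptools-scm style nightly versions), not something specified by this change:

```python
# Minimal sketch: print the installed vLLM version. For nightly/precompiled wheels
# the version string typically embeds the build commit (e.g. "0.x.y.devNNN+gabcdef12"),
# which can be compared against `git rev-parse --short HEAD` in your source checkout.
# The exact format is an assumption; verify it against your own wheel.
import vllm

print(vllm.__version__)
```
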
#### Full build (with compilation)

@@ -250,7 +248,7 @@ uv pip install -e .
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]

-See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
+See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.

Another way to access the latest code is to use the docker images:

@@ -266,11 +264,11 @@ The latest code can contain bugs and may not be stable. Please use it with cauti
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]

-See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
+See [Building vLLM's Docker Image from Source](../../deployment/docker.md#building-vllms-docker-image-from-source) for instructions on building the Docker image.

# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]

-See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.

# --8<-- [end:supported-features]

@@ -66,8 +66,6 @@ vLLM is a Python library that supports the following GPU variants. Select your G

--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"

-[](){ #build-from-source }

### Build wheel from source

=== "NVIDIA CUDA"

@@ -217,6 +217,6 @@ Where the `<path/to/model>` is the location where the model is stored, for examp
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]

-See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.

# --8<-- [end:supported-features]

@@ -2,8 +2,8 @@

This guide will help you quickly get started with vLLM to perform:

-- [Offline batched inference][quickstart-offline]
-- [Online serving using OpenAI-compatible server][quickstart-online]
+- [Offline batched inference](#offline-batched-inference)
+- [Online serving using OpenAI-compatible server](#openai-compatible-server)

## Prerequisites

@@ -42,8 +42,6 @@ uv pip install vllm --torch-backend=auto
!!! note
    For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.

-[](){ #quickstart-offline }

## Offline Batched Inference

With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)

@@ -57,7 +55,7 @@ The first line of this example imports the classes [LLM][vllm.LLM] and [Sampling
from vllm import LLM, SamplingParams
```

-The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
+The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](../api/README.md#inference-parameters).

!!! important
    By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
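
Putting the snippets from this section together, a minimal offline-inference sketch; the prompts and the `facebook/opt-125m` model name are placeholders only, not part of this change:

```python
from vllm import LLM, SamplingParams

# Example prompts; any list of strings works.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Sampling temperature 0.8 and nucleus sampling probability 0.95, as described above.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Placeholder model; substitute any Hugging Face model you want to serve.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single offline batch.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
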
@@ -135,8 +133,6 @@ for output in outputs:
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

-[](){ #quickstart-online }

## OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.

@@ -150,7 +146,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct

!!! note
    By default, the server uses a predefined chat template stored in the tokenizer.
-    You can learn about overriding it [here][chat-template].
+    You can learn about overriding it [here](../serving/openai_compatible_server.md#chat-template).

!!! important
    By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
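
Once the server started with `vllm serve Qwen/Qwen2.5-1.5B-Instruct` is running, it can be queried with the official `openai` Python client. A minimal sketch, assuming the default local port and no `--api-key` set on the server:

```python
from openai import OpenAI

# The vLLM server exposes an OpenAI-compatible API on port 8000 by default;
# the API key is unused unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of vLLM."}],
    temperature=0.8,
)
print(completion.choices[0].message.content)
```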