[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Harry Mellor
2025-01-29 03:38:29 +00:00
committed by GitHub
parent a7e3eba66f
commit dd6a3a02cb
68 changed files with 2352 additions and 2341 deletions

View File

@ -2,11 +2,11 @@
# AutoAWQ
```{warning}
:::{warning}
Please note that AWQ support in vLLM is under-optimized at the moment. We recommend using the unquantized version of the model for better
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
```
:::
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
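A sketch of that flow, following AutoAWQ's README (the model and output paths here are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths; substitute the model you actually want to quantize.
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
quant_path = "Mistral-7B-Instruct-v0.2-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and write out the INT4 checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```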

View File

@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
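These limits can be verified directly in PyTorch (assuming a build with native FP8 dtypes, 2.1 or later; `float8_e4m3fn` is the finite-only E4M3 variant, which stores `nan` but not `inf`):

```python
import torch

# Maximum representable finite values for each FP8 variant.
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```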
```{note}
:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
```
:::
## Quick Start with Online Dynamic Quantization
@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")
result = model.generate("Hello, my name is")
```
```{warning}
:::{warning}
Currently, we load the model at its original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
```
:::
## Installation
@ -110,9 +110,9 @@ model.generate("Hello my name is")
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
```{note}
:::{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```
:::
```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
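# A sketch of the rest of the invocation, assuming lm_eval's documented CLI,
# with add_bos_token=True per the note above:
$ lm_eval --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --limit 250 \
  --batch_size 'auto'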
@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th
## Deprecated Flow
```{note}
:::{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
```
:::
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
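A minimal sketch of that deprecated flow, based on the AutoFP8 README (paths are illustrative, and exact signatures may differ between AutoFP8 versions):

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

# A tiny calibration batch; real runs should use representative data.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
examples = tokenizer(["auto_fp8 is an easy-to-use quantization library"],
                     return_tensors="pt").to("cuda")

# Static per-tensor activation scales, computed from the calibration data.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir,
                                           quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```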

View File

@ -2,13 +2,13 @@
# GGUF
```{warning}
:::{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it may be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
```
:::
```{warning}
:::{warning}
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge the files into a single-file model.
```
:::
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
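As a sketch (the download URL assumes Hugging Face's standard `resolve/main` pattern; the file name matches the tensor-parallel example below):

```console
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```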
@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
```{warning}
:::{warning}
We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
```
:::
You can also use the GGUF model directly through the LLM entrypoint:
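A minimal sketch, reusing the local file and base-model tokenizer from above:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the local GGUF file; use the base model's tokenizer
# rather than the one converted from GGUF (see the warning above).
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
          tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```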

View File

@ -4,7 +4,7 @@
Quantization trades off model precision for a smaller memory footprint, allowing large models to be run on a wider range of devices.
```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1
@ -15,4 +15,4 @@ gguf
int8
fp8
quantized_kvcache
```
:::

View File

@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
```{note}
:::{note}
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
```
:::
## Prerequisites
@ -119,9 +119,9 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
```{note}
:::{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
```
:::
## Best Practices

View File

@ -4,128 +4,129 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
```{list-table}
:::{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
* - Implementation
- Volta
- Turing
- Ampere
- Ada
- Hopper
- AMD GPU
- Intel GPU
- x86 CPU
- AWS Inferentia
- Google TPU
* - AWQ
-
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
- ✅︎
- ✅︎
-
-
* - GPTQ
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
- ✅︎
- ✅︎
-
-
* - Marlin (GPTQ/AWQ/FP8)
-
-
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
-
* - INT8 (W8A8)
-
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
-
- ✅︎
-
-
* - FP8 (W8A8)
-
-
-
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
* - AQLM
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
-
* - bitsandbytes
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
-
* - DeepSpeedFP
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
-
* - GGUF
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
- ✅︎
-
-
-
-
```
- * Implementation
* Volta
* Turing
* Ampere
* Ada
* Hopper
* AMD GPU
* Intel GPU
* x86 CPU
* AWS Inferentia
* Google TPU
- * AWQ
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * GPTQ
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * Marlin (GPTQ/AWQ/FP8)
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * INT8 (W8A8)
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
* ✅︎
*
*
- * FP8 (W8A8)
*
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
- * AQLM
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * bitsandbytes
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * DeepSpeedFP
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * GGUF
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
```{note}
:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
```
:::