[CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).

```console
-$ pip install autoawq
+pip install autoawq
```

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

```console
-$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```

AWQ models are also supported directly through the LLM entrypoint:
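
A minimal sketch of that LLM-entrypoint usage (the prompt and sampling settings here are illustrative, not taken from the guide):

```python
from vllm import LLM, SamplingParams

# Illustrative prompt and sampling settings.
prompts = ["What is AWQ quantization?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# quantization="awq" tells vLLM to load the AWQ-quantized weights.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```
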
@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM.

```console
-$ pip install bitsandbytes>=0.45.0
+pip install bitsandbytes>=0.45.0
```

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
@@ -17,7 +17,7 @@ vLLM reads the model's config file and supports both in-flight quantization and
You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
And usually, these repositories have a config.json file that includes a quantization_config section.

-## Read quantized checkpoint.
+## Read quantized checkpoint

```python
from vllm import LLM
@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes", load_format="bitsandbytes")
```

## OpenAI Compatible Server

Append the following to your 4bit model arguments:

-```
+```console
--quantization bitsandbytes --load-format bitsandbytes
```
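
Put together, a server launch with these flags could look like the line below, reusing the `huggyllama/llama-7b` model from the in-flight example above (a pre-quantized bitsandbytes checkpoint can be passed the same way):

```console
vllm serve huggyllama/llama-7b --dtype bfloat16 --quantization bitsandbytes --load-format bitsandbytes
```
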
@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
-$ pip install llmcompressor
+pip install llmcompressor
```

## Quantization Process
@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
Install `vllm` and `lm-evaluation-harness`:

```console
-$ pip install vllm lm-eval==0.4.4
+pip install vllm lm-eval==0.4.4
```

Load and run the model in `vllm`:
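
A rough sketch of that load-and-run step (the checkpoint path points at the `SAVE_DIR` produced above and is illustrative, as is the prompt):

```python
from vllm import LLM

# Load the FP8 checkpoint written by save_pretrained(SAVE_DIR) above.
llm = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```
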
@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization):

```console
-$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
+pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

```console
-$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
-$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
+# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```console
-$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
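
The same GGUF file can also be used from the Python API; a rough sketch, assuming the file downloaded above is in the working directory:

```python
from vllm import LLM, SamplingParams

# As with the CLI example, use the base model's tokenizer rather than converting the GGUF one.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

outputs = llm.generate(["Tell me a joke."], SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```
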

```{warning}
@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
-$ pip install llmcompressor
+pip install llmcompressor
```

## Quantization Process
@@ -192,11 +192,11 @@ A few important things to consider when using the EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
used directly with vLLM due to differences in the expected layer names and model definition.
-To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
+To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
to convert them. Note that this script does not modify the model's weights.

In the above example, use the script to first convert
-the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
+the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
and then use the converted checkpoint as the draft model in vLLM.

2. The EAGLE based draft models need to be run without tensor parallelism
@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).

-
A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |

-
## Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
@@ -250,8 +248,6 @@ speculative decoding, breaking down the guarantees into three key areas:
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
@@ -259,8 +255,6 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

## Resources for vLLM contributors
@@ -55,21 +55,24 @@ print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")
```

Example output:
-```
+
+```text
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit...
```

This example demonstrates:
-- Setting up the server with tool calling enabled
-- Defining an actual function to handle tool calls
-- Making a request with `tool_choice="auto"`
-- Handling the structured response and executing the corresponding function
+
+* Setting up the server with tool calling enabled
+* Defining an actual function to handle tool calls
+* Making a request with `tool_choice="auto"`
+* Handling the structured response and executing the corresponding function

You can also specify a particular function using named function calling by setting `tool_choice={"type": "function", "function": {"name": "get_weather"}}`. Note that this will use the guided decoding backend - so the first time this is used, there will be several seconds of latency (or more) as the FSM is compiled for the first time before it is cached for subsequent requests.

Remember that it's the callers responsibility to:

1. Define appropriate tools in the request
2. Include relevant context in the chat messages
3. Handle the tool calls in your application logic
@@ -77,20 +80,21 @@ Remember that it's the callers responsibility to:
For more advanced usage, including parallel tool calls and different model-specific parsers, see the sections below.

## Named Function Calling

vLLM supports named function calling in the chat completion API by default. It does so using Outlines through guided decoding, so this is
enabled by default, and will work with any supported model. You are guaranteed a validly-parsable function call - not a
high-quality one.

-vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
+vLLM will use guided decoding to ensure the response matches the tool parameter object defined by the JSON schema in the `tools` parameter.
For best results, we recommend ensuring that the expected output format / schema is specified in the prompt to ensure that the model's intended generation is aligned with the schema that it's being forced to generate by the guided decoding backend.

To use a named function, you need to define the functions in the `tools` parameter of the chat completion request, and
specify the `name` of one of the tools in the `tool_choice` parameter of the chat completion request.
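
As an illustration of those two parameters, a request that forces the `get_weather` tool from the earlier example might look roughly like this (the server URL, model name, and tool schema are illustrative):

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Illustrative tool schema matching the get_weather example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",  # whichever model the server is running
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    # Naming one of the tools here forces a call to that specific function.
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```
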

## Automatic Function Calling

To enable this feature, you should set the following flags:

* `--enable-auto-tool-choice` -- **mandatory** Auto tool choice. tells vLLM that you want to enable the model to generate its own tool calls when it
deems appropriate.
* `--tool-call-parser` -- select the tool parser to use (listed below). Additional tool parsers
@@ -104,28 +108,28 @@ from HuggingFace; and you can find an example of this in a `tokenizer_config.jso

If your favorite tool-calling model is not supported, please feel free to contribute a parser & tool use chat template!

### Hermes Models (`hermes`)

All Nous Research Hermes-series models newer than Hermes 2 Pro should be supported.

* `NousResearch/Hermes-2-Pro-*`
* `NousResearch/Hermes-2-Theta-*`
* `NousResearch/Hermes-3-*`

_Note that the Hermes 2 **Theta** models are known to have degraded tool call quality & capabilities due to the merge
step in their creation_.

Flags: `--tool-call-parser hermes`
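
A full server invocation for a Hermes-style model combines this with the flags from the previous section, roughly (the model name is an assumption; any of the Hermes checkpoints listed above should work the same way):

```console
vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
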

### Mistral Models (`mistral`)

Supported models:

* `mistralai/Mistral-7B-Instruct-v0.3` (confirmed)
* Additional mistral function-calling models are compatible as well.

Known issues:

1. Mistral 7B struggles to generate parallel tool calls correctly.
2. Mistral's `tokenizer_config.json` chat template requires tool call IDs that are exactly 9 digits, which is
much shorter than what vLLM generates. Since an exception is thrown when this condition
@@ -136,13 +140,12 @@ it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated
* `examples/tool_chat_template_mistral_parallel.jinja` - this is a "better" version that adds a tool-use system prompt
when tools are provided, that results in much better reliability when working with parallel tool calling.

Recommended flags: `--tool-call-parser mistral --chat-template examples/tool_chat_template_mistral_parallel.jinja`

### Llama Models (`llama3_json`)

Supported models:

* `meta-llama/Meta-Llama-3.1-8B-Instruct`
* `meta-llama/Meta-Llama-3.1-70B-Instruct`
* `meta-llama/Meta-Llama-3.1-405B-Instruct`
@@ -152,6 +155,7 @@ The tool calling that is supported is the [JSON based tool calling](https://llam
Other tool calling formats like the built in python tool calling or custom tool calling are not supported.

Known issues:

1. Parallel tool calls are not supported.
2. The model can generate parameters with a wrong format, such as generating
an array serialized as string instead of an array.
@@ -164,6 +168,7 @@ Recommended flags: `--tool-call-parser llama3_json --chat-template examples/tool
#### IBM Granite

Supported models:

* `ibm-granite/granite-3.0-8b-instruct`

Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
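
As with the other parsers, these flags slot into a normal `vllm serve` invocation; a sketch using the model listed above:

```console
vllm serve ibm-granite/granite-3.0-8b-instruct \
  --enable-auto-tool-choice \
  --tool-call-parser granite \
  --chat-template examples/tool_chat_template_granite.jinja
```
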
@@ -182,42 +187,45 @@ Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/t

`examples/tool_chat_template_granite_20b_fc.jinja`: this is a modified chat template from the original on Huggingface, which is not vLLM compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.

### InternLM Models (`internlm`)

Supported models:

* `internlm/internlm2_5-7b-chat` (confirmed)
* Additional internlm2.5 function-calling models are compatible as well

Known issues:

* Although this implementation also supports InternLM2, the tool call results are not stable when testing with the `internlm/internlm2-chat-7b` model.

Recommended flags: `--tool-call-parser internlm --chat-template examples/tool_chat_template_internlm2_tool.jinja`

### Jamba Models (`jamba`)

AI21's Jamba-1.5 models are supported.

* `ai21labs/AI21-Jamba-1.5-Mini`
* `ai21labs/AI21-Jamba-1.5-Large`

Flags: `--tool-call-parser jamba`

### Models with Pythonic Tool Calls (`pythonic`)

A growing number of models output a python list to represent tool calls instead of using JSON. This has the advantage of inherently supporting parallel tool calls and removing ambiguity around the JSON schema required for tool calls. The `pythonic` tool parser can support such models.

As a concrete example, these models may look up the weather in San Francisco and Seattle by generating:

```python
[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]
```

Limitations:

* The model must not generate both text and tool calls in the same generation. This may not be hard to change for a specific model, but the community currently lacks consensus on which tokens to emit when starting and ending tool calls. (In particular, the Llama 3.2 models emit no such tokens.)
* Llama's smaller models struggle to use tools effectively.

Example supported models:

* `meta-llama/Llama-3.2-1B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `meta-llama/Llama-3.2-3B-Instruct`\* (use with `examples/tool_chat_template_llama3.2_pythonic.jinja`)
* `Team-ACE/ToolACE-8B` (use with `examples/tool_chat_template_toolace.jinja`)
@@ -231,7 +239,6 @@ Llama's smaller models frequently fail to emit tool calls in the correct format.

---

## How to write a tool parser plugin

A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py.
@@ -284,7 +291,8 @@ class ExampleToolParser(ToolParser):
```

Then you can use this plugin in the command line like this.
-```
+
+```console
--enable-auto-tool-choice \
--tool-parser-plugin <absolute path of the plugin file>
--tool-call-parser example \