[Doc] Update docs on handling OOM (#15357)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
@@ -2,7 +2,12 @@
 # Engine Arguments
 
-Below, you can find an explanation of every engine argument for vLLM:
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference](#offline-inference), they are part of the arguments to the `LLM` class.
+- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
+
+Below, you can find an explanation of every engine argument:
 
 <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
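To make the offline/online split above concrete, here is a brief sketch (not part of the commit; the model name and values are illustrative, though `gpu_memory_utilization` and `max_model_len` are real engine arguments):

```python
from vllm import LLM

# Offline inference: engine arguments are passed as keyword arguments
# to the LLM class.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.8,  # fraction of GPU memory vLLM may claim
    max_model_len=2048,          # maximum context length per request
)
outputs = llm.generate("Hello, my name is")
```

For online serving, the same arguments become kebab-case flags on the CLI: `vllm serve facebook/opt-125m --gpu-memory-utilization 0.8 --max-model-len 2048`.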
@@ -15,7 +20,7 @@ Below, you can find an explanation of every engine argument for vLLM:
 
 ## Async Engine Arguments
 
-Below are the additional arguments related to the asynchronous engine:
+Additional arguments are available to the asynchronous engine, which is used for online serving:
 
 <!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
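As a rough illustration of the reworded sentence, a sketch assuming the v0 engine layout of this era (`disable_log_requests` is one such async-only argument; check your installed version):

```python
# Sketch: constructing the asynchronous engine used for online serving.
# AsyncEngineArgs extends EngineArgs with async-only options such as
# disable_log_requests (assumed available in this vLLM version).
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="facebook/opt-125m",
    disable_log_requests=True,  # async-engine argument, not on EngineArgs
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

When serving, the equivalent is a CLI flag, e.g. `vllm serve facebook/opt-125m --disable-log-requests`.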
@@ -97,6 +97,13 @@ llm = LLM(model="adept/fuyu-8b",
           max_num_seqs=2)
 ```
 
+#### Adjust cache size
+
+If you run out of CPU RAM, try the following options:
+
+- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+
 ### Performance optimization and tuning
 
 You can potentially improve the performance of vLLM by fine-tuning various options.
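A short sketch of applying the new cache-size advice (the values are illustrative; both variables are read when the engine starts, so set them before vLLM initializes):

```python
import os

# Shrink the caches before creating the engine to reduce CPU RAM pressure.
# Values are in GiB; both default to 4.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"   # multi-modal models only
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "2"    # CPU backend only

from vllm import LLM

llm = LLM(model="adept/fuyu-8b",
          max_num_seqs=2)
```

Equivalently, export the variables in the shell (e.g. `export VLLM_MM_INPUT_CACHE_GIB=2`) before launching `vllm serve`.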