Derive auto max model len state from original value
@@ -37,7 +37,8 @@ Dynamic quantization is also supported via the `quantization` option -- see [her
## Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
-and the maximum batch size (`max_num_seqs` option).
+and the maximum batch size (`max_num_seqs` option). Setting `max_model_len=-1` lets vLLM automatically
+pick the largest context length that fits in GPU memory, up to the model's default maximum.
```python
from vllm import LLM
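For context, here is a minimal sketch of how the documented options could be used together once this change lands. The model name, batch size, and prompt are placeholder values, not part of this change:

```python
from vllm import LLM

# Sketch only: "facebook/opt-125m" is a placeholder model.
# max_model_len=-1 asks vLLM to derive the largest context length that fits
# in GPU memory (up to the model's default maximum), and max_num_seqs caps
# the number of sequences processed in a single batch.
llm = LLM(
    model="facebook/opt-125m",
    max_model_len=-1,
    max_num_seqs=32,
)

# Hypothetical usage: generate one completion and print it.
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```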