diff --git a/docs/ci/update_pytorch_version.md b/docs/ci/update_pytorch_version.md
index 69fdc82ef9..eb8f194557 100644
--- a/docs/ci/update_pytorch_version.md
+++ b/docs/ci/update_pytorch_version.md
@@ -7,9 +7,8 @@ release in CI/CD. It is standard practice to submit a PR to update the
 PyTorch version as early as possible when a new [PyTorch stable
 release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
 This process is non-trivial due to the gap between PyTorch
-releases. Using [#16859](https://github.com/vllm-project/vllm/pull/16859) as
-an example, this document outlines common steps to achieve this update along with
-a list of potential issues and how to address them.
+releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
+update along with a list of potential issues and how to address them.
 
 ## Test PyTorch release candidates (RCs)
 
@@ -68,7 +67,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
 it doesn't populate the cache, so re-running it to warm up the cache
 is ineffective.
 
-While ongoing efforts like [#17419](https://github.com/vllm-project/vllm/issues/17419)
+While ongoing efforts like [#17419](gh-issue:17419)
 address the long build time at its source, the current workaround is to set VLLM_CI_BRANCH
 to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
 when manually triggering a build on Buildkite. This branch accomplishes two things:
@@ -129,6 +128,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
 for different platforms in vLLM CI/CD allows us to selectively choose which platforms
 to update. For instance, updating XPU requires the corresponding release from
 https://github.com/intel/intel-extension-for-pytorch by Intel.
-While https://github.com/vllm-project/vllm/pull/16859 updated vLLM to PyTorch
-2.7.0 on CPU, CUDA, and ROCm, https://github.com/vllm-project/vllm/pull/17444
-completed the update for XPU.
+While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
+<gh-pr:17444> completed the update for XPU.
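The second hunk above describes manually triggering a Buildkite build with `VLLM_CI_BRANCH` set. As a rough, unofficial sketch of what that trigger can look like (the organization and pipeline slugs, branch name, and token handling are assumptions, not values from the vLLM docs, and the Buildkite web UI works just as well), a build could be created through Buildkite's REST API:

```python
# Hypothetical illustration only: create a Buildkite build with VLLM_CI_BRANCH set.
# ORG_SLUG, PIPELINE_SLUG, and BUILDKITE_API_TOKEN are placeholders.
import os
import requests

resp = requests.post(
    "https://api.buildkite.com/v2/organizations/ORG_SLUG/pipelines/PIPELINE_SLUG/builds",
    headers={"Authorization": f"Bearer {os.environ['BUILDKITE_API_TOKEN']}"},
    json={
        "commit": "HEAD",
        "branch": "my-pytorch-update-branch",  # the PR branch under test
        "message": "Manual build using the post-merge queue",
        "env": {"VLLM_CI_BRANCH": "khluu/use_postmerge_q"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["web_url"])  # link to the triggered build
```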
diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md
index abda7db53f..f28a74ce22 100644
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -217,8 +217,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
 A few important things to consider when using the EAGLE based draft models:
 
 1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
-   be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
-   If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
+   be able to be loaded and used directly by vLLM after <gh-pr:12304>.
+   If you are using vllm version before <gh-pr:12304>, please use the
    [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
    and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using
    the latest version of vLLM, please leave a comment or raise an issue.
@@ -228,7 +228,7 @@ A few important things to consider when using the EAGLE based draft models:
 
 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
    reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
-   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+   investigation and tracked here: <gh-issue:9565>.
 
 A variety of EAGLE draft models are available on the Hugging Face hub:
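For readers unfamiliar with the `speculative_config` referenced in the hunks above, here is a minimal offline sketch; the model names and `num_speculative_tokens` value are illustrative assumptions rather than part of this change, and older vLLM versions would point `"model"` at the converted checkpoint instead of the yuhuili repository:

```python
# Minimal sketch (model names and token count are assumptions): offline
# generation with an EAGLE draft model supplied through speculative_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        # On vLLM versions predating the change above, point this at the
        # converted checkpoint, e.g. "path/to/modified/eagle/model".
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 5,
    },
)

outputs = llm.generate(
    ["Speculative decoding can reduce latency because"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```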
diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index 7f1f76ce3d..2b7abc7f46 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -212,7 +212,7 @@ if __name__ == '__main__':
 
 ## `torch.compile` Error
 
-vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
+vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
 
 ??? Code
 
@@ -231,7 +231,7 @@ vLLM heavily depends on `torch.compile` to optimize the model for better perform
     print(f(x))
     ```
 
-If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.
+If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <gh-issue:12219> for example.
 
 ## Model failed to be inspected
 
diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md
index 82a2710d89..f2a7679f5c 100644
--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -2,7 +2,7 @@
 
 !!! announcement
 
-    We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
+    We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.
 
 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
 
@@ -83,7 +83,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
 | **Decoder-only Models** | 🚀 Optimized |
 | **Encoder-Decoder Models** | 🟠 Delayed |
 | **Embedding Models** | 🟢 Functional |
-| **Mamba Models** | 🚧 WIP ([PR #19327](https://github.com/vllm-project/vllm/pull/19327)) |
+| **Mamba Models** | 🚧 WIP (<gh-pr:19327>) |
 | **Multimodal Models** | 🟢 Functional |
 
 vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.
@@ -98,14 +98,14 @@ See below for the status of models that are not yet supported or have more featu
 
 The initial basic support is now functional.
 
-Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
-which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
+Later, we will consider using [hidden states processor](gh-issue:12249),
+which is based on [global logits processor](gh-pr:13360)
 to enable simultaneous generation and embedding using the same engine instance in V1.
 
 #### Mamba Models
 
 Models using selective state-space mechanisms instead of standard transformer attention (e.g., `MambaForCausalLM`, `JambaForCausalLM`)
-will be supported via [PR #19327](https://github.com/vllm-project/vllm/pull/19327).
+will be supported via <gh-pr:19327>.
 
 #### Encoder-Decoder Models
 
@@ -120,13 +120,13 @@ are not yet supported.
 | **Chunked Prefill** | 🚀 Optimized |
 | **LoRA** | 🚀 Optimized |
 | **Logprobs Calculation** | 🟢 Functional |
-| **FP8 KV Cache** | 🟢 Functional on Hopper devices ([PR #15191](https://github.com/vllm-project/vllm/pull/15191))|
+| **FP8 KV Cache** | 🟢 Functional on Hopper devices (<gh-pr:15191>)|
 | **Spec Decode** | 🚀 Optimized |
-| **Prompt Logprobs with Prefix Caching** | 🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))|
+| **Prompt Logprobs with Prefix Caching** | 🟡 Planned ([RFC #13414](gh-issue:13414))|
 | **Structured Output Alternative Backends** | 🟢 Functional |
 | **Request-level Structured Output Backend** | 🔴 Deprecated |
-| **best_of** | 🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))|
-| **Per-Request Logits Processors** | 🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360)) |
+| **best_of** | 🔴 Deprecated ([RFC #13361](gh-issue:13361))|
+| **Per-Request Logits Processors** | 🔴 Deprecated ([RFC #13360](gh-pr:13360)) |
 | **GPU <> CPU KV Cache Swapping** | 🔴 Deprecated |
 
 !!! note
@@ -153,7 +153,7 @@ Support for logprobs with post-sampling adjustments is in progress and will be a
 
 **Prompt Logprobs with Prefix Caching**
 
-Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
+Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).
 
 #### Deprecated Features
 
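As a small illustration of the constraint described in the hunk above (the model name is an arbitrary placeholder, not taken from the docs), prompt logprobs can be requested offline with prefix caching disabled, the offline equivalent of `--no-enable-prefix-caching` on the server CLI:

```python
# Sketch: prompt logprobs currently require prefix caching to be turned off.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=False)
params = SamplingParams(max_tokens=16, prompt_logprobs=1)

out = llm.generate(["vLLM V1 returns prompt logprobs when"], params)[0]
# One logprob dict per prompt token; the first entry is None by convention.
print(out.prompt_logprobs)
```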
@@ -161,11 +161,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha
 
 **Sampling features**
 
-- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
+- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
 - **Per-Request Logits Processors**: In V0, users could pass custom processing functions to adjust logits
   on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting **global logits
-  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
+  processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
 
 **KV Cache features**