[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
This commit is contained in:
Thomas Parnell
2025-08-27 14:51:54 +02:00
committed by GitHub
parent a403d0fa41
commit 704432af3c
2 changed files with 11 additions and 8 deletions

View File

@ -107,14 +107,16 @@ to enable simultaneous generation and embedding using the same engine instance i
#### Mamba Models
Models using selective state-space mechanisms instead of standard transformer attention are supported.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported. Please note that these models currently require disabling prefix caching in V1.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported.
Please note that prefix caching is not yet supported for these models.
Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`). Please note that
these models currently require disabling prefix caching in V1.
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`).
Please note that prefix caching is not yet supported for these models.
Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
Please note that these models currently require disabling prefix caching and enforcing eager mode in V1.
Please note that prefix caching is not yet supported for these models.
It is also necessary to enforce eager mode for these models in V1.
#### Encoder-Decoder Models