FIX MOE issue in AutoRound format (#18586)

Signed-off-by: wenhuach21 <wenhua.cheng@intel.com>
Wenhua Cheng
2025-05-24 13:01:40 +08:00
committed by GitHub
parent 45ab403a1f
commit ec82c3e388
2 changed files with 31 additions and 29 deletions

README.md

@@ -58,7 +58,7 @@ vLLM is fast with:
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8.
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
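For context on the AutoRound entry added above, here is a minimal sketch of running an AutoRound-quantized checkpoint through vLLM's offline `LLM` API. The model id is hypothetical, and the sketch assumes vLLM picks up the AutoRound settings from the checkpoint's `quantization_config` rather than from an explicit flag; it is not part of this commit.

```python
# Minimal sketch: serve an AutoRound INT4 checkpoint with vLLM.
# The model id below is a placeholder; substitute any AutoRound-quantized
# checkpoint whose config.json carries the quantization_config.
from vllm import LLM, SamplingParams

llm = LLM(model="Intel/Qwen2-7B-int4-AutoRound")  # hypothetical model id
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What does AutoRound quantization do?"], sampling_params)
for output in outputs:
    # Each RequestOutput holds the prompt and its generated completions.
    print(output.prompt, output.outputs[0].text)
```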