[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)

Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 18:57:39 +08:00
parent d5b28447e0
commit 63f1fde277
8 changed files with 558 additions and 368 deletions
--- a/docs/source/getting_started/cpu-installation.rst
+++ b/docs/source/getting_started/cpu-installation.rst
@@ -5,11 +5,11 @@ Installation with CPU

 vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features:

- Tensor Parallel (``-tp = N``)
- Quantization (``INT8 W8A8, AWQ``)
-
-.. note::
-    More advanced features on `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.
+- Tensor Parallel 
+- Model Quantization (``INT8 W8A8, AWQ``)
+- Chunked-prefill
+- Prefix-caching
+- FP8-E5M2 KV-Caching (TODO)

 Table of contents:

--- a/docs/source/serving/compatibility_matrix.rst
+++ b/docs/source/serving/compatibility_matrix.rst
@@ -344,7 +344,7 @@ Feature x Hardware
     - ✅
     - ✅
     - ✅
-     - ✗ 
+     - ✅
     - ✅
   * - :ref:`APC <apc>`
     - `✗ <https://github.com/vllm-project/vllm/issues/3687>`__ 
@@ -352,7 +352,7 @@ Feature x Hardware
     - ✅
     - ✅
     - ✅
-     - ✗
+     - ✅
     - ✅
   * - :ref:`LoRA <lora>`
     - ✅