CacheFlow
A high-throughput and memory-efficient inference and serving engine for LLMs
Installation
pip install psutil numpy ray torch
pip install git+https://github.com/huggingface/transformers # Required for LLaMA.
pip install sentencepiece # Required for LlamaTokenizer.
pip install flash-attn # This may take up to 20 mins.
pip install -e .
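To verify that the installation succeeded, here is a quick sanity check (assuming the editable install exposes the package under the name cacheflow):
python -c "import cacheflow"  # Should exit silently if the install worked.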
Test simple server
ray start --head
python simple_server.py
The detailed arguments for simple_server.py can be found by running:
python simple_server.py --help
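For example, to serve a specific model (a hypothetical invocation: the --model flag and the model name are assumptions, so confirm the exact argument names with --help):
python simple_server.py --model facebook/opt-125m  # Hypothetical flag; verify with --help.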
FastAPI server
Install the following additional dependencies:
pip install fastapi uvicorn
To start the server:
ray start --head
python -m cacheflow.http_frontend.fastapi_frontend
To test the server:
python -m cacheflow.http_frontend.test_cli_client
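You can also query the server directly over HTTP. The request below is a sketch only: the port, endpoint path, and payload schema are assumptions, so consult test_cli_client for the actual request format.
# Hypothetical port, path, and payload; see cacheflow.http_frontend.test_cli_client for the real schema.
curl -X POST http://localhost:10002/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a"}'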
Gradio web server
Install the following additional dependencies:
pip install gradio
Start the server:
python -m cacheflow.http_frontend.fastapi_frontend
# In another terminal
python -m cacheflow.http_frontend.gradio_webserver
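Then open the web UI in a browser. Gradio listens on http://localhost:7860 by default, assuming gradio_webserver does not override the port.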
License
Apache-2.0