# CacheFlow

## Installation

```shell
pip install psutil numpy torch transformers
pip install flash-attn  # This may take up to 10 mins.
pip install -e .
```
## Run

```shell
ray start --head
python server.py [--tensor-parallel-size <N>]
```
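Once the server is up, clients can send generation requests to it. The sketch below is a minimal stdlib-only client; the endpoint path (`/generate`), port, and the `prompt`/`max_tokens`/`text` payload fields are assumptions for illustration, not CacheFlow's actual API, so adapt them to whatever routes `server.py` actually exposes.

```python
# Hypothetical client sketch: endpoint path, port, and payload field
# names are assumptions, not CacheFlow's documented API.
import json
from urllib import request

SERVER_URL = "http://localhost:8000/generate"  # assumed endpoint

def build_request(prompt: str, max_tokens: int = 128) -> request.Request:
    """Construct a JSON POST request for the (assumed) generate endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str, max_tokens: int = 128) -> str:
    """Send the request and pull the completion out of the response."""
    with request.urlopen(build_request(prompt, max_tokens)) as resp:
        return json.loads(resp.read())["text"]  # assumed response field
```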
## Description

A high-throughput and memory-efficient inference and serving engine for LLMs.
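To make the "memory-efficient" claim concrete: a serving engine of this kind manages the KV cache in fixed-size blocks allocated on demand, so memory grows with a sequence's actual length rather than a preallocated maximum. The toy sketch below illustrates that bookkeeping only; all class and method names are illustrative, not CacheFlow's internals, and a real engine would back each block with GPU tensors.

```python
# Toy sketch (illustrative only): block-based KV-cache bookkeeping.
# A real engine allocates GPU memory; here we only track block indices.

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)


class SequenceCache:
    """Maps a growing token sequence onto cache blocks on demand."""

    def __init__(self, allocator: BlockAllocator, block_size: int):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new block only when the current one is full, so memory
        # usage tracks the actual sequence length.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return all blocks to the shared pool when the request finishes.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8)
    seq = SequenceCache(alloc, block_size=16)
    for _ in range(40):           # 40 tokens -> ceil(40/16) = 3 blocks
        seq.append_token()
    print(len(seq.block_table))   # 3
    seq.release()
    print(len(alloc.free_blocks)) # 8
```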
## License

Apache-2.0