Compare commits

...

183 Commits

Author SHA1 Message Date
c3d9640b09 Use gpu_1_queue 2025-07-30 18:08:58 -04:00
02f7780716 Consolidate args
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-07-30 14:40:00 -04:00
61d568a19d Update dockerfile image
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-30 13:01:55 -04:00
629468aa71 Use flashinfer-python location
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-07-30 12:58:07 -04:00
b15005dc12 Simplify!
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-07-30 12:41:14 -04:00
67ba6a9487 Fix index
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-07-30 11:56:36 -04:00
f626cc9300 Add FlashInfer wheel building capability to buildkite pipeline
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-07-30 11:48:04 -04:00
176bbce1db Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647)" (#21850)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-07-29 21:56:29 +00:00
a1873db23d docker: docker-aware precompiled wheel support (#21127)
Signed-off-by: dougbtv <dosmith@redhat.com>
2025-07-29 14:45:19 -07:00
a33ea28b1b Add flashinfer_python to CUDA wheel requirements (#21389)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-29 12:51:58 -07:00
7b49cb1c6b [Doc] update Contributing page's testing section (#18272)
Signed-off-by: David Xia <david@davidxia.com>
2025-07-29 10:32:46 -07:00
f03e9cf2bb [Doc] Add FusedMoE Modular Kernel Documentation (#21623)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-29 10:32:30 -07:00
37f86d9048 [Docs] use uv in GPU installation docs (#20277)
Signed-off-by: David Xia <david@davidxia.com>
2025-07-29 10:32:06 -07:00
58b11b24a6 [Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-29 10:34:00 -04:00
ad341c5194 [Bugfix]fix mixed bits and visual language model quantization in AutoRound (#21802)
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com>
2025-07-29 07:26:31 -07:00
759b87ef3e [TPU] Add an optimization doc on TPU (#21155)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-29 07:23:19 -07:00
f693b067a2 [Docs] Merge design docs for a V1 only future (#21832)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-29 07:22:50 -07:00
04e38500ee [Bugfix] VLLM_V1 supports passing other compilation levels (#19340)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-07-29 09:35:58 -04:00
ab714131e4 [Doc] Update compatibility matrix for pooling and multimodal models (#21831)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-29 06:29:51 -07:00
755fa8b657 [KVCache] Make KVCacheSpec hashable (#21791)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-07-29 19:58:29 +08:00
2470419119 [Docs] Fix the outdated URL for installing from vLLM binaries (#21523)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-29 04:56:27 -07:00
61a6905ab0 [Model] Refactor JambaForCausalLM (#21394)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-29 18:25:07 +08:00
37efc63b64 [V0 deprecation] Guided decoding (#21347)
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-29 03:15:30 -07:00
a4528f0cac [Model]: Fused MoE for nomic-embed-text-v2-moe (#18321)
Signed-off-by: isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-29 03:13:27 -07:00
a2480251ec [Doc] Link to RFC for pooling optimizations (#21806)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 23:53:18 -07:00
7234fe2685 [Misc] Rework process titles (#21780)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-29 05:14:47 +00:00
f1e2c095ec Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema (#21684)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-28 22:09:45 -07:00
12a223ef9b [AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef ROCM (#21766)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-07-29 03:35:37 +00:00
e18f085103 skip fusedmoe layer for start_load_kv (#21378)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
2025-07-28 18:59:44 -07:00
afa2607596 [CI] Parallelize Kernels MoE Test (#21764)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-28 18:56:24 -07:00
48b763d6b5 [Refactor] Merge Compressed Tensor FP8 CompressedTensorsW8A8Fp8MoEMethod and CompressedTensorsW8A8Fp8MoECutlassMethod (#21775)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-28 19:47:21 -06:00
947e982ede [Docs] Minimize spacing for supported_hardware.md table (#21779) 2025-07-28 18:46:39 -07:00
c6c9122d50 [Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning (#20396)
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Duncan Moss <djm.moss@gmail.com>
2025-07-28 23:13:58 +00:00
8aa1485fcf [Perf] Disable chunked local attention by default with llama4 (#21761)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-28 18:49:04 -04:00
89ac266b26 [Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels (#17112)
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-28 20:55:15 +00:00
c6f36cfa26 [Bugfix] DeepGEMM is not enabled on B200 due to _lazy_init() (#21472)
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-28 20:51:22 +00:00
b18b417fbf Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" (#21778)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-07-28 20:15:18 +00:00
9ba1c88a93 [AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-07-28 20:11:16 +00:00
e0e58f9729 [Bug] Enforce contiguous input for dynamic_scaled_fp8_quant and static_scaled_fp8_quant (#21773)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-28 19:55:48 +00:00
b361f14e39 [AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile (#21350)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-07-28 15:38:20 -04:00
01c753ed98 update flashinfer to v0.2.9rc2 (#21701)
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
2025-07-28 19:31:47 +00:00
94b71ae106 Use metavar to list the choices for a CLI arg when custom values are also accepted (#21760)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-28 19:31:10 +00:00
7d44c691b0 [P/D] Log warnings related to prefill KV expiry (#21753)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-28 18:40:53 +00:00
e17a4d3bf9 [Bugfix] Fix granite speech shape validation (#21762)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 14:19:21 -04:00
ec261b0291 [XPU] IPEX-optimized Punica Wrapper on XPU (#21703)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-28 16:43:37 +00:00
04fe61aa3d [CI/Build] Fix plugin tests (#21758)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 15:08:05 +00:00
25708d317a [Bugfix] Mistral crashes on tool with no description (#21167)
Signed-off-by: HugoMichard <hugo@harfanglab.fr>
2025-07-28 08:03:35 -07:00
0e18a5d058 [Misc] Reduce logs for model resolution (#21765)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 07:59:56 -07:00
34a20c49b3 [Logs] Change flashinfer sampler logs to once (#21759)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-28 06:59:51 -07:00
31084b3b1f [Bugfix][CI/Build] Update peft version in test requirement (#21729)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-28 06:17:43 -07:00
bccc43c033 [Bugfix]check health for engine core process exiting unexpectedly (#21728)
Signed-off-by: wuhang <wuhang6@huawei.com>
2025-07-28 06:17:31 -07:00
1395dd9c28 [Docs] Add revision date to rendered docs (#21752)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-28 06:12:46 -07:00
9ace2eaf35 [Bugfix] Improve JSON extraction in LlamaToolParser (#19024)
Signed-off-by: keru <keyang.ru@oracle.com>
Co-authored-by: keru <keyang.ru@oracle.com>
2025-07-28 12:36:58 +00:00
656c24f1b5 [Ernie 4.5] Name Change for Base 0.3B Model (#21735)
Signed-off-by: vasqu <antonprogamer@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 12:22:32 +00:00
63fe3a700f [PD] let p2p nccl toy proxy handle /chat/completions (#21734)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-28 11:45:50 +00:00
0ae970ed15 [Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme (#21744)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-28 04:26:49 -07:00
65e8466c37 [Bugfix] Fix environment variable setting in CPU Dockerfile (#21730)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-28 11:02:39 +00:00
1b769dccf3 [Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts (#21717)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-28 11:02:25 +00:00
2cc571199b [feature] add log non default args in LLM (#21680)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-07-28 02:21:22 -07:00
a4ed731546 [Model] Prioritize Transformers fallback over suffix matching (#21719)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 02:15:31 -07:00
d128d0d554 Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema (#21686)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-28 01:16:35 -07:00
a6c050286a [v1][mamba] Added mamba_type into MambaSpec (#21715)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-07-28 08:15:55 +00:00
139a7f07bd [BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled (#21707)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-28 07:18:47 +00:00
150d9e6337 [Bugfix] fix max-file-size type from str to int (#21675)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-28 00:06:52 -07:00
139a97ec56 [Bugfix] Fix shape checking for Fuyu (#21709)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-28 00:05:56 -07:00
18cc33dd60 [bugfix] fix profile impact benchmark results (#21507)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-07-27 22:44:24 -07:00
7656cf4cf3 [Bugfix] [issue-21565] Fix the incompatibility issue with stream and named function calling when Thinking is disabled (#21573)
Signed-off-by: wangzi <3220100013@zju.edu.cn>
Co-authored-by: wangzi <3220100013@zju.edu.cn>
2025-07-27 22:43:50 -07:00
3ea57a56d9 Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to … (#21683)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-27 22:37:23 -07:00
75856bc2cb Migrate GraniteSpeechAudioInputs to TensorSchema (#21682)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-07-27 22:37:20 -07:00
304dcdf575 Migrate GLMVImagePixelInputs to TensorSchema (#21679)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-27 22:36:11 -07:00
88e46c7c8d Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema (#21678)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-07-27 22:36:08 -07:00
d8937de4c8 Migrate Gemma3ImagePixelInputs to TensorSchema (#21676)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-27 22:36:05 -07:00
e626d286f5 [FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel (#21242) 2025-07-28 05:07:06 +00:00
c7ffe93d9c [Model] Support TP/PP/mamba2 kernel for PLaMo2 (#19674)
Signed-off-by: Shinichi Hemmi <shemmi@preferred.jp>
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com>
Co-authored-by: Calvin Metzger <metzger@preferred.jp>
Co-authored-by: Sixue Wang <cecilwang@preferred.jp>
2025-07-28 05:00:47 +00:00
15a72ac478 [V1] Exception Handling when Loading KV Cache from Remote Store (#21534)
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
2025-07-27 20:34:17 -07:00
04ff4be310 [Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 (#21700)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-27 20:12:18 -07:00
93269bb43e Fix GLM tool parser (#21668)
Co-authored-by: Chenhui Zhang <zhang.chenhui@outlook.com>
2025-07-28 10:46:38 +08:00
82acf2184d Fix typo for limit-mm-per-prompt in docs (#21697)
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
2025-07-27 19:45:37 -07:00
86ae693f20 [Deprecation][2/N] Replace --task with --runner and --convert (#21470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-27 19:42:40 -07:00
8f605ee309 [Attention] Make CutlassMLA the default backend for SM100 (blackwell) (#21626)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-27 20:13:00 +00:00
a9b2a1d704 [Misc] Refactor vllm config str (#21666) 2025-07-27 09:51:44 -07:00
57c22e57f9 Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934)
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
2025-07-27 07:08:00 -07:00
bda9d0535f [Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor (#21631)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-27 05:25:21 -07:00
3d847a3125 [VLM] Add video support for Intern-S1 (#21671)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-27 11:49:43 +00:00
5f8c9a425e Migrate Florence2ImagePixelInputs to TensorSchema (#21663)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-27 02:43:02 -07:00
1cbf951ba2 [Misc] add default value for file pattern arg (#21659)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-27 05:14:51 +00:00
a8936e5193 Refactor: Remove numpy dependency from LoggingStatLogger (#20529)
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-07-27 04:06:21 +00:00
01a395e9e7 [CI/Build][Doc] Clean up more docs that point to old bench scripts (#21667)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-07-27 04:02:12 +00:00
971948b846 Handle non-serializable objects in vllm bench (#21665) 2025-07-27 03:35:22 +00:00
eed2f463b2 [VLM] Support HF format Phi-4-MM model (#17121)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-07-26 20:07:57 -07:00
20950b29fb Migrate ChameleonImagePixelInputs to TensorSchema (#21657)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-26 19:34:25 -07:00
3339cba3ff Migrate FuyuImagePatchInputs to TensorSchema (#21662)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-26 19:34:14 -07:00
0b8caf9095 Migrate DeepseekVL2ImageInputs to TensorSchema (#21658)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-26 19:34:11 -07:00
ccf27cc4d4 Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to TensorSchema (#21656)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-27 10:33:52 +08:00
c657369841 support torch.compile for bailing moe (#21664) 2025-07-26 23:54:32 +00:00
6c66f28fa5 Remove xformers requirement for Mistral-format Pixtral and Mistral3 (#21154)
Signed-off-by: Wenchen Lo <charles761013@gmail.com>
2025-07-26 17:20:29 -06:00
de509ae8eb [NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels (#21411)
Signed-off-by: kaixih <kaixih@nvidia.com>
2025-07-26 07:10:36 -07:00
e7c4f9ee86 [CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI (#21355)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-07-26 07:10:14 -07:00
9094d11c5d [Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon (#21380)
Signed-off-by: Yeju Zhou <yejuzhou@outlook.com>
2025-07-26 07:09:57 -07:00
56e544f24b [Refactor] Remove moe_align_block_size_triton (#21335)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-26 07:08:29 -07:00
97d6c30cc9 [BugFix] Fix shared storage connector load kv only load attention layer (#21428)
Signed-off-by: David Chen <530634352@qq.com>
2025-07-26 07:07:40 -07:00
a40a8506df [Misc] Improve memory profiling debug message (#21429)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-07-26 07:07:21 -07:00
c215f5c877 [Bug] Fix has_flashinfer_moe Import Error when it is not installed (#21634)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-26 07:06:14 -07:00
1cd6eaba54 Support encoder-only models without KV-Cache (#21270)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-07-26 21:09:52 +08:00
f27fdfc3ed [Bugfix] Investigate Qwen2-VL failing test (#21527)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-26 06:09:29 -07:00
de10ff0b7c Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation (#21622)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-26 06:08:18 -07:00
9d197280fa Migrate AriaImagePixelInputs to TensorSchema for shape validation (#21620)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-26 06:08:15 -07:00
e98def439c [Take 2] Correctly kill vLLM processes after benchmarks (#21646)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-26 06:06:05 -07:00
05c1126f29 [Misc] remove unused try-except in pooling config check (#21618)
Signed-off-by: reidliu41 <reid201711@gmail.com>
2025-07-26 12:20:03 +00:00
875af38e01 Support Intern-S1 (#21628)
Signed-off-by: Roger Wang <hey@rogerw.me>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Roger Wang <hey@rogerw.me>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-26 19:14:04 +08:00
7728dd77bb [TPU][Test] Divide TPU v1 Test into 2 parts. (#21431) 2025-07-26 06:20:30 +00:00
2f6e6b33fb [Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison (#21612)
Signed-off-by: Alexandre Juan <a.juan@netheos.net>
2025-07-25 20:11:10 -07:00
a55c95096b Correctly kill vLLM processes after finishing serving benchmarks (#21641)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-25 19:06:21 -07:00
97349fe2bc [Docs] add offline serving multi-modal video input expamle Qwen2.5-VL (#21530)
Signed-off-by: David Chen <530634352@qq.com>
2025-07-25 18:37:32 -07:00
62965de5fe [Model] Ultravox: Support Llama 4 and Gemma 3 backends (#17818)
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>
Signed-off-by: Patrick Li <patrick8289@gmail.com>
Co-authored-by: Patrick Li <patrick8289@gmail.com>
2025-07-25 18:12:31 -07:00
7ae75fa6d0 [Feature] Add support for MoE models in the calibration-free RTN-based quantization (#20766)
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
2025-07-25 18:09:34 -07:00
f1b286b2fb [TPU] Update ptxla nightly version to 20250724 (#21555)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-25 17:09:00 -07:00
c7742d6113 [Bugfix] Always set RAY_ADDRESS for Ray actor before spawn (#21540)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-25 17:08:30 -07:00
cea96a0156 [Bugfix] Fix sync_and_slice_intermediate_tensors (#21537)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-25 17:07:58 -07:00
2eddd437ba Add interleaved RoPE test for Llama4 (Maverick) (#21478)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-07-25 17:07:26 -07:00
75d29cf4e1 [Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-25 17:07:07 -07:00
41d3082c41 Add Unsloth to RLHF.md (#21636) 2025-07-25 17:06:48 -07:00
7cfea0df39 [TPU][Test] Rollback PR-21550. (#21619)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-25 13:22:01 -07:00
5ac3168ee3 [Docs] add auto-round quantization readme (#21600)
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-25 08:52:42 -07:00
396ee94180 [CI] Unifying Dockerfiles for ARM and X86 Builds (#21343)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-07-25 07:33:56 -07:00
e189b50f53 Add support for Prithvi in Online serving mode (#21518)
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-25 07:01:27 -07:00
136d750f5f [Kernel] Improve machete memory bound perf (#21556)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-25 06:53:21 -07:00
b3caeb82e7 [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
2025-07-25 06:50:21 -07:00
eab2f3980c [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel (#20839)
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Signed-off-by: Yu Chin Fabian Lim <fabian.lim@gmail.com>
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Yu Chin Fabian Lim <fabian.lim@gmail.com>
2025-07-25 06:49:36 -07:00
9fe98d4250 [Frontend] Add request_id to the Request object so they can be controlled better via external load balancers (#21009)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
2025-07-25 06:49:11 -07:00
29c6fbe58c [MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B (#20931)
Signed-off-by: bigshanedogg <bigshane319@gmail.com>
2025-07-25 06:05:42 -07:00
c72f049cb4 [Model] Fix Ernie4.5MoE e_score_correction_bias parameter (#21586)
Signed-off-by: zhouchong <zhouchong03@baidu.com>
Co-authored-by: zhouchong <zhouchong03@baidu.com>
2025-07-25 06:02:53 -07:00
f3a683b7c9 [Bugfix][Logprobs] Fix logprobs op to support more backend (#21591)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-07-25 05:53:07 -07:00
46d81d6951 [V1] Get supported tasks from model runner instead of model config (#21585)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-25 05:36:45 -07:00
5c3f2628d5 [Quantization] Enable BNB support for more MoE models (#21370)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-25 03:57:34 -07:00
7311f74468 [Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' (#21579)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-07-25 03:42:23 -07:00
8ed01e32f7 Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#21598)
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com>
2025-07-25 02:36:55 -07:00
e38e96a3c0 [Tests] Harden DP tests (#21508)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-25 02:27:24 -07:00
40d86ee412 [TPU][Bugfix] fix OOM issue in CI test (#21550)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-24 23:01:53 -07:00
85d051f026 [Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS (#21262)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-07-24 22:54:23 -07:00
5140f54b89 [CI/Build] fix cpu_extension for apple silicon (#21195)
Signed-off-by: ignaciosica <mignacio.sica@gmail.com>
2025-07-24 22:53:59 -07:00
947edd099e [Misc][Tools] make max-model-len a parameter in auto_tune script (#21321)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-24 22:46:43 -07:00
fde60ee775 [Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings (#21479)
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
2025-07-25 13:46:06 +08:00
b38bc652ac [Model] Support tensor parallel for timm ViT in Deepseek_vl2 (#21494)
Signed-off-by: wzqd <1057337859@qq.com>
2025-07-24 22:45:16 -07:00
adaf2c6d4f [Bugfix] fix modelscope snapshot_download serialization (#21536)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-24 22:44:38 -07:00
42343f1f89 [CI] Update CODEOWNERS for CPU and Intel GPU (#21582)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-24 21:58:03 -07:00
965bc71b04 Integrate TensorSchema with shape validation for Phi3VImagePixelInputs (#21232)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-07-24 21:43:52 -07:00
807a328bb6 [Docs] Add requirements/common.txt to run unit tests (#21572)
Signed-off-by: Zhou Fang <fang.github@gmail.com>
2025-07-24 20:51:15 -07:00
e0be2c4d09 [TPU][Test] Temporarily suspend this MoE model in test_basic.py. (#21560)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-24 20:44:50 -07:00
9c8b2c2a8a [DP] Support api-server-count > 0 in hybrid DP LB mode (#21510)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-24 20:18:16 -07:00
2212cd6cfb [Bugfix] DeepGemm utils : Fix hardcoded type-cast (#21517)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-24 20:17:29 -07:00
ce3a9b1378 [Kernel] adding fused_moe configs for upcoming granite4 (#21332)
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-24 20:16:59 -07:00
2ce90e5b01 Fix GLM-4 PP Missing Layer When using with PP. (#21531)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2025-07-24 20:07:38 -07:00
633f6e804b [Bug] Fix DeepGemm Init Error (#21554)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-24 20:07:22 -07:00
b57296bb9a [Docs] Fix site_url for RunLLM (#21564)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 20:05:58 -07:00
34ddcf9ff4 [Frontend] run-batch supports V1 (#21541)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-24 20:05:55 -07:00
fe56180c7f [MoE] More balanced expert sharding (#21497)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
2025-07-24 15:56:08 -07:00
07d80d7b0e [TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. (#21539)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-24 15:33:04 -07:00
2dd72d23d9 update flashinfer to v0.2.9rc1 (#21485)
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
2025-07-24 14:06:11 -07:00
a6c7fb8cff [Docs] Add Expert Parallelism Initial Documentation (#21373)
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 12:36:06 -07:00
a7272c23d0 [Docs][minor] Fix broken gh-file link in distributed serving docs (#21543)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
2025-07-24 10:36:56 -07:00
6066284914 [P/D] Support CPU Transfer in NixlConnector (#18293)
Signed-off-by: Juncheng Gu <juncgu@gmail.com>
Signed-off-by: Richard Liu <ricliu@google.com>
Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com>
Co-authored-by: Richard Liu <ricliu@google.com>
2025-07-24 17:58:42 +01:00
1e9ea8e69d [P/D] Move FakeNixlWrapper to test dir (#21328)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-24 08:53:45 -07:00
d9f9a3fd96 [XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform (#21036)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
2025-07-24 23:23:36 +08:00
1b25f1fe75 Update flashinfer CUTLASS MoE Kernel (#21408)
Signed-off-by: Shu Wang. <shuw@nvidia.com>
2025-07-24 08:13:31 -07:00
e8cb0d0495 [Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access (#21465)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-24 08:13:24 -07:00
684174115d [Docs] Rewrite Distributed Inference and Serving guide (#20593)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 08:13:05 -07:00
cdb79ee63d [Docs] Update Tensorizer usage documentation (#21190)
Signed-off-by: Sanger Steel <sangersteel@gmail.com>
Signed-off-by: William Goldby <willgoldby@gmail.com>
Co-authored-by: William Goldby <willgoldby@gmail.com>
2025-07-24 06:56:18 -07:00
5a19a6c670 [Fix] Update mamba_ssm to 2.2.5 (#21421)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-24 03:25:41 -07:00
2ded067fd2 [Bugfix] Fix CUDA arch flags for MoE permute (#21426)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-24 03:23:59 -07:00
13abd0eaf9 [Model] Officially support Emu3 with Transformers backend (#21319)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 03:22:12 -07:00
61b8cea3b4 [Attention] Optimize FlashInfer MetadataBuilder Build call (#21137)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-24 03:21:46 -07:00
526078a96c bump flashinfer to v0.2.8 (#21385)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-07-24 03:20:38 -07:00
6da0078523 [Feat] Allow custom naming of vLLM processes (#21445)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-24 03:15:23 -07:00
73e3949d07 [Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() (#21501)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-24 03:13:40 -07:00
6eca337ce0 Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 (#20544)
Signed-off-by: okada <kokuzen@gmail.com>
Signed-off-by: okada shintarou <okada@preferred.jp>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 02:56:36 -07:00
85bda9e7d0 remove GLM-4.5 quantization wrong Code (#21435) 2025-07-24 01:52:43 -07:00
610852a423 [Core] Support model loader plugins (#21067)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-07-24 01:49:44 -07:00
f0f4de8f26 [Misc] Fix duplicate FusedMoEConfig debug messages (#21455)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-24 01:27:30 -07:00
fc5f756db4 [v1][Core] Clean up usages of SpecializedManager (#21407)
Signed-off-by: Zhou Fang <fang.github@gmail.com>
2025-07-24 00:40:11 -07:00
e74bfc70e4 [TPU][Bugfix] fix moe layer (#21340)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-07-24 00:38:39 -07:00
90eeea8f85 [Bugfix][ROCm] Fix for warp_size uses on host (#21205)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-07-24 00:37:19 -07:00
dde295a934 Deduplicate Transformers backend code using inheritance (#21461)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-24 00:16:23 -07:00
402 changed files with 16076 additions and 8354 deletions

View File

@ -74,7 +74,7 @@ Here is an example of one test inside `latency-tests.json`:
In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments to be used for `benchmark_latency.py`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
- The `parameters` attribute controls the command line arguments to be used for `vllm bench latency`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
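For illustration, here is a minimal sketch of what such a latency test entry and its expansion could look like. The entry shape follows the description above; the `test_name` value and the exact set of keys accepted by `run-performance-benchmarks.sh` are assumptions, not taken from the repository's `latency-tests.json`.

```bash
# Hypothetical latency test entry; keys mirror the CLI flags listed above,
# with underscores in place of dashes.
cat > latency-tests.json <<'EOF'
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
EOF

# After the underscore-to-dash conversion, the harness would run roughly:
#   vllm bench latency --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 \
#     --load-format dummy --num-iters-warmup 5 --num-iters 15
```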
@ -82,13 +82,13 @@ WARNING: The benchmarking script will save json results by itself, so please do
### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be fed to `benchmark_throughput.py`.
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be fed to `vllm bench throughput`.
The number from this test is also stable -- a slight change in the value of this number might vary the performance numbers by a lot.
### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
```json
[
@ -118,8 +118,8 @@ Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` attribute includes the command line arguments for the vLLM server.
- The `client-parameters` attribute includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` attribute controls the list of QPS values for the test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
- The `client-parameters` attribute includes the command line arguments for `vllm bench serve`.
- The `qps_list` attribute controls the list of QPS values for the test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`
The number from this test is less stable compared to the latency and throughput benchmarks (due to randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still indicates a meaningful difference in the results.
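As a rough illustration of how these attributes drive the client, the loop below shows one way a serving test entry could translate into `vllm bench serve` invocations. The model name, QPS values, and result file naming here are placeholders, not values from `serving-tests.json`.

```bash
# Hypothetical expansion of one serving test entry (names and values are illustrative).
qps_list=(1 4 16 "inf")
for qps in "${qps_list[@]}"; do
  vllm bench serve \
    --backend vllm \
    --model meta-llama/Meta-Llama-3-8B \
    --request-rate "$qps" \
    --save-result \
    --result-filename "serving_llama8B_qps_${qps}.json"
done
```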

View File

@ -100,7 +100,7 @@ if __name__ == "__main__":
raw_result = json.loads(f.read())
if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`
# this result is generated via `vllm bench serve` command
# attach the benchmarking command to raw_result
try:
@ -120,7 +120,7 @@ if __name__ == "__main__":
continue
elif "latency" in f.name:
# this result is generated via `benchmark_latency.py`
# this result is generated via `vllm bench latency` command
# attach the benchmarking command to raw_result
try:
@ -148,7 +148,7 @@ if __name__ == "__main__":
continue
elif "throughput" in f.name:
# this result is generated via `benchmark_throughput.py`
# this result is generated via `vllm bench throughput` command
# attach the benchmarking command to raw_result
try:

View File

@ -73,7 +73,7 @@ get_current_llm_serving_engine() {
echo "Container: vllm"
# move to a completely irrelevant directory, to avoid importing vllm from the current folder
export CURRENT_LLM_SERVING_ENGINE=vllm
return
fi
}
@ -95,12 +95,14 @@ json2args() {
}
kill_gpu_processes() {
pkill -f python
pkill -f python3
pkill -f tritonserver
pkill -f pt_main_thread
pkill -f text-generation
pkill -f lmdeploy
pkill -f '[p]ython'
pkill -f '[p]ython3'
pkill -f '[t]ritonserver'
pkill -f '[p]t_main_thread'
pkill -f '[t]ext-generation'
pkill -f '[l]mdeploy'
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pkill -f '[V]LLM'
while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1
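The bracketed patterns above are presumably the usual self-match guard for `pkill -f`/`pgrep -f`: the regex `[p]ython` still matches a command line containing `python`, but not one containing the literal text `[p]ython` (for example, this cleanup snippet itself when the whole script shows up in a process listing). The classic demonstration of the trick, shown with `grep`:

```bash
# Illustrative only; not part of the script above.
ps aux | grep 'python'      # the grep process itself appears in the output
ps aux | grep '[p]ython'    # still matches real python processes, but the grep
                            # command line contains "[p]ython", so it is excluded
```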
@ -125,7 +127,7 @@ ensure_installed() {
}
run_serving_tests() {
# run serving tests using `benchmark_serving.py`
# run serving tests using `vllm bench serve` command
# $1: a json file specifying serving test cases
local serving_test_file
@ -225,7 +227,7 @@ run_serving_tests() {
if [[ "$dataset_name" = "sharegpt" ]]; then
client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@ -246,7 +248,7 @@ run_serving_tests() {
sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')
client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@ -265,13 +267,13 @@ run_serving_tests() {
$client_args"
else
echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
exit 1
fi
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
@ -302,7 +304,7 @@ run_serving_tests() {
}
run_genai_perf_tests() {
# run genai-perf tests
# run genai-perf tests
# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
@ -311,14 +313,14 @@ run_genai_perf_tests() {
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
@ -369,10 +371,10 @@ run_genai_perf_tests() {
qps=$num_prompts
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
@ -413,7 +415,7 @@ prepare_dataset() {
do
cat sonnet.txt >> sonnet_4x.txt
done
}
main() {

View File

@ -126,7 +126,8 @@ kill_gpu_processes() {
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
# wait until GPU memory usage smaller than 1GB
if command -v nvidia-smi; then
@ -164,7 +165,7 @@ upload_to_buildkite() {
}
run_latency_tests() {
# run latency tests using `benchmark_latency.py`
# run latency tests using `vllm bench latency` command
# $1: a json file specifying latency test cases
local latency_test_file
@ -205,7 +206,7 @@ run_latency_tests() {
fi
fi
latency_command=" $latency_envs python3 benchmark_latency.py \
latency_command=" $latency_envs vllm bench latency \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"
@ -231,7 +232,7 @@ run_latency_tests() {
}
run_throughput_tests() {
# run throughput tests using `benchmark_throughput.py`
# run throughput tests using `vllm bench throughput`
# $1: a json file specifying throughput test cases
local throughput_test_file
@ -272,7 +273,7 @@ run_throughput_tests() {
fi
fi
throughput_command=" $throughput_envs python3 benchmark_throughput.py \
throughput_command=" $throughput_envs vllm bench throughput \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"
@ -297,7 +298,7 @@ run_throughput_tests() {
}
run_serving_tests() {
# run serving tests using `benchmark_serving.py`
# run serving tests using `vllm bench serve` command
# $1: a json file specifying serving test cases
local serving_test_file
@ -393,7 +394,7 @@ run_serving_tests() {
# pass the tensor parallel size to the client so that it can be displayed
# on the benchmark dashboard
client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
@ -447,7 +448,7 @@ main() {
(which jq) || (apt-get update && apt-get -y install jq)
(which lsof) || (apt-get update && apt-get install -y lsof)
# get the current IP address, required by benchmark_serving.py
# get the current IP address, required by `vllm bench serve` command
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# turn off the reporting of the status of each request, to clean up the terminal output
export VLLM_LOGGING_LEVEL="WARNING"

View File

@ -41,6 +41,20 @@ steps:
env:
DOCKER_BUILDKIT: "1"
- block: "Build FlashInfer wheel"
key: block-build-flashinfer-wheel
depends_on: ~
- label: "Build and upload FlashInfer wheel - CUDA 12.8"
depends_on: block-build-flashinfer-wheel
id: build-upload-flashinfer-wheel
agents:
queue: gpu_1_queue
commands:
- "bash .buildkite/scripts/build-upload-flashinfer-wheel.sh 12.8.1"
env:
DOCKER_BUILDKIT: "1"
- block: "Build release image"
depends_on: ~
key: block-release-image-build

View File

@ -0,0 +1,57 @@
#!/usr/bin/env bash
set -ex
CUDA_VERSION="${1:-12.8.1}"
# FlashInfer version controlled in tools/flashinfer-build.sh
echo "Building FlashInfer wheel for CUDA ${CUDA_VERSION} using vLLM Dockerfile"
# Build the FlashInfer wheel using the existing Dockerfile stage
DOCKER_BUILDKIT=1 docker build \
--build-arg max_jobs=16 \
--build-arg USE_SCCACHE=1 \
--build-arg CUDA_VERSION="${CUDA_VERSION}" \
--tag flashinfer-wheel-builder:${CUDA_VERSION} \
--target flashinfer-wheel-builder \
--progress plain \
-f docker/Dockerfile .
# Extract the wheel
mkdir -p artifacts/dist
docker run --rm -v $(pwd)/artifacts:/output_host flashinfer-wheel-builder:${CUDA_VERSION} \
bash -c 'cp /output/*.whl /output_host/dist/ && chmod -R a+rw /output_host'
# Upload the wheel to S3
echo "Uploading FlashInfer wheel to S3..."
wheel_files=(artifacts/dist/*.whl)
# Check that exactly one wheel is found
if [[ ${#wheel_files[@]} -ne 1 ]]; then
echo "Error: Expected exactly one wheel file in artifacts/dist/, but found ${#wheel_files[@]}"
exit 1
fi
# Get the single wheel file
wheel="${wheel_files[0]}"
echo "Processing FlashInfer wheel: $wheel"
# Rename 'linux' to 'manylinux1' in the wheel filename for compatibility
new_wheel="${wheel/linux/manylinux1}"
if [[ "$wheel" != "$new_wheel" ]]; then
mv -- "$wheel" "$new_wheel"
wheel="$new_wheel"
echo "Renamed wheel to: $wheel"
fi
# Extract the version from the wheel
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
wheel_name=$(basename "$wheel")
echo "FlashInfer version: $version"
# Upload the wheel to S3 under flashinfer-python directory
aws s3 cp "$wheel" "s3://vllm-wheels/flashinfer-python/"
echo "✅ FlashInfer wheel built and uploaded successfully for CUDA ${CUDA_VERSION}"
echo "📦 Wheel: $wheel_name (version $version)"
ls -la artifacts/dist/
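One small note on the rename step in this script: `${wheel/linux/manylinux1}` is plain bash pattern substitution on the first match. A quick sketch with a made-up wheel filename:

```bash
# Made-up filename, for illustration only.
wheel="artifacts/dist/flashinfer_python-0.2.9rc2-cp39-abi3-linux_x86_64.whl"
echo "${wheel/linux/manylinux1}"
# -> artifacts/dist/flashinfer_python-0.2.9rc2-cp39-abi3-manylinux1_x86_64.whl
```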

View File

@ -13,9 +13,9 @@ NUMA_NODE=${NUMA_NODE:-1}
export CMAKE_BUILD_PARALLEL_LEVEL=32
# Setup cleanup
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"-avx2 || true;
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"-avx2 || true;
}
trap remove_docker_container EXIT
remove_docker_container
@ -69,7 +69,7 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"
# Note: disable it until supports V1
# Run AWQ test
@ -78,23 +78,23 @@ function cpu_tests() {
# VLLM_USE_V1=0 pytest -s -v \
# tests/quantization/test_ipex_quant.py"
# online serving
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions'
# Run multi-lora tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/lora/test_qwen2vl.py"
# online serving
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
vllm bench serve \
--backend vllm \
--dataset-name random \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions'
}
# All of CPU tests are expected to be finished less than 40 mins.

View File

@ -0,0 +1,166 @@
#!/bin/bash
set -xu
remove_docker_container() {
docker rm -f tpu-test || true;
docker rm -f vllm-tpu || true;
}
trap remove_docker_container EXIT
# Remove the container that might not be cleaned up in the previous run.
remove_docker_container
# Build the docker image.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
# Set up cleanup.
cleanup_docker() {
# Get Docker's root directory
docker_root=$(docker info -f '{{.DockerRootDir}}')
if [ -z "$docker_root" ]; then
echo "Failed to determine Docker root directory."
exit 1
fi
echo "Docker root directory: $docker_root"
# Check disk usage of the filesystem where Docker's root directory is located
disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
# Define the threshold
threshold=70
if [ "$disk_usage" -gt "$threshold" ]; then
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune --force --filter "until=72h" --all
echo "Docker images and volumes cleanup completed."
else
echo "Disk usage is below $threshold%. No cleanup needed."
fi
}
cleanup_docker
# For HF_TOKEN.
source /etc/environment
docker run --privileged --net host --shm-size=16G -it \
-e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
vllm-tpu /bin/bash -c '
set -e # Exit immediately if a command exits with a non-zero status.
set -u # Treat unset variables as an error.
echo "--- Starting script inside Docker container ---"
# Create results directory
RESULTS_DIR=$(mktemp -d)
# If mktemp fails, set -e will cause the script to exit.
echo "Results will be stored in: $RESULTS_DIR"
# Install dependencies
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \
&& python3 -m pip install --progress-bar off hf-transfer
echo "--- Python dependencies installed ---"
export VLLM_USE_V1=1
export VLLM_XLA_CHECK_RECOMPILATION=1
export VLLM_XLA_CACHE_PATH=
echo "Using VLLM V1"
echo "--- Hardware Information ---"
# tpu-info
echo "--- Starting Tests ---"
set +e
overall_script_exit_code=0
# --- Test Definitions ---
# If a test fails, this function will print logs and will not cause the main script to exit.
run_test() {
local test_num=$1
local test_name=$2
local test_command=$3
local log_file="$RESULTS_DIR/test_${test_num}.log"
local actual_exit_code
echo "--- TEST_$test_num: Running $test_name ---"
# Execute the test command.
eval "$test_command" > >(tee -a "$log_file") 2> >(tee -a "$log_file" >&2)
actual_exit_code=$?
echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" # This goes to main log
echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" >> "$log_file" # Also to per-test log
if [ "$actual_exit_code" -ne 0 ]; then
echo "TEST_$test_num ($test_name) FAILED with exit code $actual_exit_code." >&2
echo "--- Log for failed TEST_$test_num ($test_name) ---" >&2
if [ -f "$log_file" ]; then
cat "$log_file" >&2
else
echo "Log file $log_file not found for TEST_$test_num ($test_name)." >&2
fi
echo "--- End of log for TEST_$test_num ($test_name) ---" >&2
return "$actual_exit_code" # Return the failure code
else
echo "TEST_$test_num ($test_name) PASSED."
return 0 # Return success
fi
}
# Helper function to call run_test and update the overall script exit code
run_and_track_test() {
local test_num_arg="$1"
local test_name_arg="$2"
local test_command_arg="$3"
# Run the test
run_test "$test_num_arg" "$test_name_arg" "$test_command_arg"
local test_specific_exit_code=$?
# If the test failed, set the overall script exit code to 1
if [ "$test_specific_exit_code" -ne 0 ]; then
# No need for extra echo here, run_test already logged the failure.
overall_script_exit_code=1
fi
}
# --- Actual Test Execution ---
run_and_track_test 1 "test_struct_output_generate.py" \
"HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
run_and_track_test 2 "test_moe_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
run_and_track_test 3 "test_lora.py" \
"VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
run_and_track_test 4 "test_tpu_qkv_linear.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
run_and_track_test 5 "test_spmd_model_weight_loading.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
run_and_track_test 6 "test_kv_cache_update_kernel.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
# After all tests have been attempted, exit with the overall status.
if [ "$overall_script_exit_code" -ne 0 ]; then
echo "--- One or more tests FAILED. Overall script exiting with failure code 1. ---"
else
echo "--- All tests have completed and PASSED. Overall script exiting with success code 0. ---"
fi
exit "$overall_script_exit_code"
' # IMPORTANT: This is the closing single quote for the bash -c "..." command. Ensure it is present and correct.
# Capture the exit code of the docker run command
DOCKER_RUN_EXIT_CODE=$?
# The trap will run for cleanup.
# Exit the main script with the Docker run command's exit code.
if [ "$DOCKER_RUN_EXIT_CODE" -ne 0 ]; then
echo "Docker run command failed with exit code $DOCKER_RUN_EXIT_CODE."
exit "$DOCKER_RUN_EXIT_CODE"
else
echo "Docker run command completed successfully."
exit 0
fi
# TODO: This test fails because it uses RANDOM_SEED sampling
# pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
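The `run_test`/`run_and_track_test` pattern in this script is per-test exit-code aggregation, so one failing test does not abort the remaining ones. Distilled to its core, as a sketch (not part of the script above):

```bash
#!/usr/bin/env bash
overall_exit_code=0

run_and_track() {
  local name=$1; shift
  if "$@"; then
    echo "$name PASSED"
  else
    echo "$name FAILED"
    overall_exit_code=1   # remember the failure but keep going
  fi
}

run_and_track "always-fails" false
run_and_track "always-passes" true
exit "$overall_exit_code"
```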

View File

@ -135,7 +135,7 @@ run_and_track_test 1 "test_compilation.py" \
run_and_track_test 2 "test_basic.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py"
run_and_track_test 3 "test_accuracy.py::test_lm_eval_accuracy_v1_engine" \
"python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine"
"HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine"
run_and_track_test 4 "test_quantization_accuracy.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py"
run_and_track_test 5 "examples/offline_inference/tpu.py" \
@ -150,18 +150,6 @@ run_and_track_test 9 "test_multimodal.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py"
run_and_track_test 10 "test_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py"
run_and_track_test 11 "test_struct_output_generate.py" \
"HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
run_and_track_test 12 "test_moe_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
run_and_track_test 13 "test_lora.py" \
"VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
run_and_track_test 14 "test_tpu_qkv_linear.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
run_and_track_test 15 "test_spmd_model_weight_loading.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
run_and_track_test 16 "test_kv_cache_update_kernel.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
# After all tests have been attempted, exit with the overall status.
if [ "$overall_script_exit_code" -ne 0 ]; then

View File

@ -11,10 +11,10 @@ cd "$(dirname "${BASH_SOURCE[0]}")/../.."
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
# run python-based benchmarks and upload the result to buildkite
python3 benchmarks/benchmark_latency.py --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
vllm bench latency --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
bench_latency_exit_code=$?
python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
vllm bench throughput --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
bench_throughput_exit_code=$?
# run server-based benchmarks and upload the result to buildkite
@ -24,7 +24,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r
# wait for server to start, timeout after 600 seconds
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \

View File

@ -77,7 +77,7 @@ done
echo "run benchmark test..."
echo "logging to $BM_LOG"
echo
python benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name sonnet \

View File

@ -128,11 +128,10 @@ steps:
- tests/entrypoints/offline_mode
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
- label: Entrypoints Test (API Server) # 40min
@ -403,17 +402,18 @@ steps:
- vllm/model_executor/layers/quantization
- tests/kernels/quantization
commands:
- pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
- pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 2
- label: Kernels MoE Test
- label: Kernels MoE Test %N
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- csrc/moe/
- tests/kernels/moe
- vllm/model_executor/layers/fused_moe/
commands:
- pytest -v -s kernels/moe
- pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 2
- label: Kernels Mamba Test
mirror_hardwares: [amdexperimental, amdproduction]
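For context on the sharded invocation above: the `--shard-id`/`--num-shards` flags split one pytest session across jobs (presumably via the pytest-shard plugin, which provides flags with these names), so with `parallelism: 2` each Buildkite job runs roughly half of the MoE kernel tests:

```bash
# What each of the two parallel jobs effectively runs
# (BUILDKITE_PARALLEL_JOB and BUILDKITE_PARALLEL_JOB_COUNT are substituted by the agent):
pytest -v -s kernels/moe --shard-id=0 --num-shards=2   # job 0
pytest -v -s kernels/moe --shard-id=1 --num-shards=2   # job 1
```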

.github/CODEOWNERS (vendored, 15 changes)
View File

@ -10,7 +10,6 @@
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
/vllm/model_executor/guided_decoding @mgoin @russellb @aarnphm
/vllm/multimodal @DarkLight1337 @ywang96
/vllm/vllm_flash_attn @LucasWilkinson
/vllm/lora @jeejeelee
@ -35,9 +34,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo @aarnphm
/tests/entrypoints/llm/test_guided_generate.py @mgoin @russellb @aarnphm
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/model_executor/test_guided_processors.py @mgoin @russellb
/tests/models @DarkLight1337 @ywang96
/tests/multi_step @alexm-redhat @comaniac
/tests/multimodal @DarkLight1337 @ywang96
@ -52,3 +49,15 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
# Docs
/docs @hmellor
mkdocs.yaml @hmellor
# CPU
/vllm/v1/worker/^cpu @bigPYJ1151
/csrc/cpu @bigPYJ1151
/vllm/platforms/cpu.py @bigPYJ1151
/cmake/cpu_extension.cmake @bigPYJ1151
/docker/Dockerfile.cpu @bigPYJ1151
# Intel GPU
/vllm/v1/worker/^xpu @jikunshang
/vllm/platforms/xpu.py @jikunshang
/docker/Dockerfile.xpu @jikunshang

.github/mergify.yml

@ -149,9 +149,6 @@ pull_request_rules:
- files=examples/offline_inference/structured_outputs.py
- files=examples/online_serving/openai_chat_completion_structured_outputs.py
- files=examples/online_serving/openai_chat_completion_structured_outputs_with_reasoning.py
- files~=^vllm/model_executor/guided_decoding/
- files=tests/model_executor/test_guided_processors.py
- files=tests/entrypoints/llm/test_guided_generate.py
- files~=^tests/v1/structured_output/
- files=tests/v1/entrypoints/llm/test_guided_generate.py
- files~=^vllm/v1/structured_output/


@ -7,7 +7,7 @@ permissions:
jobs:
lint-and-deploy:
runs-on: ubuntu-latest
runs-on: ubuntu-24.04-arm
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2


@ -7,6 +7,9 @@ build:
os: ubuntu-22.04
tools:
python: "3.12"
jobs:
post_checkout:
- git fetch --unshallow || true
mkdocs:
configuration: mkdocs.yaml


@ -635,7 +635,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"in CUDA target architectures.")
endif()
endif()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/blockwise_scaled_group_mm_sm100.cu")
@ -768,6 +768,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu")
endif()
if(VLLM_GPU_LANG STREQUAL "CUDA")
set(MOE_PERMUTE_SRC
"csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
"csrc/moe/moe_permute_unpermute_op.cu")
list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
endif()
set_gencode_flags_for_srcs(
SRCS "${VLLM_MOE_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
@ -836,17 +844,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
endif()
if(VLLM_GPU_LANG STREQUAL "CUDA")
set(MOE_PERMUTE_SRC
"csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
"csrc/moe/moe_permute_unpermute_op.cu")
set_gencode_flags_for_srcs(
SRCS "${MARLIN_PERMUTE_SRC}"
CUDA_ARCHS "${MOE_PERMUTE_ARCHS}")
list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
endif()
message(STATUS "Enabling moe extension.")
define_gpu_extension_target(
_moe_C


@ -98,7 +98,7 @@ Then run the benchmarking script
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
@ -111,25 +111,25 @@ If successful, you will see the following output
```
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total Token throughput (tok/s): 619.85
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total Token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
```
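The same metrics can also be written to disk for later comparison. A minimal sketch that reuses flags appearing elsewhere in this document; the result directory and filename below are illustrative:

```bash
# rerun the benchmark and save the structured results next to the console summary
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10 \
  --save-result \
  --result-dir ./benchmark_results \
  --result-filename hermes_sharegpt_10.json
```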
@ -141,7 +141,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
```
```
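Such a file can be created directly from the shell. A minimal sketch, where the filename `custom_dataset.jsonl` is illustrative rather than one mandated by the script:

```bash
# write a JSONL dataset with one JSON object per line (prompts taken from the example above)
cat > custom_dataset.jsonl <<'EOF'
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
EOF
```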
```bash
# start server
@ -150,7 +150,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests
```bash
# run benchmarking script
python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
vllm bench serve --port 9001 --save-result --save-detailed \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/completions \
@ -174,7 +174,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```
```bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
@ -194,7 +194,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
```
``` bash
python3 benchmarks/benchmark_serving.py \
vllm bench serve \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name hf \
--dataset-path likaixin/InstructCoder \
@ -210,7 +210,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
**`lmms-lab/LLaVA-OneVision-Data`**
```bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
@ -224,7 +224,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
**`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
@ -237,7 +237,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
**`AI-MO/aimo-validation-aime`**
``` bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \
@ -248,7 +248,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
**`philschmid/mt-bench`**
``` bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--model Qwen/QwQ-32B \
--dataset-name hf \
--dataset-path philschmid/mt-bench \
@ -261,7 +261,7 @@ When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:
```bash
python3 vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
@ -296,7 +296,7 @@ The following arguments can be used to control the ramp-up:
<br/>
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
@ -314,7 +314,7 @@ Total num output tokens: 1500
**VisionArena Benchmark for Vision Language Models**
``` bash
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@ -336,7 +336,7 @@ Total num output tokens: 1280
``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
@ -360,7 +360,7 @@ Total num output tokens: 204800
**`lmms-lab/LLaVA-OneVision-Data`**
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@ -373,7 +373,7 @@ python3 vllm/benchmarks/benchmark_throughput.py \
**`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
@ -385,7 +385,7 @@ python3 vllm/benchmarks/benchmark_throughput.py \
**`AI-MO/aimo-validation-aime`**
```bash
python3 benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model Qwen/QwQ-32B \
--backend vllm \
--dataset-name hf \
@ -399,7 +399,7 @@ python3 benchmarks/benchmark_throughput.py \
``` bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--model meta-llama/Llama-2-7b-hf \
--backend vllm \
--dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \


@ -39,6 +39,7 @@ You must set the following variables at the top of the script before execution.
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MAX_MODEL_LEN` | **Required.** Max model length. | `4096` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
@ -69,6 +70,7 @@ Here are a few examples of how to configure the script for different goals:
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
```
@ -80,6 +82,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=500
```
@ -91,6 +94,7 @@ MAX_LATENCY_ALLOWED_MS=500
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=60
MAX_LATENCY_ALLOWED_MS=500
```
@ -101,7 +105,7 @@ After the script finishes, you will find the results in a new, timestamped direc
- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
- `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
- `bm_log_...txt`: The log output from the `benchmark_serving.py` script for each benchmark run.
- `bm_log_...txt`: The log output from the `vllm bench serve` command for each benchmark run.
- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.
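Since `result.txt` ends with the overall best parameters, that summary can be pulled out directly. A minimal sketch, assuming the default `$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/` layout described above and `BASE` set as in the script:

```bash
# print the concluding best-parameters line from the most recent auto-tune run
LATEST_RUN=$(ls -td "$BASE"/auto-benchmark/*/ | head -n 1)
tail -n 1 "$LATEST_RUN/result.txt"
```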


@ -1,16 +1,18 @@
#!/bin/bash
# This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# See details in README (benchmarks/auto_tune/README.md).
TAG=$(date +"%Y_%m_%d_%H_%M")
BASE=""
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BASE="$SCRIPT_DIR/../../.."
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""
INPUT_LEN=4000
OUTPUT_LEN=16
MAX_MODEL_LEN=4096
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000
NUM_SEQS_LIST="128 256"
@ -36,6 +38,13 @@ current_hash=$(git rev-parse HEAD)
echo "hash:$current_hash" >> "$RESULT"
echo "current_hash: $current_hash"
TOTAL_LEN=$((INPUT_LEN + OUTPUT_LEN))
RED='\033[0;31m'
if (( TOTAL_LEN > MAX_MODEL_LEN )); then
echo -e "${RED}FAILED: INPUT_LEN($INPUT_LEN) + OUTPUT_LEN($OUTPUT_LEN) = $TOTAL_LEN, which is > MAX_MODEL_LEN = $MAX_MODEL_LEN.\033[0m" >&2
exit 1
fi
best_throughput=0
best_max_num_seqs=0
best_num_batched_tokens=0
@ -47,7 +56,7 @@ start_server() {
local max_num_batched_tokens=$3
local vllm_log=$4
local profile_dir=$5
pkill -f vllm
VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
@ -60,13 +69,13 @@ start_server() {
--enable-prefix-caching \
--load-format dummy \
--download-dir "$DOWNLOAD_DIR" \
--max-model-len $(( INPUT_LEN+OUTPUT_LEN )) > "$vllm_log" 2>&1 &
--max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 &
# wait for 10 minutes...
server_started=0
for i in {1..60}; do
for i in {1..60}; do
RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
if [[ "$STATUS_CODE" -eq 200 ]]; then
server_started=1
break
@ -89,10 +98,10 @@ update_best_profile() {
selected_profile_file=
if [[ "$SYSTEM" == "TPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
fi
fi
if [[ "$SYSTEM" == "GPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}"
fi
fi
rm -f $PROFILE_PATH/*
cp $selected_profile_file $PROFILE_PATH
}
@ -120,14 +129,14 @@ run_benchmark() {
echo "server started."
fi
echo
echo "run benchmark test..."
meet_latency_requirement=0
# get a basic qps by using request-rate inf
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt"
prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
adjusted_input_len=$(( INPUT_LEN - prefix_len ))
python3 benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name random \
@ -160,7 +169,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
curl -X POST http://0.0.0.0:8004/reset_prefix_cache
sleep 5
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
python3 benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model $MODEL \
--dataset-name random \
@ -245,4 +254,3 @@ done
echo "finish permutations"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"


@ -11,6 +11,7 @@ from typing import Any, Optional
import numpy as np
from tqdm import tqdm
from typing_extensions import deprecated
import vllm.envs as envs
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
@ -34,6 +35,10 @@ def save_to_pytorch_benchmark_format(
write_to_json(pt_file, pt_records)
@deprecated(
"benchmark_latency.py is deprecated and will be removed in a "
"future version. Please use 'vllm bench latency' instead.",
)
def main(args: argparse.Namespace):
print(args)


@ -38,6 +38,7 @@ from typing import Any, Literal, Optional
import numpy as np
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
from typing_extensions import deprecated
from backend_request_func import (
ASYNC_REQUEST_FUNCS,
@ -395,20 +396,6 @@ async def benchmark(
tasks.append(asyncio.create_task(task))
outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_prompt,
api_url=base_url + "/stop_profile",
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler stopped")
if pbar is not None:
pbar.close()
@ -517,6 +504,20 @@ async def benchmark(
print("=" * 50)
if profile:
print("Stopping profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_prompt,
api_url=base_url + "/stop_profile",
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler stopped")
return result
@ -593,6 +594,10 @@ def save_to_pytorch_benchmark_format(
write_to_json(pt_file, pt_records)
@deprecated(
"benchmark_serving.py is deprecated and will be removed in a future "
"version. Please use 'vllm bench serve' instead.",
)
def main(args: argparse.Namespace):
print(args)
random.seed(args.seed)


@ -538,20 +538,6 @@ async def benchmark(
)
outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_request.prompt,
api_url=base_url + "/stop_profile",
prompt_len=test_request.prompt_len,
output_len=test_request.expected_output_len,
extra_body={test_request.structure_type: test_request.schema},
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler stopped")
if pbar is not None:
pbar.close()
@ -666,6 +652,20 @@ async def benchmark(
print("=" * 50)
if profile:
print("Stopping profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_request.prompt,
api_url=base_url + "/stop_profile",
prompt_len=test_request.prompt_len,
output_len=test_request.expected_output_len,
extra_body={test_request.structure_type: test_request.schema},
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler stopped")
return result, ret


@ -15,6 +15,7 @@ import torch
import uvloop
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerBase
from typing_extensions import deprecated
from benchmark_dataset import (
AIMODataset,
@ -167,7 +168,8 @@ async def run_vllm_async(
from vllm import SamplingParams
async with build_async_engine_client_from_engine_args(
engine_args, disable_frontend_multiprocessing
engine_args,
disable_frontend_multiprocessing=disable_frontend_multiprocessing,
) as llm:
model_config = await llm.get_model_config()
assert all(
@ -381,6 +383,10 @@ def get_requests(args, tokenizer):
return dataset_cls(**common_kwargs).sample(**sample_kwargs)
@deprecated(
"benchmark_throughput.py is deprecated and will be removed in a "
"future version. Please use 'vllm bench throughput' instead.",
)
def main(args: argparse.Namespace):
if args.seed is None:
args.seed = 0


@ -3,7 +3,7 @@
# benchmark the overhead of disaggregated prefill.
# methodology:
# - send all request to prefill vLLM instance. It will buffer KV cache.
# - then send all request to decode instance.
# - then send all request to decode instance.
# - The TTFT of decode instance is the overhead.
set -ex
@ -12,6 +12,8 @@ kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
sleep 10
# remove vllm config file
@ -61,7 +63,7 @@ benchmark() {
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
@ -76,38 +78,38 @@ benchmark() {
wait_for_server 8200
# let the prefill instance finish prefill
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \
--request-rate "inf"
vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \
--request-rate "inf"
# send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl.
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps"
vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps"
kill_gpu_processes
}


@ -18,6 +18,8 @@ kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1
}
@ -58,7 +60,7 @@ launch_chunked_prefill() {
launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
@ -97,20 +99,20 @@ benchmark() {
output_len=$2
tag=$3
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps"
vllm bench serve \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps"
sleep 2
}


@ -5,9 +5,8 @@ import itertools
import torch
from vllm import _custom_ops as ops
from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
moe_align_block_size_triton,
moe_align_block_size,
)
from vllm.triton_utils import triton
@ -21,60 +20,6 @@ def get_topk_ids(num_tokens: int, num_experts: int, topk: int) -> torch.Tensor:
)
def check_correctness(num_tokens, num_experts=256, block_size=256, topk=8):
"""
Verifies vllm vs. Triton
"""
topk_ids = get_topk_ids(num_tokens, num_experts, topk)
# 1. malloc space for triton and vllm
# malloc enough space (max_num_tokens_padded) for the sorted ids
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids_triton = torch.empty(
(max_num_tokens_padded,), dtype=torch.int32, device="cuda"
)
expert_ids_triton = torch.empty(
(max_num_tokens_padded // block_size,), dtype=torch.int32, device="cuda"
)
num_tokens_post_pad_triton = torch.empty((1,), dtype=torch.int32, device="cuda")
sorted_ids_vllm = torch.empty_like(sorted_ids_triton)
expert_ids_vllm = torch.empty_like(expert_ids_triton)
num_tokens_post_pad_vllm = torch.empty_like(num_tokens_post_pad_triton)
# 2. run implementations
moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids_triton,
expert_ids_triton,
num_tokens_post_pad_triton,
)
ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids_vllm,
expert_ids_vllm,
num_tokens_post_pad_vllm,
)
print(f"✅ VLLM implementation works with {num_experts} experts!")
# 3. compare results
if torch.allclose(expert_ids_triton, expert_ids_vllm) and torch.allclose(
num_tokens_post_pad_triton, num_tokens_post_pad_vllm
):
print("✅ Triton and VLLM implementations match.")
else:
print("❌ Triton and VLLM implementations DO NOT match.")
print("Triton expert_ids:", expert_ids_triton)
print("VLLM expert_ids:", expert_ids_vllm)
print("Triton num_tokens_post_pad:", num_tokens_post_pad_triton)
print("VLLM num_tokens_post_pad:", num_tokens_post_pad_vllm)
# test configurations
num_tokens_range = [1, 16, 256, 4096]
num_experts_range = [16, 64, 224, 256, 280, 512]
@ -87,8 +32,8 @@ configs = list(itertools.product(num_tokens_range, num_experts_range, topk_range
x_names=["num_tokens", "num_experts", "topk"],
x_vals=configs,
line_arg="provider",
line_vals=["vllm", "triton"], # "triton"
line_names=["VLLM", "Triton"], # "Triton"
line_vals=["vllm"],
line_names=["vLLM"],
plot_name="moe-align-block-size-performance",
args={},
)
@ -98,36 +43,11 @@ def benchmark(num_tokens, num_experts, topk, provider):
block_size = 256
topk_ids = get_topk_ids(num_tokens, num_experts, topk)
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids = torch.empty((max_num_tokens_padded,), dtype=torch.int32, device="cuda")
max_num_m_blocks = max_num_tokens_padded // block_size
expert_ids = torch.empty((max_num_m_blocks,), dtype=torch.int32, device="cuda")
num_tokens_post_pad = torch.empty((1,), dtype=torch.int32, device="cuda")
quantiles = [0.5, 0.2, 0.8]
if provider == "vllm":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
quantiles=quantiles,
)
elif provider == "triton":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
lambda: moe_align_block_size(topk_ids, block_size, num_experts),
quantiles=quantiles,
)
@ -151,6 +71,4 @@ if __name__ == "__main__":
)
args = parser.parse_args()
print("Running correctness check...")
check_correctness(num_tokens=1024, num_experts=args.num_experts, topk=args.topk)
benchmark.run(print_data=True, show_plots=True)


@ -8,12 +8,13 @@ import ray
import torch
from transformers import AutoConfig
from vllm.model_executor.layers.fused_moe.deep_gemm_moe import (
from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
_moe_permute,
_moe_unpermute_and_reduce,
moe_permute,
moe_unpermute,
)
from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import *
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
@ -63,18 +64,19 @@ def benchmark_permute(
def run():
if use_customized_permute:
(permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = (
moe_permute(
qhidden_states,
topk_weights=topk_weights,
topk_ids=topk_ids,
token_expert_indices=token_expert_indices,
topk=topk,
n_expert=num_experts,
n_local_expert=num_experts,
expert_map=None,
align_block_size=align_block_size,
)
(
permuted_hidden_states,
a1q_scale,
first_token_off,
inv_perm_idx,
m_indices,
) = moe_permute(
qhidden_states,
a1q_scale=None,
topk_ids=topk_ids,
n_expert=num_experts,
expert_map=None,
align_block_size=align_block_size,
)
else:
(
@ -150,18 +152,19 @@ def benchmark_unpermute(
def prepare():
if use_customized_permute:
(permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = (
moe_permute(
qhidden_states,
topk_weights=topk_weights,
topk_ids=topk_ids,
token_expert_indices=token_expert_indices,
topk=topk,
n_expert=num_experts,
n_local_expert=num_experts,
expert_map=None,
align_block_size=align_block_size,
)
(
permuted_hidden_states,
a1q_scale,
first_token_off,
inv_perm_idx,
m_indices,
) = moe_permute(
qhidden_states,
a1q_scale=None,
topk_ids=topk_ids,
n_expert=num_experts,
expert_map=None,
align_block_size=align_block_size,
)
# convert to fp16/bf16 as gemm output
return (
@ -191,16 +194,19 @@ def benchmark_unpermute(
def run(input: tuple):
if use_customized_permute:
(permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = input
(
permuted_hidden_states,
first_token_off,
inv_perm_idx,
m_indices,
) = input
output = torch.empty_like(hidden_states)
moe_unpermute(
output,
permuted_hidden_states,
topk_weights,
topk_ids,
inv_perm_idx,
first_token_off,
topk,
num_experts,
num_experts,
)
else:
(
@ -211,7 +217,11 @@ def benchmark_unpermute(
inv_perm,
) = input
_moe_unpermute_and_reduce(
output_hidden_states, permuted_hidden_states, inv_perm, topk_weights
output_hidden_states,
permuted_hidden_states,
inv_perm,
topk_weights,
True,
)
# JIT compilation & warmup


@ -71,22 +71,20 @@ def benchmark_decode(
if kv_cache_dtype.startswith("fp8"):
kv_cache, _ = to_float8(kv_cache)
output_trtllm = torch.empty(q.shape, dtype=dtype)
# Benchmark TRT decode
def trt_decode():
return flashinfer.decode.trtllm_batch_decode_with_kv_cache(
q,
kv_cache,
workspace_buffer,
num_qo_heads,
num_kv_heads,
sm_scale,
block_tables,
kv_lens_tensor,
page_size,
max_kv_len,
kv_cache_dtype,
k_scale,
v_scale,
bmm1_scale=k_scale * sm_scale,
bmm2_scale=v_scale,
out=output_trtllm,
)
def time_fn(fn, warmup=10, trials=20):
@ -125,6 +123,8 @@ def benchmark_decode(
kv_indices = torch.tensor(kv_indices, dtype=torch.int32)
kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32)
output_baseline = torch.empty(q.shape, dtype=dtype)
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
workspace_buffer,
kv_layout,
@ -145,7 +145,7 @@ def benchmark_decode(
)
def baseline_decode():
return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale)
return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale, output_baseline)
baseline_mean, baseline_std = time_fn(baseline_decode)
@ -214,25 +214,39 @@ if __name__ == "__main__":
max_seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
all_results = []
print("Running benchmark for kv_cache_dtype: bfloat16")
print(
"\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent"
"Running benchmark for q_dtype = bfloat16, kv_cache_dtype: bfloat16, "
"output_dtype: bfloat16"
)
print(
"\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t"
"baseline_std\tspeedup_percent"
)
for max_seq_len in max_seq_lens:
for bs in num_seqs:
result = benchmark_decode(
bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="auto"
bs,
max_seq_len,
dtype=torch.bfloat16,
kv_cache_dtype="auto",
)
all_results.append(result)
print("Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8")
print(
"\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent"
"Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8, "
"output_dtype: bfloat16"
)
print(
"\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t"
"baseline_std\tspeedup_percent"
)
for max_seq_len in max_seq_lens:
for bs in num_seqs:
result = benchmark_decode(
bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="fp8"
bs,
max_seq_len,
dtype=torch.bfloat16,
kv_cache_dtype="fp8",
)
all_results.append(result)


@ -58,6 +58,22 @@ function (find_isa CPUINFO TARGET OUT)
endif()
endfunction()
function(check_sysctl TARGET OUT)
execute_process(COMMAND sysctl -n "${TARGET}"
RESULT_VARIABLE SYSCTL_RET
OUTPUT_VARIABLE SYSCTL_INFO
ERROR_QUIET
OUTPUT_STRIP_TRAILING_WHITESPACE)
if(SYSCTL_RET EQUAL 0 AND
(SYSCTL_INFO STREQUAL "1" OR SYSCTL_INFO GREATER 0))
set(${OUT} ON PARENT_SCOPE)
else()
set(${OUT} OFF PARENT_SCOPE)
endif()
endfunction()
function (is_avx512_disabled OUT)
set(DISABLE_AVX512 $ENV{VLLM_CPU_DISABLE_AVX512})
if(DISABLE_AVX512 AND DISABLE_AVX512 STREQUAL "true")
@ -70,7 +86,10 @@ endfunction()
is_avx512_disabled(AVX512_DISABLED)
if (MACOSX_FOUND AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
set(APPLE_SILICON_FOUND TRUE)
message(STATUS "Apple Silicon Detected")
set(ENABLE_NUMA OFF)
check_sysctl(hw.optional.neon ASIMD_FOUND)
check_sysctl(hw.optional.arm.FEAT_BF16 ARM_BF16_FOUND)
else()
find_isa(${CPUINFO} "avx2" AVX2_FOUND)
find_isa(${CPUINFO} "avx512f" AVX512_FOUND)
@ -82,7 +101,6 @@ else()
find_isa(${CPUINFO} "S390" S390_FOUND)
endif()
if (AVX512_FOUND AND NOT AVX512_DISABLED)
list(APPEND CXX_COMPILE_FLAGS
"-mavx512f"
@ -149,9 +167,6 @@ elseif (ASIMD_FOUND)
set(MARCH_FLAGS "-march=armv8.2-a+dotprod+fp16")
endif()
list(APPEND CXX_COMPILE_FLAGS ${MARCH_FLAGS})
elseif(APPLE_SILICON_FOUND)
message(STATUS "Apple Silicon Detected")
set(ENABLE_NUMA OFF)
elseif (S390_FOUND)
message(STATUS "S390 detected")
# Check for S390 VXE support


@ -24,7 +24,7 @@
#include "attention_dtypes.h"
#include "attention_utils.cuh"
#include "cuda_compat.h"
#include "../cuda_compat.h"
#ifdef USE_ROCM
#include <hip/hip_bf16.h>


@ -16,9 +16,8 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "attention_kernels.cuh"
#include "cuda_compat.h"
#include "../cuda_compat.h"
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
@ -75,7 +74,7 @@ void paged_attention_v1_launcher(
const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
const int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int padded_max_seq_len =
DIVIDE_ROUND_UP(max_seq_len, BLOCK_SIZE) * BLOCK_SIZE;
int logits_size = padded_max_seq_len * sizeof(float);


@ -16,9 +16,8 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "attention_kernels.cuh"
#include "cuda_compat.h"
#include "../cuda_compat.h"
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
@ -79,7 +78,7 @@ void paged_attention_v2_launcher(
const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
const int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int max_num_partitions = DIVIDE_ROUND_UP(max_seq_len, PARTITION_SIZE);
int logits_size = PARTITION_SIZE * sizeof(float);
int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float);


@ -151,7 +151,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("rotary_embedding", torch::kCPU, &rotary_embedding);
// Quantization
#if defined(__AVX512F__) || defined(__aarch64__)
#if defined(__AVX512F__) || (defined(__aarch64__) && !defined(__APPLE__))
at::Tag stride_tag = at::Tag::needs_fixed_stride_order;
// Compute int8 quantized tensor for given scaling factor.


@ -4,8 +4,35 @@
#include <hip/hip_runtime.h>
#endif
#if defined(USE_ROCM) && defined(__GFX9__)
#define WARP_SIZE 64
#ifdef USE_ROCM
struct Utils {
static __host__ int get_warp_size() {
static bool is_cached = false;
static int result;
if (!is_cached) {
int device_id;
cudaDeviceProp deviceProp;
cudaGetDevice(&device_id);
cudaGetDeviceProperties(&deviceProp, device_id);
result = deviceProp.warpSize;
is_cached = true;
}
return result;
}
static __device__ constexpr int get_warp_size() {
#ifdef __GFX9__
return 64;
#else
return 32;
#endif
}
};
#define WARP_SIZE Utils::get_warp_size()
#else
#define WARP_SIZE 32
#endif


@ -10,32 +10,28 @@
void moe_permute(
const torch::Tensor& input, // [n_token, hidden]
const torch::Tensor& topk_weights, //[n_token, topk]
torch::Tensor& topk_ids, // [n_token, topk]
const torch::Tensor& topk_ids, // [n_token, topk]
const torch::Tensor& token_expert_indices, // [n_token, topk]
const std::optional<torch::Tensor>& expert_map, // [n_expert]
int64_t n_expert, int64_t n_local_expert, int64_t topk,
const std::optional<int64_t>& align_block_size,
torch::Tensor&
permuted_input, // [topk * n_token/align_block_size_m, hidden]
torch::Tensor& permuted_input, // [permuted_size, hidden]
torch::Tensor& expert_first_token_offset, // [n_local_expert + 1]
torch::Tensor& src_row_id2dst_row_id_map, // [n_token, topk]
torch::Tensor& inv_permuted_idx, // [n_token, topk]
torch::Tensor& permuted_idx, // [permute_size]
torch::Tensor& m_indices) { // [align_expand_m]
TORCH_CHECK(topk_weights.scalar_type() == at::ScalarType::Float,
"topk_weights must be float32");
TORCH_CHECK(expert_first_token_offset.scalar_type() == at::ScalarType::Long,
"expert_first_token_offset must be int64");
TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
"topk_ids must be int32");
TORCH_CHECK(token_expert_indices.scalar_type() == at::ScalarType::Int,
"token_expert_indices must be int32");
TORCH_CHECK(src_row_id2dst_row_id_map.scalar_type() == at::ScalarType::Int,
"src_row_id2dst_row_id_map must be int32");
TORCH_CHECK(inv_permuted_idx.scalar_type() == at::ScalarType::Int,
"inv_permuted_idx must be int32");
TORCH_CHECK(expert_first_token_offset.size(0) == n_local_expert + 1,
"expert_first_token_offset shape != n_local_expert+1")
TORCH_CHECK(
src_row_id2dst_row_id_map.sizes() == token_expert_indices.sizes(),
"token_expert_indices shape must be same as src_row_id2dst_row_id_map");
TORCH_CHECK(inv_permuted_idx.sizes() == token_expert_indices.sizes(),
"token_expert_indices shape must be same as inv_permuted_idx");
auto n_token = input.sizes()[0];
auto n_hidden = input.sizes()[1];
auto align_block_size_value =
@ -46,8 +42,9 @@ void moe_permute(
auto sort_workspace = torch::empty(
{sorter_size},
torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
auto copy_topk_ids = topk_ids.clone(); // copy topk_ids for preprocess
auto permuted_experts_id = torch::empty_like(topk_ids);
auto dst_row_id2src_row_id_map = torch::empty_like(src_row_id2dst_row_id_map);
auto sorted_row_idx = torch::empty_like(inv_permuted_idx);
auto align_expert_first_token_offset =
torch::zeros_like(expert_first_token_offset);
@ -67,24 +64,22 @@ void moe_permute(
const int* expert_map_ptr = get_ptr<int>(expert_map.value());
valid_num_ptr =
get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
preprocessTopkIdLauncher(get_ptr<int>(topk_ids), n_token * topk,
preprocessTopkIdLauncher(get_ptr<int>(copy_topk_ids), n_token * topk,
expert_map_ptr, n_expert, stream);
}
// expert sort topk expert id and scan expert id get expert_first_token_offset
sortAndScanExpert(get_ptr<int>(topk_ids), get_ptr<int>(token_expert_indices),
get_ptr<int>(permuted_experts_id),
get_ptr<int>(dst_row_id2src_row_id_map),
get_ptr<int64_t>(expert_first_token_offset), n_token,
n_expert, n_local_expert, topk, sorter,
get_ptr<int>(sort_workspace), stream);
sortAndScanExpert(
get_ptr<int>(copy_topk_ids), get_ptr<int>(token_expert_indices),
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
get_ptr<int64_t>(expert_first_token_offset), n_token, n_expert,
n_local_expert, topk, sorter, get_ptr<int>(sort_workspace), stream);
// dispatch expandInputRowsKernelLauncher
MOE_DISPATCH(input.scalar_type(), [&] {
expandInputRowsKernelLauncher<scalar_t>(
get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input),
get_ptr<float>(topk_weights), get_ptr<int>(permuted_experts_id),
get_ptr<int>(dst_row_id2src_row_id_map),
get_ptr<int>(src_row_id2dst_row_id_map),
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
get_ptr<int>(inv_permuted_idx), get_ptr<int>(permuted_idx),
get_ptr<int64_t>(expert_first_token_offset), n_token, valid_num_ptr,
n_hidden, topk, n_local_expert, align_block_size_value, stream);
});
@ -101,32 +96,34 @@ void moe_permute(
}
void moe_unpermute(
const torch::Tensor& permuted_hidden_states, // [n_token * topk, hidden]
const torch::Tensor& topk_weights, //[n_token, topk]
const torch::Tensor& topk_ids, // [n_token, topk]
const torch::Tensor& src_row_id2dst_row_id_map, // [n_token, topk]
const torch::Tensor& expert_first_token_offset, // [n_local_expert+1]
int64_t n_expert, int64_t n_local_expert, int64_t topk,
const torch::Tensor& permuted_hidden_states, // [n_token * topk, hidden]
const torch::Tensor& topk_weights, // [n_token, topk]
const torch::Tensor& inv_permuted_idx, // [n_token, topk]
const std::optional<torch::Tensor>&
expert_first_token_offset, // [n_local_expert+1]
int64_t topk,
torch::Tensor& hidden_states // [n_token, hidden]
) {
TORCH_CHECK(src_row_id2dst_row_id_map.sizes() == topk_ids.sizes(),
"topk_ids shape must be same as src_row_id2dst_row_id_map");
TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
"topk_ids must be int32");
TORCH_CHECK(
permuted_hidden_states.scalar_type() == hidden_states.scalar_type(),
"topk_ids dtype must be same as src_row_id2dst_row_id_map");
"permuted_hidden_states dtype must be same as hidden_states");
auto n_token = hidden_states.size(0);
auto n_hidden = hidden_states.size(1);
auto stream = at::cuda::getCurrentCUDAStream().stream();
const int64_t* valid_ptr =
get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
int64_t const* valid_ptr = nullptr;
if (expert_first_token_offset.has_value()) {
int n_local_expert = expert_first_token_offset.value().size(0) - 1;
valid_ptr =
get_ptr<int64_t>(expert_first_token_offset.value()) + n_local_expert;
}
MOE_DISPATCH(hidden_states.scalar_type(), [&] {
finalizeMoeRoutingKernelLauncher<scalar_t, scalar_t>(
get_ptr<scalar_t>(permuted_hidden_states),
get_ptr<scalar_t>(hidden_states), get_ptr<float>(topk_weights),
get_ptr<int>(src_row_id2dst_row_id_map), get_ptr<int>(topk_ids),
n_token, n_hidden, topk, valid_ptr, stream);
get_ptr<int>(inv_permuted_idx), n_token, n_hidden, topk, valid_ptr,
stream);
});
}


@ -177,7 +177,7 @@ __global__ void getMIndicesKernel(int64_t* expert_first_token_offset,
int tidx = threadIdx.x;
extern __shared__ int64_t smem_expert_first_token_offset[];
for (int i = tidx; i <= num_local_expert; i += blockDim.x) {
smem_expert_first_token_offset[tidx] = __ldg(expert_first_token_offset + i);
smem_expert_first_token_offset[i] = __ldg(expert_first_token_offset + i);
}
__syncthreads();
auto last_token_offset = smem_expert_first_token_offset[eidx + 1];


@ -57,31 +57,19 @@ void sortAndScanExpert(int* expert_for_source_row, const int* source_rows,
template <typename T>
void expandInputRowsKernelLauncher(
T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
int num_local_experts, const int& align_block_size, cudaStream_t stream);
// Final kernel to unpermute and scale
// This kernel unpermutes the original data, does the k-way reduction and
// performs the final skip connection.
template <typename T, typename OutputType, bool CHECK_SKIPPED>
__global__ void finalizeMoeRoutingKernel(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
int64_t const* num_valid_ptr);
template <class T, class OutputType>
void finalizeMoeRoutingKernelLauncher(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int const* expert_for_source_row, int64_t const num_rows,
int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
cudaStream_t stream);
int64_t const num_rows, int64_t const cols, int64_t const k,
int64_t const* num_valid_ptr, cudaStream_t stream);
void preprocessTopkIdLauncher(int* topk_id_ptr, int size,
const int* expert_map_ptr, int num_experts,


@ -2,10 +2,9 @@
template <typename T, bool CHECK_SKIPPED, bool ALIGN_BLOCK_SIZE>
__global__ void expandInputRowsKernel(
T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_dest_rows, int64_t const cols, int64_t k,
int num_local_experts, int align_block_size) {
@ -54,6 +53,10 @@ __global__ void expandInputRowsKernel(
assert(expanded_dest_row <= INT32_MAX);
expanded_source_row_to_expanded_dest_row[expanded_source_row] =
static_cast<int>(expanded_dest_row);
// skip non local expert token
if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
permuted_idx[expanded_dest_row] = expanded_source_row;
}
}
if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
@ -62,7 +65,7 @@ __global__ void expandInputRowsKernel(
using DataElem = cutlass::Array<T, ELEM_PER_THREAD>;
// Duplicate and permute rows
int64_t const source_row = expanded_source_row % num_rows;
int64_t const source_row = expanded_source_row / k;
auto const* source_row_ptr =
reinterpret_cast<DataElem const*>(unpermuted_input + source_row * cols);
@ -82,10 +85,9 @@ __global__ void expandInputRowsKernel(
template <typename T>
void expandInputRowsKernelLauncher(
T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
int num_local_experts, const int& align_block_size, cudaStream_t stream) {
@ -105,11 +107,11 @@ void expandInputRowsKernelLauncher(
int64_t smem_size = sizeof(int64_t) * (num_local_experts + 1);
func<<<blocks, threads, smem_size, stream>>>(
unpermuted_input, permuted_output, unpermuted_scales, sorted_experts,
unpermuted_input, permuted_output, sorted_experts,
expanded_dest_row_to_expanded_source_row,
expanded_source_row_to_expanded_dest_row, expert_first_token_offset,
num_rows, num_valid_tokens_ptr, cols, k, num_local_experts,
align_block_size);
expanded_source_row_to_expanded_dest_row, permuted_idx,
expert_first_token_offset, num_rows, num_valid_tokens_ptr, cols, k,
num_local_experts, align_block_size);
}
template <class T, class U>
@ -128,11 +130,9 @@ template <typename T, typename OutputType, bool CHECK_SKIPPED>
__global__ void finalizeMoeRoutingKernel(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
int64_t const* num_valid_ptr) {
int64_t const orig_cols, int64_t const k, int64_t const* num_valid_ptr) {
assert(orig_cols % 4 == 0);
int64_t const original_row = blockIdx.x;
int64_t const num_rows = gridDim.x;
auto const offset = original_row * orig_cols;
OutputType* reduced_row_ptr = reduced_unpermuted_output + offset;
int64_t const num_valid = *num_valid_ptr;
@ -159,14 +159,13 @@ __global__ void finalizeMoeRoutingKernel(
ComputeElem thread_output;
thread_output.fill(0);
for (int k_idx = 0; k_idx < k; ++k_idx) {
int64_t const expanded_original_row = original_row + k_idx * num_rows;
int64_t const expanded_original_row = original_row * k + k_idx;
int64_t const expanded_permuted_row =
expanded_source_row_to_expanded_dest_row[expanded_original_row];
int64_t const k_offset = original_row * k + k_idx;
float const row_scale = scales[k_offset];
// Check after row_rescale has accumulated
if (CHECK_SKIPPED && expanded_permuted_row >= num_valid) {
continue;
}
@ -189,9 +188,8 @@ template <class T, class OutputType>
void finalizeMoeRoutingKernelLauncher(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int const* expert_for_source_row, int64_t const num_rows,
int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
cudaStream_t stream) {
int64_t const num_rows, int64_t const cols, int64_t const k,
int64_t const* num_valid_ptr, cudaStream_t stream) {
int64_t const blocks = num_rows;
int64_t const threads = 256;
bool const check_finished = num_valid_ptr != nullptr;
@ -201,6 +199,5 @@ void finalizeMoeRoutingKernelLauncher(
auto* const kernel = func_map[check_finished];
kernel<<<blocks, threads, 0, stream>>>(
expanded_permuted_rows, reduced_unpermuted_output, scales,
expanded_source_row_to_expanded_dest_row, expert_for_source_row, cols, k,
num_valid_ptr);
expanded_source_row_to_expanded_dest_row, cols, k, num_valid_ptr);
}


@ -190,8 +190,8 @@ __launch_bounds__(TPB) __global__ void moeTopK(
2) This implementation assumes k is small, but will work for any k.
*/
template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, typename IndType>
__launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename IndType>
__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices,
int* source_rows, const int k, const int start_expert, const int end_expert)
{
@ -209,12 +209,12 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
// Restrictions based on previous section.
static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg");
static_assert(WARP_SIZE % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
static_assert(THREADS_PER_ROW == (THREADS_PER_ROW & -THREADS_PER_ROW), "THREADS_PER_ROW must be power of 2");
static_assert(THREADS_PER_ROW <= WARP_SIZE, "THREADS_PER_ROW can be at most warp size");
static_assert(THREADS_PER_ROW <= WARP_SIZE_PARAM, "THREADS_PER_ROW can be at most warp size");
// We have NUM_EXPERTS elements per row. We specialize for small #experts
static constexpr int ELTS_PER_WARP = WARP_SIZE * VPT;
static constexpr int ELTS_PER_WARP = WARP_SIZE_PARAM * VPT;
static constexpr int ROWS_PER_WARP = ELTS_PER_WARP / ELTS_PER_ROW;
static constexpr int ROWS_PER_CTA = WARPS_PER_CTA * ROWS_PER_WARP;
@ -393,41 +393,51 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
namespace detail
{
// Constructs some constants needed to partition the work across threads at compile time.
template <int EXPERTS, int BYTES_PER_LDG>
template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM>
struct TopkConstants
{
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float);
static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE) == 0, "");
static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE));
static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, "");
static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM));
static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
static constexpr int THREADS_PER_ROW = EXPERTS / VPT;
static constexpr int ROWS_PER_WARP = WARP_SIZE / THREADS_PER_ROW;
static const int ROWS_PER_WARP = WARP_SIZE_PARAM / THREADS_PER_ROW;
};
} // namespace detail
template <int EXPERTS, int WARPS_PER_TB, typename IndType>
template <int EXPERTS, int WARPS_PER_TB, int WARP_SIZE_PARAM, typename IndType>
void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices,
int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream)
{
static constexpr std::size_t MAX_BYTES_PER_LDG = 16;
static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS);
using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG>;
using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG, WARP_SIZE_PARAM>;
static constexpr int VPT = Constants::VPT;
static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP;
const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP;
const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB;
dim3 block_dim(WARP_SIZE, WARPS_PER_TB);
topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG><<<num_blocks, block_dim, 0, stream>>>(
dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB);
topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG, WARP_SIZE_PARAM><<<num_blocks, block_dim, 0, stream>>>(
input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert);
}
#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB>( \
gating_output, nullptr, topk_weights, topk_indices, \
token_expert_indices, num_tokens, topk, 0, num_experts, \
stream);
#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \
switch (warpSize) { \
case 32: \
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 32>( \
gating_output, nullptr, topk_weights, topk_indices, \
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
break; \
case 64: \
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 64>( \
gating_output, nullptr, topk_weights, topk_indices, \
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
break; \
default: \
TORCH_CHECK(false, "Unsupported warp size: ", warpSize); \
}
template <typename IndType>
void topkGatingSoftmaxKernelLauncher(
@ -441,6 +451,7 @@ void topkGatingSoftmaxKernelLauncher(
const int topk,
cudaStream_t stream) {
static constexpr int WARPS_PER_TB = 4;
auto warpSize = WARP_SIZE;
switch (num_experts) {
case 1:
LAUNCH_SOFTMAX(1, WARPS_PER_TB);


@ -56,18 +56,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
" -> Tensor");
m.def(
"moe_permute(Tensor input, Tensor topk_weight, Tensor! topk_ids,"
"moe_permute(Tensor input, Tensor topk_ids,"
"Tensor token_expert_indices, Tensor? expert_map, int n_expert,"
"int n_local_expert,"
"int topk, int? align_block_size,Tensor! permuted_input, Tensor! "
"expert_first_token_offset, Tensor! src_row_id2dst_row_id_map, Tensor! "
"m_indices)->()");
"expert_first_token_offset, Tensor! inv_permuted_idx, Tensor! "
"permuted_idx, Tensor! m_indices)->()");
m.def(
"moe_unpermute(Tensor permuted_hidden_states, Tensor topk_weights,"
"Tensor topk_ids,Tensor src_row_id2dst_row_id_map, Tensor "
"expert_first_token_offset, int n_expert, int n_local_expert,int "
"topk, Tensor! hidden_states)->()");
"Tensor inv_permuted_idx, Tensor? expert_first_token_offset, "
"int topk, Tensor! hidden_states)->()");
m.def("moe_permute_unpermute_supported() -> bool");
m.impl("moe_permute_unpermute_supported", &moe_permute_unpermute_supported);


@ -292,6 +292,11 @@ void per_token_group_quant_fp8(const torch::Tensor& input,
torch::Tensor& output_q, torch::Tensor& output_s,
int64_t group_size, double eps, double fp8_min,
double fp8_max, bool scale_ue8m0);
void per_token_group_quant_int8(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double int8_min, double int8_max);
#endif
void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input,


@ -4,7 +4,7 @@
#include <cmath>
#include "core/math.hpp"
#include "cuda_compat.h"
#include "../cuda_compat.h"
#include "dispatch_utils.h"
#include "quantization/fp8/common.cuh"


@ -1,6 +1,10 @@
#include <ATen/cuda/CUDAContext.h>
#include <torch/all.h>
#ifndef USE_ROCM
#include "../per_token_group_quant_8bit.h"
#endif
#include <cmath>
#include "../../dispatch_utils.h"
@ -336,3 +340,13 @@ void dynamic_scaled_int8_quant(
}
});
}
#ifndef USE_ROCM
void per_token_group_quant_int8(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double int8_min, double int8_max) {
per_token_group_quant_8bit(input, output_q, output_s, group_size, eps,
int8_min, int8_max);
}
#endif


@ -1,6 +1,5 @@
#include "scaled_mm_kernels.hpp"
#include "scaled_mm_sm90_fp8_dispatch.cuh"
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
namespace vllm {
@ -13,11 +12,11 @@ void cutlass_scaled_mm_sm90_fp8(torch::Tensor& out, torch::Tensor const& a,
if (bias) {
TORCH_CHECK(bias->dtype() == out.dtype(),
"currently bias dtype must match output dtype ", out.dtype());
return cutlass_scaled_mm_sm90_fp8_epilogue<c3x::ScaledEpilogueBias>(
out, a, b, a_scales, b_scales, *bias);
return cutlass_scaled_mm_sm90_fp8_epilogue<true>(out, a, b, a_scales,
b_scales, *bias);
} else {
return cutlass_scaled_mm_sm90_fp8_epilogue<c3x::ScaledEpilogue>(
out, a, b, a_scales, b_scales);
return cutlass_scaled_mm_sm90_fp8_epilogue<false>(out, a, b, a_scales,
b_scales);
}
}

View File

@ -2,6 +2,7 @@
#include "scaled_mm.cuh"
#include "cutlass_gemm_caller.cuh"
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
/**
* This file defines Gemm kernel configurations for SM90 (fp8) based on the Gemm
@ -12,8 +13,91 @@ namespace vllm {
using c3x::cutlass_gemm_caller;
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
template <typename ElementAB_, typename ElementD_,
template <typename, typename, typename> typename Epilogue_,
typename TileShape, typename ClusterShape, typename KernelSchedule,
typename EpilogueSchedule, bool swap_ab_ = false>
struct cutlass_3x_gemm_sm90_fp8 {
using ElementAB = ElementAB_;
using ElementC = ElementD_;
using ElementD = ElementD_;
using ElementAcc =
typename std::conditional<std::is_same_v<ElementAB, int8_t>, int32_t,
float>::type;
using Epilogue = Epilogue_<ElementAcc, ElementD, TileShape>;
using EVTCompute = typename Epilogue::EVTCompute;
static constexpr int AlignmentAB =
128 / cutlass::sizeof_bits<ElementAB>::value;
static constexpr int AlignmentCD =
128 / cutlass::sizeof_bits<ElementD>::value;
// Compile-time swap_ab flag
static constexpr bool swap_ab = swap_ab_;
// -----------------------------------------------------------
// Layout definitions
// -----------------------------------------------------------
using LayoutA = cutlass::layout::RowMajor;
using LayoutA_T = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutB_T = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
using LayoutD = cutlass::layout::RowMajor;
using LayoutD_Transpose =
typename cutlass::layout::LayoutTranspose<LayoutD>::type;
using LayoutC = LayoutD;
using LayoutC_Transpose = LayoutD_Transpose;
// -----------------------------------------------------------
// Collective epilogue (conditionally swap operands and layouts)
// -----------------------------------------------------------
using CollectiveEpilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, TileShape,
ClusterShape, cutlass::epilogue::collective::EpilogueTileAuto,
ElementAcc, float, ElementC,
conditional_t<swap_ab, LayoutC_Transpose, LayoutC>, AlignmentCD,
ElementD, conditional_t<swap_ab, LayoutD_Transpose, LayoutD>,
AlignmentCD, EpilogueSchedule, EVTCompute>::CollectiveOp;
static constexpr size_t CEStorageSize =
sizeof(typename CollectiveEpilogue::SharedStorage);
using Stages = typename cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(CEStorageSize)>;
// -----------------------------------------------------------
// Collective mainloop (conditionally swap operands and layouts)
// -----------------------------------------------------------
using CollectiveMainloop = conditional_t<
swap_ab,
typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, ElementAB,
LayoutB_T, AlignmentAB, // Swapped B (as A)
ElementAB, LayoutA_T, AlignmentAB, // Swapped A (as B)
ElementAcc, TileShape, ClusterShape, Stages,
KernelSchedule>::CollectiveOp,
typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, ElementAB,
LayoutA, AlignmentAB, ElementAB, LayoutB, AlignmentAB, ElementAcc,
TileShape, ClusterShape, Stages, KernelSchedule>::CollectiveOp>;
// -----------------------------------------------------------
// Kernel definition
// -----------------------------------------------------------
using KernelType = enable_sm90_or_later<cutlass::gemm::kernel::GemmUniversal<
cute::Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue,
cutlass::gemm::PersistentScheduler>>;
struct GemmKernel : public KernelType {};
};
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_default {
// M in (128, inf)
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
@ -22,13 +106,17 @@ struct sm90_fp8_config_default {
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_128, _128, _128>;
using ClusterShape = Shape<_2, _1, _1>;
using Cutlass3xGemm =
cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
KernelSchedule, EpilogueSchedule>;
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule>>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_M128 {
// M in (64, 128]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
@ -37,33 +125,146 @@ struct sm90_fp8_config_M128 {
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_64, _128, _128>;
using ClusterShape = Shape<_2, _1, _1>;
using Cutlass3xGemm =
cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
KernelSchedule, EpilogueSchedule>;
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule>>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm90_fp8_config_M64 {
// M in [1, 64]
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_M64_N1280 {
// M in (16, 64], N in [1 1280]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule =
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum;
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_64, _64, _128>;
using ClusterShape = Shape<_1, _8, _1>;
using TileShape = Shape<_64, _16, _256>;
using ClusterShape = Shape<_1, _4, _1>;
using Cutlass3xGemm =
cutlass_3x_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
KernelSchedule, EpilogueSchedule>;
// enable swap AB for M < 64
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule, true>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue,
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_M64_N8192 {
// M in (16, 64], N > 1280
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_64, _64, _256>;
using ClusterShape = Shape<_1, _1, _1>;
// enable swap AB for M < 64
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule, true>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>>;
};
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_M16_N1280 {
// M in [1, 16], N in [1, 1280]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_64, _16, _256>;
using ClusterShape = Shape<_1, _2, _1>;
// enable swap AB for M < 64
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule, true>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>>;
};
template <typename InType, typename OutType, bool EnableBias>
struct sm90_fp8_config_M16_N8192 {
// M in [1, 16], N > 1280
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum;
using EpilogueSchedule = typename cutlass::epilogue::TmaWarpSpecialized;
using TileShape = Shape<_64, _16, _256>;
using ClusterShape = Shape<_1, _1, _1>;
// enable swap AB for M < 64
using Cutlass3xGemm = conditional_t<
EnableBias,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogueColumnBias,
TileShape, ClusterShape, KernelSchedule,
EpilogueSchedule, true>,
cutlass_3x_gemm_sm90_fp8<InType, OutType, c3x::ScaledEpilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>>;
};
template <typename Gemm, typename... EpilogueArgs>
void cutlass_gemm_caller_sm90_fp8(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b,
EpilogueArgs&&... epilogue_params) {
static constexpr bool swap_ab = Gemm::swap_ab;
using ElementAB = typename Gemm::ElementAB;
using ElementD = typename Gemm::ElementD;
using GemmKernel = typename Gemm::GemmKernel;
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
int32_t m = a.size(0), n = b.size(1), k = a.size(1);
auto prob_shape =
swap_ab ? cute::make_shape(n, m, k, 1) : cute::make_shape(m, n, k, 1);
StrideA a_stride =
cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(m, k, 1));
StrideB b_stride =
cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(n, k, 1));
StrideC c_stride = cutlass::make_cute_packed_stride(
StrideC{},
swap_ab ? cute::make_shape(n, m, 1) : cute::make_shape(m, n, 1));
auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
auto c_ptr = static_cast<ElementD*>(out.data_ptr());
typename GemmKernel::MainloopArguments mainloop_args =
swap_ab ? typename GemmKernel::MainloopArguments{b_ptr, b_stride, a_ptr,
a_stride}
: typename GemmKernel::MainloopArguments{a_ptr, a_stride, b_ptr,
b_stride};
typename GemmKernel::EpilogueArguments epilogue_args{
Gemm::Epilogue::prepare_args(
std::forward<EpilogueArgs>(epilogue_params)...),
c_ptr, c_stride, c_ptr, c_stride};
c3x::cutlass_gemm_caller<GemmKernel>(a.device(), prob_shape, mainloop_args,
epilogue_args);
}
template <typename InType, typename OutType, bool EnableBias,
typename... EpilogueArgs>
inline void cutlass_gemm_sm90_fp8_dispatch(torch::Tensor& out,
torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
EpilogueArgs&&... args) {
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
@ -71,50 +272,75 @@ inline void cutlass_gemm_sm90_fp8_dispatch(torch::Tensor& out,
using Cutlass3xGemmDefault =
typename sm90_fp8_config_default<InType, OutType,
Epilogue>::Cutlass3xGemm;
using Cutlass3xGemmM64 =
typename sm90_fp8_config_M64<InType, OutType, Epilogue>::Cutlass3xGemm;
EnableBias>::Cutlass3xGemm;
using Cutlass3xGemmM128 =
typename sm90_fp8_config_M128<InType, OutType, Epilogue>::Cutlass3xGemm;
typename sm90_fp8_config_M128<InType, OutType, EnableBias>::Cutlass3xGemm;
using Cutlass3xGemmM64_N1280 =
typename sm90_fp8_config_M64_N1280<InType, OutType,
EnableBias>::Cutlass3xGemm;
using Cutlass3xGemmM64_N8192 =
typename sm90_fp8_config_M64_N8192<InType, OutType,
EnableBias>::Cutlass3xGemm;
using Cutlass3xGemmM16_N1280 =
typename sm90_fp8_config_M16_N1280<InType, OutType,
EnableBias>::Cutlass3xGemm;
using Cutlass3xGemmM16_N8192 =
typename sm90_fp8_config_M16_N8192<InType, OutType,
EnableBias>::Cutlass3xGemm;
uint32_t const m = a.size(0);
uint32_t const mp2 =
std::max(static_cast<uint32_t>(64), next_pow_2(m)); // next power of 2
uint32_t const n = b.size(1);
if (mp2 <= 64) {
// m in [1, 64]
return cutlass_gemm_caller<Cutlass3xGemmM64>(
out, a, b, std::forward<EpilogueArgs>(args)...);
} else if (mp2 <= 128) {
if (m <= 16) {
// m in [1, 16]
if (n <= 1280) {
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM16_N1280>(
out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
}
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM16_N8192>(
out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
} else if (m <= 64) {
// m in (16, 64]
if (n <= 1280) {
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM64_N1280>(
out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
}
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM64_N8192>(
out, a, b, b_scales, a_scales, std::forward<EpilogueArgs>(args)...);
} else if (m <= 128) {
// m in (64, 128]
return cutlass_gemm_caller<Cutlass3xGemmM128>(
out, a, b, std::forward<EpilogueArgs>(args)...);
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmM128>(
out, a, b, a_scales, b_scales, std::forward<EpilogueArgs>(args)...);
} else {
// m in (128, inf)
return cutlass_gemm_caller<Cutlass3xGemmDefault>(
out, a, b, std::forward<EpilogueArgs>(args)...);
return cutlass_gemm_caller_sm90_fp8<Cutlass3xGemmDefault>(
out, a, b, a_scales, b_scales, std::forward<EpilogueArgs>(args)...);
}
}
template <template <typename, typename, typename> typename Epilogue,
typename... EpilogueArgs>
template <bool EnableBias, typename... EpilogueArgs>
void cutlass_scaled_mm_sm90_fp8_epilogue(torch::Tensor& out,
torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
EpilogueArgs&&... epilogue_args) {
TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b.dtype() == torch::kFloat8_e4m3fn);
if (out.dtype() == torch::kBFloat16) {
return cutlass_gemm_sm90_fp8_dispatch<cutlass::float_e4m3_t,
cutlass::bfloat16_t, Epilogue>(
out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
cutlass::bfloat16_t, EnableBias>(
out, a, b, a_scales, b_scales,
std::forward<EpilogueArgs>(epilogue_args)...);
} else {
TORCH_CHECK(out.dtype() == torch::kFloat16);
return cutlass_gemm_sm90_fp8_dispatch<cutlass::float_e4m3_t,
cutlass::half_t, Epilogue>(
out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
cutlass::half_t, EnableBias>(
out, a, b, a_scales, b_scales,
std::forward<EpilogueArgs>(epilogue_args)...);
}
}
} // namespace vllm
} // namespace vllm

View File

@ -47,13 +47,12 @@ __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids,
__global__ void compute_expert_offsets(
const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
int32_t* atomic_buffer, const int num_experts, const int topk_length) {
int32_t* atomic_buffer, const int num_experts, const bool swap_ab) {
int32_t tot_offset = 0;
expert_offsets[0] = 0;
for (int i = 0; i < num_experts; ++i) {
atomic_buffer[i] = tot_offset;
tot_offset += topk_length > SWAP_AB_THRESHOLD ? problem_sizes1[i * 3]
: problem_sizes1[i * 3 + 1];
tot_offset += swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3];
expert_offsets[i + 1] = tot_offset;
}
}
@ -61,15 +60,14 @@ __global__ void compute_expert_offsets(
__global__ void compute_expert_blockscale_offsets(
const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
int32_t* blockscale_offsets, int32_t* atomic_buffer, const int num_experts,
const int topk_length) {
const bool swap_ab) {
int32_t tot_offset = 0;
int32_t tot_offset_round = 0;
expert_offsets[0] = 0;
blockscale_offsets[0] = 0;
for (int i = 0; i < num_experts; ++i) {
int32_t cur_offset = topk_length > SWAP_AB_THRESHOLD
? problem_sizes1[i * 3]
: problem_sizes1[i * 3 + 1];
int32_t cur_offset =
swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3];
atomic_buffer[i] = tot_offset;
tot_offset += cur_offset;
expert_offsets[i + 1] = tot_offset;
@ -119,15 +117,19 @@ void get_cutlass_moe_mm_data_caller(
int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel());
if (topk_ids.numel() > SWAP_AB_THRESHOLD) {
compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
// Swap-AB should be disabled for FP4 path
bool may_swap_ab = (!blockscale_offsets.has_value()) &&
(topk_ids.numel() <= SWAP_AB_THRESHOLD);
if (may_swap_ab) {
compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()),
static_cast<int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(problem_sizes2.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), topk_ids.numel(), n,
k);
} else {
compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()),
static_cast<int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(problem_sizes2.data_ptr()),
@ -136,18 +138,19 @@ void get_cutlass_moe_mm_data_caller(
}
if (blockscale_offsets.has_value()) {
// fp4 path
compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>(
static_cast<const int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(expert_offsets.data_ptr()),
static_cast<int32_t*>(blockscale_offsets.value().data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts,
topk_ids.numel());
may_swap_ab);
} else {
compute_expert_offsets<<<1, 1, 0, stream>>>(
static_cast<const int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(expert_offsets.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts,
topk_ids.numel());
may_swap_ab);
}
compute_arg_sorts<<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()),

View File

@ -1,6 +1,8 @@
#include <ATen/cuda/CUDAContext.h>
#include <c10/util/Float8_e4m3fn.h>
#include "../per_token_group_quant_8bit.h"
#include <cmath>
#include <cuda_fp16.h>
@ -120,7 +122,7 @@ void per_token_group_quant_8bit(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double min_8bit, double max_8bit,
bool scale_ue8m0 = false) {
bool scale_ue8m0) {
TORCH_CHECK(input.is_contiguous());
TORCH_CHECK(output_q.is_contiguous());
@ -198,6 +200,8 @@ void per_token_group_quant_8bit(const torch::Tensor& input,
input.scalar_type(), "per_token_group_quant_8bit", ([&] {
if (dst_type == at::ScalarType::Float8_e4m3fn) {
LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn);
} else if (dst_type == at::ScalarType::Char) {
LAUNCH_KERNEL(scalar_t, int8_t);
}
}));

View File

@ -4,7 +4,7 @@
#include <torch/all.h>
#include <c10/cuda/CUDAGuard.h>
#include "cuda_compat.h"
#include "../../cuda_compat.h"
#include "dispatch_utils.h"
#include "ggml-common.h"

View File

@ -187,8 +187,12 @@ struct PrepackedLayoutBTemplate {
CUTE_HOST_DEVICE static constexpr auto TVbNbKL_to_offset_copy(
Shape_NKL shape_mkl) {
auto layout = TVbNbKL_to_offset(shape_mkl);
return make_layout(coalesce(get<0>(layout)), get<1>(layout),
get<2>(layout));
// for 4-bit elements, having >= 64 values per column
// allows TMA to load full 32-byte sectors
auto inner_layout =
make_layout(make_shape(_256{}, size<0>(layout) / _256{}));
return make_layout(inner_layout, get<1>(layout), get<2>(layout));
}
// ((BlockN, BlockK), (BlocksN, BlocksK), L) -> (storage_idx)

View File

@ -0,0 +1,10 @@
#pragma once
#include <torch/all.h>
// TODO(wentao): refactor the folder to 8bit, then includes fp8 and int8 folders
// 8-bit per-token-group quantization helper used by both FP8 and INT8
void per_token_group_quant_8bit(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double min_8bit, double max_8bit,
bool scale_ue8m0 = false);

View File

@ -19,7 +19,7 @@
#include <c10/cuda/CUDAGuard.h>
#include <hip/hip_fp8.h>
#include <hip/hip_bf16.h>
#include "cuda_compat.h"
#include "../cuda_compat.h"
#include <algorithm>
#include "../attention/dtype_fp8.cuh"

View File

@ -9,7 +9,7 @@
#include <stdexcept>
#include <algorithm>
#include "cuda_compat.h"
#include "../cuda_compat.h"
#include "dispatch_utils.h"
#include "quantization/fp8/common.cuh"

View File

@ -624,6 +624,14 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("per_token_group_fp8_quant", torch::kCUDA,
&per_token_group_quant_fp8);
// Compute per-token-group INT8 quantized tensor and scaling factor.
ops.def(
"per_token_group_quant_int8(Tensor input, Tensor! output_q, Tensor! "
"output_s, int group_size, float eps, float int8_min, float int8_max) -> "
"()");
ops.impl("per_token_group_quant_int8", torch::kCUDA,
&per_token_group_quant_int8);
// reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel
ops.def(
"rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, "

View File

@ -209,16 +209,7 @@ ARG SCCACHE_REGION_NAME=us-west-2
ARG SCCACHE_S3_NO_CREDENTIALS=0
# Flag to control whether to use pre-built vLLM wheels
ARG VLLM_USE_PRECOMPILED
# TODO: in setup.py VLLM_USE_PRECOMPILED is sensitive to truthiness, it will take =0 as "true", this should be fixed
ENV VLLM_USE_PRECOMPILED=""
RUN if [ "${VLLM_USE_PRECOMPILED}" = "1" ]; then \
export VLLM_USE_PRECOMPILED=1 && \
echo "Using precompiled wheels"; \
else \
unset VLLM_USE_PRECOMPILED && \
echo "Leaving VLLM_USE_PRECOMPILED unset to build wheels from source"; \
fi
ARG VLLM_USE_PRECOMPILED=""
# if USE_SCCACHE is set, use sccache to speed up compilation
RUN --mount=type=cache,target=/root/.cache/uv \
@ -235,6 +226,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
&& export SCCACHE_IDLE_TIMEOUT=0 \
&& export CMAKE_BUILD_TYPE=Release \
&& export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" \
&& export VLLM_DOCKER_BUILD_CONTEXT=1 \
&& sccache --show-stats \
&& python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
&& sccache --show-stats; \
@ -248,9 +241,22 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
# Clean any existing CMake artifacts
rm -rf .deps && \
mkdir -p .deps && \
export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" && \
export VLLM_DOCKER_BUILD_CONTEXT=1 && \
python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
fi
# When using precompiled wheels, keep only the newest manylinux1 wheel and delete others
RUN if [ "$VLLM_USE_PRECOMPILED" = "1" ]; then \
echo "Cleaning up extra wheels in dist/..." && \
# Identify the most recent manylinux1_x86_64 wheel
KEEP_WHEEL=$(ls -t dist/*manylinux1_x86_64.whl 2>/dev/null | head -n1) && \
if [ -n "$KEEP_WHEEL" ]; then \
echo "Keeping wheel: $KEEP_WHEEL"; \
find dist/ -type f -name "*.whl" ! -path "${KEEP_WHEEL}" -delete; \
fi; \
fi
# Check the size of the wheel if RUN_WHEEL_CHECK is true
COPY .buildkite/check-wheel-size.py check-wheel-size.py
# sync the default value with .buildkite/check-wheel-size.py
@ -262,6 +268,15 @@ RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
else \
echo "Skipping wheel size check."; \
fi
#################### FLASHINFER WHEEL BUILD IMAGE ####################
FROM base AS flashinfer-wheel-builder
ARG CUDA_VERSION
COPY tools/flashinfer-build.sh /tmp/flashinfer-build.sh
RUN --mount=type=cache,target=/root/.cache/uv \
. /etc/environment && \
BUILD_WHEEL=true /tmp/flashinfer-build.sh
#################### EXTENSION Build IMAGE ####################
#################### DEV IMAGE ####################
@ -276,10 +291,6 @@ ARG PYTORCH_CUDA_INDEX_BASE_URL
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Workaround for #17068
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
COPY requirements/lint.txt requirements/lint.txt
COPY requirements/test.txt requirements/test.txt
COPY requirements/dev.txt requirements/dev.txt
@ -389,33 +400,11 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
# $ # upload the wheel to a public location, e.g. https://wheels.vllm.ai/flashinfer/v0.2.6.post1/flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl
# Install FlashInfer from source
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
ARG FLASHINFER_GIT_REF="v0.2.8rc1"
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
. /etc/environment
git clone --depth 1 --recursive --shallow-submodules \
--branch ${FLASHINFER_GIT_REF} \
${FLASHINFER_GIT_REPO} flashinfer
# Exclude CUDA arches for older versions (11.x and 12.0-12.7)
# TODO: Update this to allow setting TORCH_CUDA_ARCH_LIST as a build arg.
if [[ "${CUDA_VERSION}" == 11.* ]]; then
FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9"
elif [[ "${CUDA_VERSION}" == 12.[0-7]* ]]; then
FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a"
else
# CUDA 12.8+ supports 10.0a and 12.0
FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 12.0"
fi
echo "🏗️ Building FlashInfer for arches: ${FI_TORCH_CUDA_ARCH_LIST}"
# Needed to build AOT kernels
pushd flashinfer
TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
python3 -m flashinfer.aot
TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
uv pip install --system --no-build-isolation .
popd
rm -rf flashinfer
BASH
# Version controlled in tools/flashinfer-build.sh - keep in sync with requirements/cuda.txt
COPY tools/flashinfer-build.sh /tmp/flashinfer-build.sh
RUN --mount=type=cache,target=/root/.cache/uv \
. /etc/environment && \
/tmp/flashinfer-build.sh
COPY examples examples
COPY benchmarks benchmarks
COPY ./vllm/collect_env.py .
@ -452,10 +441,6 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Workaround for #17068
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \

View File

@ -1,62 +0,0 @@
# This vLLM Dockerfile is used to construct an image that can build and run vLLM on ARM CPU platform.
FROM ubuntu:22.04 AS cpu-test-arm
ENV CCACHE_DIR=/root/.cache/ccache
ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache
RUN --mount=type=cache,target=/var/cache/apt \
apt-get update -y \
&& apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects.
RUN --mount=type=cache,target=/root/.cache/pip \
pip install py-cpuinfo # Use this to gather CPU info and optimize based on ARM Neoverse cores
# Set LD_PRELOAD for tcmalloc on ARM
ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
RUN echo 'ulimit -c 0' >> ~/.bashrc
WORKDIR /workspace
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements/build.txt,target=requirements/build.txt \
pip install --upgrade pip && \
pip install -r requirements/build.txt
FROM cpu-test-arm AS build
WORKDIR /workspace/vllm
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements/common.txt,target=requirements/common.txt \
--mount=type=bind,src=requirements/cpu.txt,target=requirements/cpu.txt \
pip install -v -r requirements/cpu.txt
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Disabling AVX512 specific optimizations for ARM
ARG VLLM_CPU_DISABLE_AVX512="true"
ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/ccache \
--mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \
pip install dist/*.whl && \
rm -rf dist
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,4 +1,11 @@
# This vLLM Dockerfile is used to construct image that can build and run vLLM on x86 CPU platform.
# This vLLM Dockerfile is used to build images that can run vLLM on both x86_64 and arm64 CPU platforms.
#
# Supported platforms:
# - linux/amd64 (x86_64)
# - linux/arm64 (aarch64)
#
# Use the `--platform` option with `docker buildx build` to specify the target architecture, e.g.:
# docker buildx build --platform=linux/arm64 -f docker/Dockerfile.cpu .
#
# Build targets:
# vllm-openai (default): used for serving deployment
@ -12,16 +19,14 @@
# VLLM_CPU_AVX512VNNI=false (default)|true
#
######################### BASE IMAGE #########################
FROM ubuntu:22.04 AS base
######################### COMMON BASE IMAGE #########################
FROM ubuntu:22.04 AS base-common
WORKDIR /workspace/
ARG PYTHON_VERSION=3.12
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV LD_PRELOAD=""
# Install minimal dependencies and uv
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
@ -53,7 +58,21 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --upgrade pip && \
uv pip install -r requirements/cpu.txt
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so:$LD_PRELOAD"
ARG TARGETARCH
ENV TARGETARCH=${TARGETARCH}
######################### x86_64 BASE IMAGE #########################
FROM base-common AS base-amd64
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so"
######################### arm64 BASE IMAGE #########################
FROM base-common AS base-arm64
ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
######################### BASE IMAGE #########################
FROM base-${TARGETARCH} AS base
RUN echo 'ulimit -c 0' >> ~/.bashrc

View File

@ -1,4 +1,4 @@
ARG NIGHTLY_DATE="20250714"
ARG NIGHTLY_DATE="20250724"
ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE"
FROM $BASE_IMAGE

View File

@ -56,9 +56,7 @@ nav:
- contributing/model/tests.md
- contributing/model/multimodal.md
- CI: contributing/ci
- Design Documents:
- V0: design
- V1: design/v1
- Design Documents: design
- API Reference:
- Summary: api/README.md
- Contents:

Binary image files not shown in the diff: one existing image updated (119 KiB → 143 KiB) and five new images added (187 KiB, 189 KiB, 227 KiB, 128 KiB, 12 KiB).

docs/configuration/tpu.md (new file, 104 lines)
View File

@ -0,0 +1,104 @@
# TPU Optimization Tips
This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload.
## Get started
Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
### TPU workload sizing
When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed.
The following colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you:
- KV cache size requirement per token and per request
- TPU/GPU memory consumed by the model weights
- TPU/GPU memory allocated for the KV cache
- Maximum \# of requests you can approximately set (--max-num-seqs)
This approach serves as a general rule of thumb.
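As a rough illustration of the arithmetic behind the calculator, the sketch below estimates KV cache bytes per token and the approximate number of concurrent requests that fit in the remaining HBM. The model dimensions and memory figures are made-up example values, not a real model config, so treat it as a back-of-the-envelope aid only.

```python
# Back-of-the-envelope KV cache sizing (illustrative numbers, not a real config).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Both K and V are stored for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
avg_context_len = 2048                   # average prompt + output tokens per request
per_request = per_token * avg_context_len

hbm_bytes = 32 * 1024**3                 # assumed HBM available to the instance
weight_bytes = 16 * 1024**3              # assumed memory consumed by model weights
kv_budget = hbm_bytes - weight_bytes

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache per request: {per_request / 1024**2:.0f} MiB")
print(f"~upper bound for --max-num-seqs: {kv_budget // per_request}")
```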
#### Latency-throughput tradeoff
As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency.
`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request.
Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload.
In a similar way, `--max-num-batched-tokens` can be adjusted down to improve latency, or up to improve throughput.
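For offline experiments, these knobs can be swept directly from Python. The snippet below is only a sketch: the model name is a placeholder, and it assumes the standard `LLM` entrypoint forwards `max_num_seqs` and `max_num_batched_tokens` to the engine.

```python
from vllm import LLM, SamplingParams

# Lower max_num_seqs -> fewer concurrent decode slots -> lower per-request latency.
# Higher max_num_seqs -> more HBM reserved for KV cache -> higher overall throughput.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # placeholder model
    max_num_seqs=64,                      # concurrent decode slots
    max_num_batched_tokens=2048,          # tokens processed per scheduler step
)
outputs = llm.generate(["Hello, TPU!"], SamplingParams(max_tokens=32))
```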
#### Compilation and Caching
Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.
To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can range significantly, anywhere from a few minutes to an hour depending on the size of the model and context length used.
Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs.
Use the `VLLM_XLA_CACHE_PATH` environment variable to write the cache to shareable storage for future deployed nodes (for example, when using autoscaling).
#### Reducing compilation time
This initial compilation time varies significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence compilation time include the model size and `--max-num-batched-tokens`. Other knobs you can tune include the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
### Optimize based on your data
#### max model len vs. most model len
![most_model_len](../assets/design/v1/tpu/most_model_len.png)
If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
For example, if 1% of requests are 32k tokens long and 99% are 2k tokens long, you can pass `--max-model-len 32768` and set `VLLM_TPU_MOST_MODEL_LEN=2048`.
Requests are then subdivided into max-model-len and most-model-len categories; for the latter category, performance improves because the server can process more requests at a time.
#### Padding
For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc.
The server pads requests to fixed lengths before sending them to the model to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are 2 ways to pad the requests:
1) the default exponential padding (pad to the nearest power of 2)
2) bucket padding (pad to the nearest linearly increasing bucket).
When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`.
For example, max_model_len=512, padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512].
The fewer tokens we pad, the less unnecessary computation the TPU does and the better the performance. For example, if num_tokens=300, exponential padding pads to 512, while the bucket padding above pads to 320.
However, you need to choose the padding gap carefully. If the gap is too small, the number of buckets is large, which increases warmup (precompile) time and the memory needed to store the compiled graphs; too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement over the default exponential padding.
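The padding arithmetic can be reproduced with a few lines of plain Python. This is only a sketch that matches the example bucket list above; the exact bucket construction inside vLLM may differ.

```python
def make_buckets(max_model_len: int, gap: int) -> list[int]:
    # Reproduces the example above: [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]
    # for max_model_len=512, gap=64. Buckets double until reaching the gap,
    # then grow linearly by `gap` up to max_model_len.
    buckets, b = [], 16
    while b < gap:
        buckets.append(b)
        b *= 2
    while b < max_model_len:
        buckets.append(b)
        b += gap
    buckets.append(max_model_len)
    return buckets

def exponential_pad(num_tokens: int) -> int:
    # Default behavior: pad to the nearest power of two (starting at 16).
    p = 16
    while p < num_tokens:
        p *= 2
    return p

def bucket_pad(num_tokens: int, buckets: list[int]) -> int:
    return next(b for b in buckets if b >= num_tokens)

buckets = make_buckets(max_model_len=512, gap=64)
print(buckets)                   # [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]
print(exponential_pad(300))      # 512
print(bucket_pad(300, buckets))  # 320
```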
**If possible, use the precision that matches the chip's hardware acceleration**
- v5e has int4/int8 hardware acceleration in the MXU
- v6e has int4/int8 hardware acceleration in the MXU
Supported quantized formats and features in vLLM on TPU [Jul '25]
- INT8 W8A8
- INT8 W8A16
- FP8 KV cache
- [WIP] FP8 W8A8
- [WIP] AWQ
- [WIP] FP4 W4A8
**Don't set TP to be less than the number of chips on a single-host deployment**
Although it's common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types).
### Tune your workloads!
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
### Future Topics We'll Cover
#### Profiling
The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.
#### SPMD
More details to come.
**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.**

View File

@ -26,6 +26,8 @@ See <gh-file:LICENSE>.
## Developing
--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source][build-from-source] documentation for details.
@ -42,7 +44,7 @@ For an optimized workflow when iterating on C++/CUDA kernels, see the [Increment
Install MkDocs along with the [plugins](https://github.com/vllm-project/vllm/blob/main/mkdocs.yaml) used in the vLLM documentation, as well as required dependencies:
```bash
pip install -r requirements/docs.txt
uv pip install -r requirements/docs.txt
```
!!! note
@ -98,13 +100,14 @@ For additional features and advanced configurations, refer to the official [MkDo
??? console "Commands"
```bash
pip install -r requirements/dev.txt
# These commands are only for Nvidia CUDA platforms.
uv pip install -r requirements/common.txt -r requirements/dev.txt --torch-backend=auto
# Linting, formatting and static type checking
pre-commit install --hook-type pre-commit --hook-type commit-msg
pre-commit install
# You can manually run pre-commit with
pre-commit run --all-files
pre-commit run --all-files --show-diff-on-failure
# To manually run something from CI that does not run
# locally by default, you can run:
@ -122,6 +125,10 @@ For additional features and advanced configurations, refer to the official [MkDo
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
!!! note "Install python3-dev if Python.h is missing"
If any of the above commands fails with `Python.h: No such file or directory`, install
`python3-dev` with `sudo apt install python3-dev`.
!!! note
Currently, the repository is not fully checked by `mypy`.
@ -153,7 +160,7 @@ Using `-s` with `git commit` will automatically add this header.
!!! tip
You can enable automatic sign-off via your IDE:
- **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window.
It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`.
- **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings)

View File

@ -57,8 +57,7 @@ cc the PyTorch release team to initiate discussion on how to address them.
## Update CUDA version
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example,
`torch2.7.0+cu12.6`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, torch `2.7.1+cu126`) is uploaded to PyPI. However, vLLM may require a different CUDA version,
such as 12.8 for Blackwell support.
This complicates the process as we cannot use the out-of-the-box
`pip install torch torchvision torchaudio` command. The solution is to use
@ -134,7 +133,7 @@ MAX_JOBS=16 uv pip install --system \
```bash
uv pip install --system \
--no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
--no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.5"
```
### causal-conv1d

View File

@ -9,10 +9,13 @@ We support tracing vLLM workers using the `torch.profiler` module. You can enabl
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
When using `vllm bench serve`, you can enable profiling by passing the `--profile` flag.
Traces can be visualized using <https://ui.perfetto.dev/>.
!!! tip
You can directly call the bench module without installing vLLM using `python -m vllm.entrypoints.cli.main bench`.
!!! tip
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
@ -35,10 +38,10 @@ VLLM_TORCH_PROFILER_DIR=./vllm_profile \
--model meta-llama/Meta-Llama-3-70B
```
benchmark_serving.py:
vllm bench command:
```bash
python benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model meta-llama/Meta-Llama-3-70B \
--dataset-name sharegpt \
@ -69,13 +72,13 @@ apt install nsight-systems-cli
For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference.
The following is an example using the `benchmarks/benchmark_latency.py` script:
The following is an example using the `vllm bench latency` script:
```bash
nsys profile -o report.nsys-rep \
--trace-fork-before-exec=true \
--cuda-graph-trace=node \
python benchmarks/benchmark_latency.py \
vllm bench latency \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-iters-warmup 5 \
--num-iters 1 \
@ -98,7 +101,7 @@ nsys profile -o report.nsys-rep \
vllm serve meta-llama/Llama-3.1-8B-Instruct
# client
python benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 1 \
@ -132,7 +135,7 @@ You can view these profiles either as summaries in the CLI, using `nsys stats [p
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
@ -143,7 +146,7 @@ You can view these profiles either as summaries in the CLI, using `nsys stats [p
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
...
...
```
GUI example:

View File

@ -1,40 +0,0 @@
# Automatic Prefix Caching
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```
In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:
```text
hash(prefix tokens + block tokens) <--> KV Block
```
With this mapping, we can add another indirection in vLLM's KV cache management. Previously, each sequence in vLLM maintained a mapping from its logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed on their own, which enables us to manage the KV cache like an ordinary cache in an operating system.
## Generalized Caching Policy
Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
Managing KV cache with a hash table allows us to implement flexible caching policies. As an example, in current vLLM, we implement the following eviction policy:
* When there are no free blocks left, we evict a KV block whose reference count (i.e., the number of current requests using the block) equals 0.
* If there are multiple blocks with a reference count of 0, we prioritize evicting the least recently used (LRU) block.
* If there are multiple blocks whose last access times are the same, we prioritize evicting the block at the end of the longest prefix (i.e., the one with the maximum number of blocks before it).
Note that this eviction policy effectively implements the same policy as [RadixAttention](https://lmsys.org/blog/2024-01-17-sglang/) when applied to models with full attention, which prioritizes evicting leaf nodes in the prefix tree that have a zero reference count and are least recently used.
However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:
* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block also include the LoRA ID the request is querying for, enabling caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.

View File

@ -0,0 +1,236 @@
# Fused MoE Modular Kernel
## Introduction
FusedMoEModularKernel is implemented [here](gh-file:/vllm/model_executor/layers/fused_moe/modular_kernel.py)
Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.
* Contiguous / Standard / Non-Batched, and
* Batched
!!! note
The terms Contiguous, Standard, and Non-Batched are used interchangeably throughout the document.
The input activation format completely depends on the All2All Dispatch being used.
* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K) along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example.
* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example.
The FusedMoE operation is generally made of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below
![](../assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png "FusedMoE Non-Batched")
![](../assets/design/fused_moe_modular_kernel/fused_moe_batched.png "FusedMoE Batched")
!!! note
The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain.
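To make the two activation formats concrete, here is a small shape-only sketch; the tensor sizes are illustrative placeholders rather than values taken from any particular kernel.

```python
import torch

M, K, num_topk = 8, 4096, 2            # tokens, hidden size, experts per token
num_experts, max_tokens = 4, 16        # illustrative values

# Contiguous / Standard / Non-Batched format:
#   activations are (M, K); TopK ids and weights are (M, num_topk).
contig_act = torch.randn(M, K)
topk_ids = torch.randint(0, num_experts, (M, num_topk))
topk_weights = torch.rand(M, num_topk)

# Batched format:
#   activations are (num_experts, max_tokens, K); only the first
#   expert_num_tokens[e] rows of slice e hold valid tokens.
batched_act = torch.zeros(num_experts, max_tokens, K)
expert_num_tokens = torch.zeros(num_experts, dtype=torch.int32)
for t in range(M):
    for e in topk_ids[t].tolist():
        slot = int(expert_num_tokens[e])
        batched_act[e, slot] = contig_act[t]
        expert_num_tokens[e] += 1
```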
## Motivation
As can be seen from the diagrams, there are a lot of operations and there can be a variety of implementations for each operation. The set of ways the operations can be put together to make a valid FusedMoE implementation quickly becomes intractable. The Modular Kernel framework addresses this issue, by grouping the operations into logical components. This broad categorization makes the combinations manageable and prevents code-duplication. This also decouples the All2All Dispatch & Combine implementations from the FusedMoE implementations and allows for their independent development and testing. Furthermore, the Modular Kernel framework introduces Abstract classes for the different components thus providing a well-defined skeleton for future implementations.
The rest of the document will focus on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straight-forward.
## ModularKernel Components:
FusedMoEModularKernel splits the FusedMoE operation into 3 parts,
1. TopKWeightAndReduce
2. FusedMoEPrepareAndFinalize
3. FusedMoEPermuteExpertsUnpermute
### TopKWeightAndReduce
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`, but some implementations choose to do it in `FusedMoEPrepareAndFinalize`. To enable this flexibility, we have a TopKWeightAndReduce abstract class.
Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
`FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself.
* `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs the `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction.
### FusedMoEPrepareAndFinalize
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and All2All Dispatch. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).
![](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png "FusedMoEPrepareAndFinalize Blocks")
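As a rough mental model of this interface, the sketch below uses the method names from the doc but with simplified, placeholder argument lists; the real vLLM abstract class has more parameters and return values.

```python
from abc import ABC, abstractmethod
import torch

class MyPrepareAndFinalize(ABC):
    """Simplified stand-in for FusedMoEPrepareAndFinalize (not the real signature)."""

    @abstractmethod
    def prepare(self, a: torch.Tensor, topk_ids: torch.Tensor,
                topk_weights: torch.Tensor):
        """Quantize the input activations and run the All2All Dispatch.
        Returns the dispatched (possibly quantized) activations and their scales."""

    @abstractmethod
    def finalize(self, expert_output: torch.Tensor,
                 weight_and_reduce_impl) -> torch.Tensor:
        """Run the All2All Combine; depending on `weight_and_reduce_impl`,
        also apply the TopK weights and reduce across the topk dimension."""
```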
### FusedMoEPermuteExpertsUnpermute
The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happen. The `FusedMoEPermuteExpertsUnpermute` abstract class exposes a few important functions,
* apply()
* workspace_shapes()
* finalize_weight_and_reduce_impl()
#### apply()
The `apply` method is where the implementations perform
* Permute
* Matmul with weight W1
* Act + Mul
* Quantization
* Matmul with weight W2
* Unpermute
* Maybe TopK Weight Application + Reduction
#### workspace_shapes()
The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that effect, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces could then be used as intermediate buffers in the FusedMoE implementation.
#### finalize_weight_and_reduce_impl()
It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use.
![](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png "FusedMoEPermuteExpertsUnpermute Blocks")
### FusedMoEModularKernel
`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects.
`FusedMoEModularKernel` pseudocode/sketch,
```
FusedMoEModularKernel::__init__(self,
                                prepare_finalize: FusedMoEPrepareAndFinalize,
                                fused_experts: FusedMoEPermuteExpertsUnpermute):
    self.prepare_finalize = prepare_finalize
    self.fused_experts = fused_experts

FusedMoEModularKernel::forward(self, DP_A):
    Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...)
    workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...)

    # allocate workspaces
    workspace13 = torch.empty(workspace13_shape, ...)
    workspace2 = torch.empty(workspace2_shape, ...)

    # execute fused_experts
    fe_out = self.fused_experts.apply(Aq, A_scale, workspace13, workspace2, ...)

    # war_impl is a TopKWeightAndReduceNoOp if the fused_experts implementation
    # performs the TopK Weight Application and Reduction itself.
    war_impl = self.fused_experts.finalize_weight_and_reduce_impl()

    output = self.prepare_finalize.finalize(fe_out, war_impl, ...)
    return output
```
## How-To
### How To Add a FusedMoEPrepareAndFinalize Type
Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example,
* PplxPrepareAndFinalize type is backed by Pplx All2All kernels,
* DeepEPHTPrepareAndFinalize type is backed by DeepEP High-Throughput All2All kernels, and
* DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels.
#### Step 1: Add an All2All manager
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).
#### Step 2: Add a FusedMoEPrepareAndFinalize Type
This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class.
`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.
`FusedMoEPrepareAndFinalize::finalize()`: Maybe perform TopK Weight Application and Reduction and All2All Combine. Typically the Combine function from the relevant All2AllManager is invoked.
`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.
`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoE::select_experts` function so it can be respected. If there are no strict requirements, return None.
`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: This is the maximum number of tokens that would be submitted to the All2All Dispatch at once.
`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K). Here max_num_tokens = num_dispatchers() * max_num_tokens_per_rank().
We suggest picking an already existing `FusedMoEPrepareAndFinalize` implementation that matches your All2All implementation closely and using it as a reference.
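For orientation, here is a minimal sketch of what a new `FusedMoEPrepareAndFinalize` type might look like. The method names follow the list above, but the constructor, argument lists, and import path are assumptions for illustration; use an existing type such as `PplxPrepareAndFinalize` as the authoritative reference for the exact signatures.

```python
import torch

# NOTE: illustrative sketch only. The abstract base class and the activation
# format enum live under vllm/model_executor/layers/fused_moe/; the module
# layout and exact method signatures may differ from what is shown here.
from vllm.model_executor.layers.fused_moe.modular_kernel import (
    FusedMoEActivationFormat, FusedMoEPrepareAndFinalize)


class MyAll2AllPrepareAndFinalize(FusedMoEPrepareAndFinalize):
    """Hypothetical FusedMoEPrepareAndFinalize backed by a custom all2all kernel."""

    def __init__(self, all2all_handle, max_tokens_per_rank: int, world_size: int):
        self.handle = all2all_handle          # fetched from the All2All Manager (Step 1)
        self.max_tokens_per_rank = max_tokens_per_rank
        self.world_size = world_size

    def prepare(self, *args, **kwargs):
        # Quantize the input activations, then invoke the handle's Dispatch.
        raise NotImplementedError

    def finalize(self, *args, **kwargs):
        # Optionally apply the TopK weights / reduce (see TopKWeightAndReduce),
        # then invoke the handle's Combine.
        raise NotImplementedError

    def activation_format(self):
        # BatchedExperts if the Dispatch output is batched, Standard otherwise.
        return FusedMoEActivationFormat.BatchedExperts

    def topk_indices_dtype(self):
        # Return None if the all2all kernel imposes no dtype requirement.
        return torch.int32

    def max_num_tokens_per_rank(self):
        return self.max_tokens_per_rank

    def num_dispatchers(self):
        return self.world_size
```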
### How To Add a FusedMoEPermuteExpertsUnpermute Type
FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows:
`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats. i.e. Contiguous / Batched format.
`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically, implementations whose input activation format is `FusedMoEActivationFormat.Standard` support chunking, while `FusedMoEActivationFormat.BatchedExperts` implementations do not.
`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert map.
`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` /
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` /
`FusedMoEPermuteExpertsUnpermute::apply`: Refer to `FusedMoEPermuteExpertsUnpermute` section above.
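To add a new type, subclass `FusedMoEPermuteExpertsUnpermute` and implement these methods. The sketch below is illustrative only; the import path, argument lists, and return conventions are assumptions, so defer to an existing implementation such as `TritonExperts` for the authoritative signatures.

```python
# NOTE: illustrative sketch only; the module layout and exact method
# signatures may differ from what is shown here.
from vllm.model_executor.layers.fused_moe.modular_kernel import (
    FusedMoEActivationFormat, FusedMoEPermuteExpertsUnpermute)


class MyExperts(FusedMoEPermuteExpertsUnpermute):
    """Hypothetical contiguous-format FusedMoEPermuteExpertsUnpermute."""

    def activation_formats(self):
        # (input activation format, output activation format)
        return (FusedMoEActivationFormat.Standard,
                FusedMoEActivationFormat.Standard)

    def supports_chunking(self) -> bool:
        return True

    def supports_expert_map(self) -> bool:
        return True

    def workspace_shapes(self, *args, **kwargs):
        # Return the two workspace shapes, the FusedMoE output shape and the
        # workspace dtype (the exact tuple layout is fixed by the base class).
        raise NotImplementedError

    def finalize_weight_and_reduce_impl(self):
        # TopKWeightAndReduceNoOp if apply() already applies the topk weights
        # and reduces; otherwise the TopKWeightAndReduce* object that
        # FusedMoEPrepareAndFinalize::finalize() should run.
        raise NotImplementedError

    def apply(self, *args, **kwargs):
        # Permute -> matmul(W1) -> act + mul -> quantize -> matmul(W2) -> unpermute,
        # using the allocated workspaces as intermediate buffers.
        raise NotImplementedError
```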
### FusedMoEModularKernel Initialization
The `FusedMoEMethodBase` class has two methods that are collectively responsible for creating the `FusedMoEModularKernel` object:
* select_gemm_impl, and
* init_prepare_finalize
#### select_gemm_impl
The `select_gemm_impl` method is left unimplemented in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
Please refer to the implementations in the following derived classes:
* `UnquantizedFusedMoEMethod`
* `CompressedTensorsW8A8Fp8MoEMethod`
* `CompressedTensorsW8A8Fp8MoECutlassMethod`
* `Fp8MoEMethod`
* `ModelOptNvFp4FusedMoE`
#### init_prepare_finalize
Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object.
Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).
**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to what FusedMoE implementation is used.
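Putting the two methods together, the flow looks roughly like the pseudocode below (same sketch style as earlier; the class name `MyMoEMethod` and the argument lists are simplified illustrations, not the actual vLLM signatures):

```
class MyMoEMethod(FusedMoEMethodBase):  # hypothetical derived class

    def select_gemm_impl(self, prepare_finalize, ...):
        # Pick a FusedMoEPermuteExpertsUnpermute that matches the activation
        # format produced by the chosen FusedMoEPrepareAndFinalize.
        if prepare_finalize.activation_format == FusedMoEActivationFormat.BatchedExperts:
            return BatchedTritonExperts(...)
        return TritonExperts(...)

    def init_prepare_finalize(self, ...):
        # Built according to the input and env settings (EP backend, etc.).
        prepare_finalize = make_prepare_finalize(...)
        fused_experts = self.select_gemm_impl(prepare_finalize, ...)
        # Override FusedMoEMethodBase::fused_experts so that apply()
        # transparently uses the modular kernel.
        self.fused_experts = FusedMoEModularKernel(prepare_finalize, fused_experts)
```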
### How To Unit Test
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).
The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and, if they are compatible, runs some correctness tests.
If you are adding some `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations,
1. Add the new type to `MK_ALL_PREPARE_FINALIZE_TYPES` or `MK_FUSED_EXPERT_TYPES`, as appropriate, in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py).
2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`,
`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py)
Doing this will add the new implementation to the test suite.
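For example, assuming the sketch classes from the previous sections (`MyAll2AllPrepareAndFinalize` and `MyExperts`), the registration would look roughly like this; the real lists in `mk_objects.py` already contain the existing implementations:

```python
# tests/kernels/moe/modular_kernel_tools/mk_objects.py (illustrative)
MK_ALL_PREPARE_FINALIZE_TYPES = [
    # ... existing FusedMoEPrepareAndFinalize types ...
    MyAll2AllPrepareAndFinalize,
]

MK_FUSED_EXPERT_TYPES = [
    # ... existing FusedMoEPermuteExpertsUnpermute types ...
    MyExperts,
]
```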
### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
As a side-effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked
with incompatible types, the script will error.
### How To Profile
Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py).
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
## FusedMoEPrepareAndFinalize Implementations
The following table lists the `FusedMoEPrepareAndFinalize` implementations at the time of writing:
| Implementation | Type | Comments |
| :--- | :--- | :--- |
| DeepEPHTPrepareAndFinalize | Contiguous / Non-Batched | Uses the DeepEP High-Throughput all2all kernels. |
| DeepEPLLPrepareAndFinalize | Batched | Uses the DeepEP Low-Latency all2all kernels. |
| PplxPrepareAndFinalize | Batched | Uses the Perplexity all2all kernels. |
| FlashInferCutlassMoEPrepareAndFinalize | Contiguous | |
| MoEPrepareAndFinalizeNoEP | Contiguous | This implementation is used when there is no EP. i.e. no all2all kernels are invoked. |
| BatchedPrepareAndFinalize | Batched | A reference prepare/finalize class that reorganizes the tokens into expert-batched format, i.e. E x max_num_tokens x K. (Does not use any all2all kernels; primarily used in unit testing.) |
## FusedMoEPermuteExpertsUnpermute Implementations
The following table lists the `FusedMoEPermuteExpertsUnpermute` implementations at the time of writing:
| Implementation | Type | Comment |
| :--- | :--- | :--- |
| BatchedDeepGemmExperts | Batched | Uses DeepGEMM's Masked Grouped GEMM kernels for the fused_moe operation. |
| BatchedTritonExperts | Batched | Uses a Triton Kernel for the Batched matmuls. |
| BatchedTritonOrDeepGemmExperts | Batched | Chooses either the `BatchedDeepGemmExperts` or `BatchedTritonExperts` based on environment settings. |
| DeepGemmExperts | Contiguous / Non-Batched | Uses DeepGEMM's Grouped GEMM kernels for the fused_moe operation. |
| TritonExperts | Contiguous / Non-Batched | Uses a Triton Kernel for fused_moe matmuls. |
| TritonOrDeepGemmExperts | Contiguous / Non-Batched | Chooses either the `DeepGemmExperts` or `TritonExperts` based on fused_moe inputs. |
| CutlassExpertsFP8 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp8 matmuls. |
| CutlassExpertsFP4 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp4 matmuls. |
| FlashInferExperts | Contiguous | Uses fused_moe operation from FlashInfer |
| NaiveBatchedExperts | Batched | Reference Batched Experts implementation. Primarily used in unit tests. |

View File

@ -1,4 +1,4 @@
# Integration with HuggingFace
# Integration with Hugging Face
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.

View File

@ -1,21 +1,23 @@
# P2P NCCL Connector
An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.
# Detailed Design
## Detailed Design
## Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:
### Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:
1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
2. The Proxy/Router selects a **1P1D (1 Prefill instance + 1 Decode instance)** through either round-robin or random selection, generates a `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to **1**, and then forwards the request to the **P instance**.
3. Immediately afterward, the Proxy/Router forwards the **original HTTP request** to the **D instance**.
4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT_ASYNC** mode). The D instance's `zmq_addr` can be resolved through the `request_id`.
5. The **D instance** has a **dedicated thread** for receiving the KV cache (to avoid blocking the main process). The received KV cache is saved into the **GPU memory buffer**, the size of which is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the **local Tensor memory pool**.
6. During the **Decode**, the D instance's main process retrieves the KV cache (transmitted by the P instance) from either the **GPU buffer** or the **memory pool**, thereby **skipping Prefill**.
7. After completing **Decode**, the D instance returns the result to the **Proxy/Router**, which then forwards it to the **client**.
![image1](https://github.com/user-attachments/assets/fb01bde6-755b-49f7-ad45-48a94b1e10a7)
## Proxy/Router (Demo)
### Proxy/Router (Demo)
A simple HTTP service acts as the entry point for client requests and starts a background thread to listen for P/D instances reporting their HTTP IP and PORT, as well as ZMQ IP and PORT. It maintains a dictionary of `http_addr -> zmq_addr`. The `http_addr` is the IP:PORT for the vLLM instance's request, while the `zmq_addr` is the address for KV cache handshake and metadata reception.
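The sketch below illustrates that registration loop, assuming `pyzmq`; it is not the actual demo proxy (see `disagg_proxy_p2p_nccl_xpyd.py` in the examples for that), and the socket type and message format are assumptions for illustration.

```python
import time
import zmq

registry: dict[str, str] = {}    # http_addr -> zmq_addr
last_seen: dict[str, float] = {}

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)       # socket type is an assumption
sock.bind("tcp://0.0.0.0:30001")  # proxy_port, matching the examples below

while True:
    # e.g. {"http_addr": "10.0.1.2:20001", "zmq_addr": "10.0.1.2:21001"}
    msg = sock.recv_json()
    registry[msg["http_addr"]] = msg["zmq_addr"]
    last_seen[msg["http_addr"]] = time.time()
```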
@ -29,13 +31,13 @@ Currently, to quickly verify whether xPyD can work, a round-robin selection of 1
Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (currently every 3 seconds) to register (i.e., report `http_addr -> zmq_addr`) and keep the connection alive. If an instance crashes and fails to send a ping for a certain period of time, the Proxy/Router will remove the timed-out instance (this feature has not yet been developed).
## KV Cache Transfer Methods
### KV Cache Transfer Methods
There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache.
Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT.
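For example, to select PUT_ASYNC for a P instance, add the `send_type` field to `kv_connector_extra_config`; the surrounding fields mirror the launch commands later in this document, and the exact accepted value strings are defined by the connector:

```bash
# Appended to the P instance's vllm serve command (see the full commands below).
--kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"send_type":"PUT_ASYNC","proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}'
```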
## P2P Communication via ZMQ & NCCL
### P2P Communication via ZMQ & NCCL
As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed without being constrained by rank or world size, which supports dynamic scaling (expansion and contraction) of instances with PD disaggregation. This means that adding or removing P/D instances does not require a full system restart.
@ -43,7 +45,7 @@ Each P/D instance only needs to create a single `P2pNcclEngine` instance. This i
When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size.
## NCCL Group Topology
### NCCL Group Topology
Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance.
@ -51,7 +53,7 @@ Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCa
Each NCCL group occupies a certain amount of GPU memory buffer for communication, the size of which is primarily influenced by the `NCCL_MAX_NCHANNELS` environment variable. When `NCCL_MAX_NCHANNELS=16`, an NCCL group typically occupies 100MB, while when `NCCL_MAX_NCHANNELS=8`, it usually takes up 52MB. For large-scale xPyD configurations—such as DeepSeek's 96P144D—this implementation is currently not feasible. Moving forward, we are considering using RDMA for point-to-point communication and are also keeping an eye on UCCL.
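For example, capping the channel count before launching the instances keeps each NCCL group at roughly 52MB instead of 100MB:

```bash
export NCCL_MAX_NCHANNELS=8
```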
## GPU Memory Buffer and Tensor Memory Pool
### GPU Memory Buffer and Tensor Memory Pool
The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. If it is too large, it will reduce the KVCache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5% to 10% of the memory size.
@ -59,15 +61,15 @@ If the `--max-num-seqs` parameter for P instances is set to a large value, due t
To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVCache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVCache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store.
# Install vLLM
## Install vLLM
```shell
pip install "vllm>=0.9.2"
```
# Run xPyD
## Run xPyD
## Instructions
### Instructions
- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance.
@ -79,16 +81,16 @@ pip install "vllm>=0.9.2"
- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
## Run 1P3D
### Run 1P3D
### Proxy (e.g. 10.0.1.1)
#### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? console "Command"
@ -110,7 +112,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```
### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
#### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
??? console "Command"
@ -132,7 +134,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```
### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
#### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
??? console "Command"
@ -154,7 +156,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```
### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
#### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
??? console "Command"
@ -176,16 +178,16 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
## Run 3P1D
### Run 3P1D
### Proxy (e.g. 10.0.1.1)
#### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
??? console "Command"
@ -207,7 +209,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```
### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
#### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
??? console "Command"
@ -229,7 +231,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```
### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
#### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
??? console "Command"
@ -251,7 +253,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```
### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
#### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
??? console "Command"
@ -273,7 +275,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
# Single request
## Single request
```shell
curl -X POST -s http://10.0.1.1:10001/v1/completions \
@ -286,12 +288,12 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
}'
```
# Benchmark
## Benchmark
??? console "Command"
```shell
python3 benchmark_serving.py \
vllm bench serve \
--backend vllm \
--model base_model \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
@ -310,14 +312,14 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
--num-prompts 1000
```
# Shut down
## Shut down
```shell
pgrep python | xargs kill -9 && pkill -f python
```
# Test data
## Test data
## **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s
### **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s
![testdata](https://github.com/user-attachments/assets/cef0953b-4567-4bf9-b940-405b92a28eb1)

View File

@ -1,5 +1,9 @@
# vLLM Paged Attention
!!! warning
This document is being kept in the vLLM documentation for historical purposes.
It no longer describes the code used in vLLM today.
Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
This kernel is designed to be compatible with

View File

@ -34,23 +34,25 @@ th:not(:first-child) {
}
</style>
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | | ❌ | | ✅ | | | | | | | | |
| [pooling](../models/pooling_models.md) | ✅\* | ✅\* | | ❌ | | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
| best-of | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | |
| beam-search | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
\* Chunked prefill and prefix caching are only applicable to last-token pooling.
[](){ #feature-x-hardware }
## Feature x Hardware
@ -62,9 +64,9 @@ th:not(:first-child) {
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |

View File

@ -177,6 +177,70 @@ Multi-image input can be extended to perform video captioning. We show this with
You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.
Instead of NumPy arrays, you can also pass `'torch.Tensor'` instances, as shown in this example using Qwen2.5-VL:
??? code
```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

model_path = "Qwen/Qwen2.5-VL-3B-Instruct/"
video_path = "https://content.pexels.com/videos/free-videos.mp4"

llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.8,
    enforce_eager=True,
    limit_mm_per_prompt={"video": 1},
)

sampling_params = SamplingParams(
    max_tokens=1024,
)

video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "text", "text": "describe this video."},
        {
            "type": "video",
            "video": video_path,
            "total_pixels": 20480 * 28 * 28,
            "min_pixels": 16 * 28 * 28
        }
    ]},
]

messages = video_messages
processor = AutoProcessor.from_pretrained(model_path)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
!!! note
'process_vision_info' is only applicable to Qwen2.5-VL and similar models.
Full example: <gh-file:examples/offline_inference/vision_language.py>
### Audio Inputs
@ -279,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:
```bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
```
@ -358,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
First, launch the OpenAI-compatible server:
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
```
Then, you can use the OpenAI client as follows:

View File

@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
First, launch the OpenAI-compatible server:
```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
--max-model-len 4096 --enable-prompt-embeds
```

View File

@ -6,6 +6,7 @@ Contents:
- [Supported Hardware](supported_hardware.md)
- [AutoAWQ](auto_awq.md)
- [AutoRound](auto_round.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)

View File

@ -0,0 +1,103 @@
# AutoRound
[AutoRound](https://github.com/intel/auto-round) is Intel's advanced quantization algorithm designed to produce highly efficient **INT2, INT3, INT4, and INT8**
quantized large language models—striking an optimal balance between accuracy and deployment performance.
AutoRound applies weight-only quantization to transformer-based models, enabling significant memory savings and faster
inference while maintaining near-original accuracy. It supports a wide range of hardware platforms, including **CPUs,
Intel GPUs, HPUs, and CUDA-enabled devices**.
Please refer to the [AutoRound guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md) for more details.
Key Features:
✅ **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** are supported
✅ **10+ vision-language models (VLMs)** are supported
✅ **Per-layer mixed-bit quantization** for fine-grained control
✅ **RTN (Round-To-Nearest) mode** for quick quantization with slight accuracy loss
✅ **Multiple quantization recipes**: best, base, and light
✅ Advanced utilities such as immediate packing and support for **10+ backends**
## Installation
```bash
uv pip install auto-round
```
## Quantizing a model
For VLMs, please change to `auto-round-mllm` in CLI usage and `AutoRoundMLLM` in API usage.
### CLI usage
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--bits 4 \
--group_size 128 \
--format "auto_round" \
--output_dir ./tmp_autoround
```
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--format "gguf:q4_k_m" \
--output_dir ./tmp_autoround
```
### API usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)
# the best accuracy, 4-5X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)
# 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=50, lr=5e-3, bits=bits, group_size=group_size, sym=sym )
output_dir = "./tmp_autoround"
# format= 'auto_round'(default), 'auto_gptq', 'auto_awq'
autoround.quantize_and_save(output_dir, format="auto_round")
```
## Running a quantized model with vLLM
Here is some example code to run auto-round format in vLLM:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and
ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

View File

@ -2,19 +2,26 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
<style>
th {
white-space: nowrap;
min-width: 0 !important;
}
</style>
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.

View File

@ -103,7 +103,8 @@ When tool_choice='required' is set, the model is guaranteed to generate one or m
vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
However, when `tool_choice='none'` is specified, vLLM includes tool definitions from the prompt.
!!! note
When tools are specified in the request, vLLM includes tool definitions in the prompt by default, regardless of the `tool_choice` setting. To exclude tool definitions when `tool_choice='none'`, use the `--exclude-tools-when-tool-choice-none` option.
## Automatic Function Calling

View File

@ -33,7 +33,7 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.arm \
docker build -f docker/Dockerfile.cpu \
--tag vllm-cpu-env .
# Launching OpenAI server

View File

@ -20,16 +20,16 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
You can install vLLM using either `pip` or `uv pip`:
```bash
# Install vLLM with CUDA 12.8.
# If you are using pip.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
# If you are using uv.
uv pip install vllm --torch-backend=auto
```
??? console "pip"
```bash
# Install vLLM with CUDA 12.8.
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
```
We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first.
!!! note
@ -38,10 +38,10 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
```bash
# Install vLLM with CUDA 11.8.
export VLLM_VERSION=0.6.1.post1
export PYTHON_VERSION=312
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
# Install vLLM with a specific CUDA version (e.g., 11.8 or 12.6).
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=118 # or 126
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
```
[](){ #install-the-latest-code }
@ -50,36 +50,22 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on an x86 platform with CUDA 12 for every commit since `v0.5.3`.
##### Install the latest code using `pip`
```bash
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
```
`--pre` is required for `pip` to consider pre-released versions.
Another way to install the latest code is to use `uv`:
```bash
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
```
##### Install specific revisions using `pip`
??? console "pip"
```bash
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
```
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:
`--pre` is required for `pip` to consider pre-released versions.
```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
```
Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
##### Install specific revisions using `uv`
##### Install specific revisions
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
@ -92,17 +78,35 @@ uv pip install vllm \
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
??? note "pip"
If you want to access the wheels for previous commits (e.g. to bisect the behavior change,
performance regression), due to the limitation of `pip`, you have to specify the full URL of the
wheel file by embedding the commit hash in the URL:
```bash
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
```
Note that the wheels are built with Python 3.8 ABI (see [PEP
425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible
with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a
placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in
the wheel metadata (the wheels listed in the extra index url have correct versions). Although we
don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the
wheels are still built with Python 3.8 ABI to keep the same wheel name as before.
# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
#### Set up using Python-only build (without compilation)
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
If you only need to change Python code, you can build and install vLLM without compilation. Using `uv pip`'s [`--editable` flag](https://docs.astral.sh/uv/pip/packages/#editable-packages), changes you make to the code will be reflected when you run vLLM:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```
This command will do the following:
@ -121,7 +125,7 @@ In case you see an error about wheel not found when running the above command, i
```bash
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
uv pip install --editable .
```
You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
@ -137,7 +141,7 @@ If you want to modify C++ or CUDA code, you'll need to build vLLM from source. T
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
uv pip install -e .
```
!!! tip
@ -152,14 +156,14 @@ pip install -e .
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
!!! note "Faster Kernel Development"
For frequent C++/CUDA kernel changes, after the initial `pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code.
For frequent C++/CUDA kernel changes, after the initial `uv pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code.
##### Use an existing PyTorch installation
There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.:
There are scenarios where the PyTorch dependency cannot be easily installed with `uv`, e.g.:
- Building vLLM with PyTorch nightly or a custom PyTorch build.
- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124` to [install PyTorch nightly](https://pytorch.org/get-started/locally/), and then build vLLM on top of it.
- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `uv pip install --index-url https://download.pytorch.org/whl/nightly/cu128 torch torchvision torchaudio` to [install PyTorch nightly](https://pytorch.org/get-started/locally/) and then build vLLM on top of it.
To build vLLM using an existing PyTorch installation:
@ -167,8 +171,8 @@ To build vLLM using an existing PyTorch installation:
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install --no-build-isolation -e .
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
```
##### Use the local cutlass for compilation
@ -179,7 +183,7 @@ To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to po
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass uv pip install -e .
```
##### Troubleshooting
@ -189,7 +193,7 @@ to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
```bash
export MAX_JOBS=6
pip install -e .
uv pip install -e .
```
This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory.
@ -228,7 +232,7 @@ Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
```bash
export VLLM_TARGET_DEVICE=empty
pip install -e .
uv pip install -e .
```
# --8<-- [end:build-wheel-from-source]

View File

@ -62,6 +62,11 @@ class MarkdownFormatter(HelpFormatter):
choices = f'`{"`, `".join(str(c) for c in choices)}`'
self._markdown_output.append(
f"Possible choices: {choices}\n\n")
elif ((metavar := action.metavar)
and isinstance(metavar, (list, tuple))):
metavar = f'`{"`, `".join(str(m) for m in metavar)}`'
self._markdown_output.append(
f"Possible choices: {metavar}\n\n")
self._markdown_output.append(f"{action.help}\n\n")

View File

@ -5,9 +5,98 @@ vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or
at runtime extremely quickly directly to the GPU, resulting in significantly
shorter Pod startup times and CPU memory usage. Tensor encryption is also supported.
For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well as a general usage guide for using Tensorizer with vLLM, see
the [vLLM example script](../../examples/others/tensorize_vllm_model.md).
vLLM fully integrates Tensorizer into its model loading machinery. The following gives a brief overview of how to get started with using Tensorizer on vLLM.
!!! note
Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
## Installing Tensorizer
To install `tensorizer`, run `pip install vllm[tensorizer]`.
## The basics
To load a model using Tensorizer, the model first needs to be serialized by
Tensorizer. [The example script](../../examples/others/tensorize_vllm_model.md) takes care of this process.
Let's walk through a basic example by serializing `facebook/opt-125m` using the script, and then loading it for inference.
## Serializing a vLLM model with Tensorizer
To serialize a model with Tensorizer, call the example script with the necessary
CLI arguments. The docstring for the script itself explains the CLI args
and how to use it properly in great detail, and we'll use one of the examples from the docstring directly, assuming we want to serialize and save our model at our S3 bucket example `s3://my-bucket`:
```bash
python examples/others/tensorize_vllm_model.py \
--model facebook/opt-125m \
serialize \
--serialized-directory s3://my-bucket \
--suffix v1
```
This saves the model tensors at `s3://my-bucket/vllm/facebook/opt-125m/v1`. If you intend on applying a LoRA adapter to your tensorized model, you can pass the HF id of the LoRA adapter in the above command, and the artifacts will be saved there too:
```bash
python examples/others/tensorize_vllm_model.py \
--model facebook/opt-125m \
--lora-path <lora_id> \
serialize \
--serialized-directory s3://my-bucket \
--suffix v1
```
## Serving the model using Tensorizer
Once the model is serialized where you want it, you can load the model using `vllm serve` or the `LLM` entrypoint. You can pass the directory where you saved the model to the `model` argument for `LLM()` and `vllm serve`. For example, to serve the tensorized model saved previously with the LoRA adapter, you'd do:
```bash
vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \
--load-format tensorizer \
--enable-lora
```
Or, with `LLM()`:
```python
from vllm import LLM
llm = LLM(
"s3://my-bucket/vllm/facebook/opt-125m/v1",
load_format="tensorizer",
enable_lora=True
)
```
## Options for configuring Tensorizer
`tensorizer`'s core objects that serialize and deserialize models are `TensorSerializer` and `TensorDeserializer` respectively. In order to pass arbitrary kwargs to these, which will configure the serialization and deserialization processes, you can provide them as keys to `model_loader_extra_config` with `serialization_kwargs` and `deserialization_kwargs` respectively. Full docstrings detailing all parameters for the aforementioned objects can be found in `tensorizer`'s [serialization.py](https://github.com/coreweave/tensorizer/blob/main/tensorizer/serialization.py) file.
As an example, CPU concurrency can be limited when serializing with `tensorizer` via the `limit_cpu_concurrency` parameter in the initializer for `TensorSerializer`. To set `limit_cpu_concurrency` to some arbitrary value, you would do so like this when serializing:
```bash
python examples/others/tensorize_vllm_model.py \
--model facebook/opt-125m \
--lora-path <lora_id> \
serialize \
--serialized-directory s3://my-bucket \
--serialization-kwargs '{"limit_cpu_concurrency": 2}' \
--suffix v1
```
As an example of customizing the loading process via `TensorDeserializer`, you could limit the number of concurrent readers during deserialization with the `num_readers` parameter in the initializer via `model_loader_extra_config` like so:
```bash
vllm serve s3://my-bucket/vllm/facebook/opt-125m/v1 \
--load-format tensorizer \
--enable-lora \
--model-loader-extra-config '{"deserialization_kwargs": {"num_readers": 2}}'
```
Or with `LLM()`:
```python
from vllm import LLM
llm = LLM(
"s3://my-bucket/vllm/facebook/opt-125m/v1",
load_format="tensorizer",
enable_lora=True,
model_loader_extra_config={"deserialization_kwargs": {"num_readers": 2}}
)
```

View File

@ -2,12 +2,19 @@
vLLM provides first-class support for generative models, which covers most of LLMs.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
For generative models, the only supported `--task` option is `"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
## Configuration
### Model Runner (`--runner`)
Run a model in generation mode via the option `--runner generate`.
!!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
## Offline Inference

View File

@ -1,28 +1,49 @@
# Pooling Models
vLLM also supports pooling models, including embedding, reranking and reward models.
vLLM also supports pooling models, such as embedding, classification and reward models.
In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.
!!! note
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
If the model doesn't implement this interface, you can set `--task` which tells vLLM
to convert the model into a pooling model.
We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!
| `--task` | Model type | Supported pooling tasks |
|------------|----------------------|-------------------------------|
| `embed` | Embedding model | `encode`, `embed` |
| `classify` | Classification model | `encode`, `classify`, `score` |
| `reward` | Reward model | `encode` |
## Configuration
## Pooling Tasks
### Model Runner
In vLLM, we define the following pooling tasks and corresponding APIs:
Run a model in pooling mode via the option `--runner pooling`.
!!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
### Model Conversion
vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.
| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|-------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
!!! tip
You can explicitly set `--convert <type>` to specify how to convert the model.
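As a sketch of how conversion looks in offline inference, assuming `--runner` and `--convert` are also exposed as the `runner` and `convert` arguments of [LLM][vllm.LLM]:
```python
from vllm import LLM

# `intfloat/e5-mistral-7b-instruct` is Llama/Mistral-based, so per the table
# above it is adapted into an embedding model via `--convert embed`. The
# explicit values mirror what `--runner auto` / `--convert auto` would infer.
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    runner="pooling",
    convert="embed",
)

(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))
```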
### Pooling Tasks
Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:
| Task | APIs |
|------------|--------------------|
@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
| `classify` | `classify` |
| `score` | `score` |
\*The `score` API falls back to `embed` task if the model does not support `score` task.
\* The `score` API falls back to `embed` task if the model does not support `score` task.
Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
### Pooler Configuration
By default, the pooler assigned to each task has the following attributes:
#### Predefined models
If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--override-pooler-config` option.
#### Converted models
If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax |
|------------|----------------|---------------|---------|
@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
| `embed` | `LAST` | ✅︎ | ❌ |
| `classify` | `LAST` | ❌ | ✅︎ |
These defaults may be overridden by the model's implementation in vLLM.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
which takes priority over the model's defaults.
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
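For example, a minimal sketch of overriding the pooler in offline inference, assuming `--override-pooler-config` maps to the `override_pooler_config` argument and that [PoolerConfig][vllm.config.PoolerConfig] exposes the attributes listed in the table below:
```python
from vllm import LLM
from vllm.config import PoolerConfig

# Force mean pooling with normalization, overriding both the model's own
# defaults and any Sentence Transformers configuration it ships with.
llm = LLM(
    model="BAAI/bge-base-en-v1.5",
    runner="pooling",
    override_pooler_config=PoolerConfig(pooling_type="MEAN", normalize=True),
)

(output,) = llm.embed("Hello, my name is")
```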
!!! note
The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
that is not based on [PoolerConfig][vllm.config.PoolerConfig].
## Offline Inference
The [LLM][vllm.LLM] class provides various methods for offline inference.
@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
```python
from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is")
data = output.outputs.data
@ -85,7 +106,7 @@ It is primarily designed for embedding models.
```python
from vllm import LLM
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
@ -102,7 +123,7 @@ It is primarily designed for classification models.
```python
from vllm import LLM
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")
probs = output.outputs.probs
@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
```python
from vllm import LLM
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.")
@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
from vllm import LLM, PoolingParams
llm = LLM(model="jinaai/jina-embeddings-v3",
task="embed",
runner="pooling",
trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."],
pooling_params=PoolingParams(dimensions=32))

View File

@ -1,7 +1,6 @@
# Supported Models
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.
@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:
```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
llm = LLM(model=...) # Name or path of your model
llm.apply_model(lambda model: print(type(model)))
```
@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
```python
from vllm import LLM
# For generative models (task=generate) only
llm = LLM(model=..., task="generate") # Name or path of your model
# For generative models (runner=generate) only
llm = LLM(model=..., runner="generate") # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
llm = LLM(model=..., task="embed") # Name or path of your model
# For pooling models (runner=pooling) only
llm = LLM(model=..., runner="pooling") # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
```
@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.
```python
from vllm import LLM
llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)
# For generative models (task=generate) only
# For generative models (runner=generate) only
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
# For pooling models (runner=pooling) only
output = llm.encode("Hello, my name is")
print(output)
```
@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
Specified using `--task generate`.
<style>
th {
white-space: nowrap;
@ -339,7 +336,7 @@ th {
| `DeepseekV2ForCausalLM` | DeepSeek-V2 | `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat`, etc. | | ✅︎ | ✅︎ |
| `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3`, etc. | | ✅︎ | ✅︎ |
| `Dots1ForCausalLM` | dots.llm1 | `rednote-hilab/dots.llm1.base`, `rednote-hilab/dots.llm1.inst`, etc. | | ✅︎ | ✅︎ |
| `Ernie4_5_ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Ernie4_5ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. |✅︎| ✅︎ | ✅︎ |
| `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Exaone4ForCausalLM` | EXAONE-4 | `LGAI-EXAONE/EXAONE-4.0-32B`, etc. | ✅︎ | ✅︎ | ✅︎ |
@ -365,6 +362,7 @@ th {
| `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ | ✅︎ |
| `HunYuanDenseV1ForCausalLM` | Hunyuan-7B-Instruct-0124 | `tencent/Hunyuan-7B-Instruct-0124` | ✅︎ | | ✅︎ |
| `HunYuanMoEV1ForCausalLM` | Hunyuan-80B-A13B | `tencent/Hunyuan-A13B-Instruct`, `tencent/Hunyuan-A13B-Pretrain`, `tencent/Hunyuan-A13B-Instruct-FP8`, etc. | ✅︎ | | ✅︎ |
| `HCXVisionForCausalLM` | HyperCLOVAX-SEED-Vision-Instruct-3B | `naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B` | | | ✅︎ |
| `InternLMForCausalLM` | InternLM | `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `InternLM2ForCausalLM` | InternLM2 | `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `InternLM3ForCausalLM` | InternLM3 | `internlm/internlm3-8b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
@ -391,7 +389,7 @@ th {
| `PhiMoEForCausalLM` | Phi-3.5-MoE | `microsoft/Phi-3.5-MoE-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Phi4FlashForCausalLM` | Phi-4-mini-flash-reasoning | `microsoft/Phi-4-mini-flash-reasoning`, etc. | | | |
| `PersimmonForCausalLM` | Persimmon | `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc. | | ✅︎ | ✅︎ |
| `Plamo2ForCausalLM` | PLaMo2 | `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc. | | | |
| `Plamo2ForCausalLM` | PLaMo2 | `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc. | | ✅︎ | |
| `QWenLMHeadModel` | Qwen | `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2MoeForCausalLM` | Qwen2MoE | `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ |
@ -419,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding
Specified using `--task embed`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
!!! note
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
@ -459,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding
#### Reward Modeling
Specified using `--task reward`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using
[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
@ -477,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the
#### Classification
Specified using `--task classify`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring
Specified using `--task score`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
| Architecture | Models | Example HF Models | [V1](gh-issue:8779) |
|--------------|--------|-------------------|---------------------|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
!!! note
Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
@ -574,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
Specified using `--task generate`.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
@ -592,6 +595,7 @@ Specified using `--task generate`.
| `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ |
| `H2OVLChatModel` | H2OVL | T + I<sup>E+</sup> | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ | ✅︎ |
| `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3`, etc. | ✅︎ | | ✅︎ |
| `InternS1ForConditionalGeneration` | Intern-S1 | T + I<sup>E+</sup> + V<sup>E+</sup> | `internlm/Intern-S1`, etc. | | ✅︎ | ✅︎ |
| `InternVLChatModel` | InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0 | T + I<sup>E+</sup> + (V<sup>E+</sup>) | `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `KeyeForConditionalGeneration` | Keye-VL-8B-Preview | T + I<sup>E+</sup> + V<sup>E+</sup> | `Kwai-Keye/Keye-VL-8B-Preview` | | | ✅︎ |
| `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I<sup>+</sup> | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | | | ✅︎ |
@ -612,6 +616,7 @@ Specified using `--task generate`.
| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + I<sup>E</sup> | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | | ✅︎ | ⚠️ |
| `Phi3VForCausalLM` | Phi-3-Vision, Phi-3.5-Vision | T + I<sup>E+</sup> | `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc. | | ✅︎ | ✅︎ |
| `Phi4MMForCausalLM` | Phi-4-multimodal | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Phi4MultimodalForCausalLM` | Phi-4-multimodal (HF Transformers) | T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup> | `microsoft/Phi-4-multimodal-instruct` (with revision `refs/pr/70`), etc. | ✅︎ | ✅︎ | ✅︎ |
| `PixtralForConditionalGeneration` | Mistral 3 (Mistral format), Pixtral (Mistral format) | T + I<sup>+</sup> | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistralai/Pixtral-12B-2409`, etc. | | ✅︎ | ✅︎ |
| `QwenVLForConditionalGeneration`<sup>^</sup> | Qwen-VL | T + I<sup>E+</sup> | `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2AudioForConditionalGeneration` | Qwen2-Audio | T + A<sup>+</sup> | `Qwen/Qwen2-Audio-7B-Instruct` | | ✅︎ | ✅︎ |
@ -623,10 +628,16 @@ Specified using `--task generate`.
| `TarsierForConditionalGeneration` | Tarsier | T + I<sup>E+</sup> | `omni-search/Tarsier-7b`, `omni-search/Tarsier-34b` | | ✅︎ | ✅︎ |
| `Tarsier2ForConditionalGeneration`<sup>^</sup> | Tarsier2 | T + I<sup>E+</sup> + V<sup>E+</sup> | `omni-research/Tarsier2-Recap-7b`, `omni-research/Tarsier2-7b-0115` | | ✅︎ | ✅︎ |
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
Some models are supported only via the [Transformers backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|-----------------------------|-----------------------------------------|---------------------|
| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ | ✅︎ |
<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
<sup>E</sup> Pre-computed embeddings can be inputted for this modality.
<sup>+</sup> Multiple items can be inputted per text prompt for this modality.
!!! warning
@ -696,8 +707,6 @@ Specified using `--task generate`.
#### Transcription
Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
@ -710,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding
Specified using `--task embed`.
Any text generation model can be converted into an embedding model by passing `--task embed`.
!!! note
To get the best results, you should use pooling models that are specifically trained as such.
@ -725,19 +730,24 @@ The following table lists those that are tested in vLLM.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
---
#### Scoring
Specified using `--task score`.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
## Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here's how we manage third-party model support:

View File

@ -1,31 +1,38 @@
# Distributed Inference and Serving
# Distributed inference and serving
## How to decide the distributed inference strategy?
## Distributed inference strategies for a single-model replica
Before going into the details of distributed inference and serving, let's first make it clear when to use distributed inference and what strategies are available. The common practice is:
To choose a distributed inference strategy for a single-model replica, use the following guidelines:
- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
- **Single GPU (no distributed inference):** if the model fits on a single GPU, distributed inference is probably unnecessary. Run inference on that GPU.
- **Single-node multi-GPU using tensor parallel inference:** if the model is too large for a single GPU but fits on a single node with multiple GPUs, use *tensor parallelism*. For example, set `tensor_parallel_size=4` when using a node with 4 GPUs.
- **Multi-node multi-GPU using tensor parallel and pipeline parallel inference:** if the model is too large for a single node, combine *tensor parallelism* with *pipeline parallelism*. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes. For example, set `tensor_parallel_size=8` and `pipeline_parallel_size=2` when using 2 nodes with 8 GPUs per node.
In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
Increase the number of GPUs and nodes until there is enough GPU memory for the model. Set `tensor_parallel_size` to the number of GPUs per node and `pipeline_parallel_size` to the number of nodes.
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
After you provision sufficient resources to fit the model, run `vllm`. Look for log messages like:
!!! note
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
```text
INFO 07-23 13:56:04 [kv_cache_utils.py:775] GPU KV cache size: 643,232 tokens
INFO 07-23 13:56:04 [kv_cache_utils.py:779] Maximum concurrency for 40,960 tokens per request: 15.70x
```
### Distributed serving of MoE (Mixture of Experts) models
The `GPU KV cache size` line reports the total number of tokens that can be stored in the GPU KV cache at once. The `Maximum concurrency` line provides an estimate of how many requests can be served concurrently if each request requires the specified number of tokens (40,960 in the example above). The tokens-per-request number is taken from the model configuration's maximum sequence length, `ModelConfig.max_model_len`. If these numbers are lower than your throughput requirements, add more GPUs or nodes to your cluster.
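The concurrency estimate is simply the cache size divided by the per-request token budget. A quick sanity check with the numbers from the log above:
```python
# Reproduce the estimate printed in the log lines above.
gpu_kv_cache_tokens = 643_232   # "GPU KV cache size" from the log
max_model_len = 40_960          # ModelConfig.max_model_len

max_concurrency = gpu_kv_cache_tokens / max_model_len
print(f"Maximum concurrency: {max_concurrency:.2f}x")  # -> 15.70x
```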
It is often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. See the page on [Data Parallel Deployment](data_parallel_deployment.md) for more information.
!!! note "Edge case: uneven GPU splits"
If the model fits within a single node but the GPU count doesn't evenly divide the model size, enable pipeline parallelism, which splits the model along layers and supports uneven splits. In this scenario, set `tensor_parallel_size=1` and `pipeline_parallel_size` to the number of GPUs. Furthermore, if the GPUs on the node do not have NVLINK interconnect (e.g. L40S), leverage pipeline parallelism instead of tensor parallelism for higher throughput and lower communication overhead.
## Running vLLM on a single node
### Distributed serving of *Mixture of Experts* (*MoE*) models
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. Currently, we support [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf). We manage the distributed runtime with either [Ray](https://github.com/ray-project/ray) or Python's native multiprocessing. Multiprocessing can be used when deploying on a single node; multi-node inference currently requires Ray.
It's often advantageous to exploit the inherent parallelism of experts by using a separate parallelism strategy for the expert layers. vLLM supports large-scale deployment combining Data Parallel attention with Expert or Tensor Parallel MoE layers. For more information, see [Data Parallel Deployment](data_parallel_deployment.md).
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured `tensor_parallel_size`, otherwise Ray will be used. This default can be overridden via the `LLM` class `distributed_executor_backend` argument or `--distributed-executor-backend` API server argument. Set it to `mp` for multiprocessing or `ray` for Ray. It's not required for Ray to be installed for the multiprocessing case.
## Single-node deployment
To run multi-GPU inference with the `LLM` class, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
vLLM supports distributed tensor-parallel and pipeline-parallel inference and serving. The implementation includes [Megatron-LM's tensor parallel algorithm](https://arxiv.org/pdf/1909.08053.pdf).
The default distributed runtimes are [Ray](https://github.com/ray-project/ray) for multi-node inference and native Python `multiprocessing` for single-node inference. You can override the defaults by setting `distributed_executor_backend` in the `LLM` class or `--distributed-executor-backend` in the API server. Use `mp` for `multiprocessing` or `ray` for Ray.
For multi-GPU inference, set `tensor_parallel_size` in the `LLM` class to the desired GPU count. For example, to run inference on 4 GPUs:
```python
from vllm import LLM
@ -33,84 +40,96 @@ llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```
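To pin the executor backend described above explicitly, the same call can set `distributed_executor_backend`; a minimal sketch (`"mp"` forces multiprocessing, `"ray"` forces Ray):
```python
from vllm import LLM

# Force the multiprocessing backend for single-node tensor parallelism;
# pass "ray" instead to run on a Ray cluster.
llm = LLM(
    "facebook/opt-13b",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
)

output = llm.generate("San Francisco is a")
```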
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run API server on 4 GPUs:
For multi-GPU serving, include `--tensor-parallel-size` when starting the server. For example, to run the API server on 4 GPUs:
```bash
vllm serve facebook/opt-13b \
--tensor-parallel-size 4
```
You can also additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:
To enable pipeline parallelism, add `--pipeline-parallel-size`. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
```bash
# Eight GPUs total
vllm serve gpt2 \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
```
## Running vLLM on multiple nodes
## Multi-node deployment
If a single node does not have enough GPUs to hold the model, you can run the model using multiple nodes. It is important to make sure the execution environment is the same on all nodes, including the model path, the Python environment. The recommended way is to use docker images to ensure the same environment, and hide the heterogeneity of the host machines via mapping them into the same docker configuration.
If a single node lacks sufficient GPUs to hold the model, deploy vLLM across multiple nodes. Multi-node deployments require Ray as the runtime engine. Ensure that every node provides an identical execution environment, including the model path and Python packages. Using container images is recommended because they provide a convenient way to keep environments consistent and to hide host heterogeneity.
The first step is to start containers and organize them into a cluster. We have provided the helper script <gh-file:examples/online_serving/run_cluster.sh> to start the cluster. Please note, this script launches docker without the administrative privileges that would be required to access GPU performance counters when running profiling and tracing tools. For that purpose, the script can add `CAP_SYS_ADMIN` to the docker container by using the `--cap-add` option in the docker run command.
### Ray cluster setup with containers
Pick a node as the head node, and run the following command:
The helper script <gh-file:examples/online_serving/run_cluster.sh> starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
Choose one node as the head node and run:
```bash
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
<HEAD_NODE_IP> \
--head \
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
-e VLLM_HOST_IP=<HEAD_NODE_IP>
```
On the rest of the worker nodes, run the following command:
On each worker node, run:
```bash
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
<HEAD_NODE_IP> \
--worker \
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
-e VLLM_HOST_IP=<WORKER_NODE_IP>
```
Then you get a ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
Note that `VLLM_HOST_IP` is unique for each worker. Keep the shells running these commands open; closing any shell terminates the cluster. Ensure that all nodes can communicate with each other through their IP addresses.
!!! warning
It is considered best practice to set `VLLM_HOST_IP` to an address on a private network segment for the vLLM cluster. The traffic sent here is not encrypted. The endpoints are also exchanging data in a format that could be exploited to execute arbitrary code should a malicious party gain access to the network. Please ensure that this network is not reachable by any untrusted parties.
!!! warning "Network security"
For security, set `VLLM_HOST_IP` to an address on a private network segment. Traffic sent over this network is unencrypted, and the endpoints exchange data in a format that can be exploited to execute arbitrary code if an adversary gains network access. Ensure that untrusted parties cannot reach the network.
!!! warning
Since this is a ray cluster of **containers**, all the following commands should be executed in the **containers**, otherwise you are executing the commands on the host machine, which is not connected to the ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
From any node, enter a container and run `ray status` and `ray list nodes` to verify that Ray finds the expected number of nodes and GPUs.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` and `ray list nodes` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
!!! tip
Alternatively, set up the Ray cluster using KubeRay. For more information, see [KubeRay vLLM documentation](https://docs.ray.io/en/latest/cluster/kubernetes/examples/vllm-rayservice.html).
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, just as you have all the GPUs on one node: vLLM will be able to leverage GPU resources of all nodes in the Ray cluster, and therefore, only run the `vllm` command on this node but not other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
### Running vLLM on a Ray cluster
!!! tip
If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it <container_name> /bin/bash`.
Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient.
The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs across 2 nodes (8 GPUs per node), set the tensor parallel size to 8 and the pipeline parallel size to 2:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
```
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
Alternatively, you can set `tensor_parallel_size` to the total number of GPUs in the cluster:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 16
```
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the InfiniBand is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses InfiniBand with GPUDirect RDMA, which is efficient.
## Troubleshooting distributed deployments
### GPUDirect RDMA
To make tensor parallelism performant, ensure that communication between nodes is efficient, for example, by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Contact your system administrator for more information about the required flags. One way to confirm if InfiniBand is working is to run `vllm` with the `NCCL_DEBUG=TRACE` environment variable set, for example `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL uses InfiniBand with GPUDirect RDMA, which is efficient.
To enable GPUDirect RDMA with vLLM, specific configuration tweaks are needed. This setup ensures:
## Enabling GPUDirect RDMA
- `IPC_LOCK` Security Context: Add the `IPC_LOCK` capability to the containers security context to lock memory pages and prevent swapping to disk.
- Shared Memory with `/dev/shm`: Mount `/dev/shm` in the pod spec to provide shared memory for IPC.
To enable GPUDirect RDMA with vLLM, configure the following settings:
When using Docker, you can set up the container as follows:
- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk.
- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC).
If you use Docker, set up the container as follows:
```bash
docker run --gpus all \
@ -120,7 +139,7 @@ docker run --gpus all \
vllm/vllm-openai
```
When using Kubernetes, you can set up the pod spec as follows:
If you use Kubernetes, set up the pod spec as follows:
```yaml
...
@ -146,13 +165,21 @@ spec:
...
```
!!! warning
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. To enable InfiniBand, append flags such as `--privileged -e NCCL_IB_HCA=mlx5` to `run_cluster.sh`. For cluster-specific settings, consult your system administrator.
!!! warning
Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
To confirm InfiniBand operation, enable detailed NCCL logs:
When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
```bash
NCCL_DEBUG=TRACE vllm serve ...
```
!!! warning
If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` and `ray list nodes` to see the IP address used by Ray. See <gh-issue:7815> for more information.
Search the logs for the transport method. Entries containing `[send] via NET/Socket` indicate raw TCP sockets, which perform poorly for cross-node tensor parallelism. Entries containing `[send] via NET/IB/GDRDMA` indicate InfiniBand with GPUDirect RDMA, which provides high performance.
!!! tip "Verify inter-node GPU communication"
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to `run_cluster.sh`, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>.
!!! tip "Pre-download Hugging Face models"
If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=<TOKEN>` to `run_cluster.sh`.
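One way to pre-download a model to the same path on every node is with the `huggingface_hub` client (the target directory below is only an example):
```python
from huggingface_hub import snapshot_download

# Download once per node (or onto a shared filesystem) so that
# `vllm serve /models/opt-13b ...` never needs to reach the Hub at startup.
snapshot_download(
    repo_id="facebook/opt-13b",
    local_dir="/models/opt-13b",
)
```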
!!! tip
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>.

View File

@ -0,0 +1,244 @@
# Expert Parallel Deployment
vLLM supports Expert Parallelism (EP), which allows experts in Mixture-of-Experts (MoE) models to be deployed on separate GPUs, increasing locality, efficiency, and throughput overall.
EP is typically coupled with Data Parallelism (DP). While DP can be used independently of EP, EP is more efficient when used in conjunction with DP. You can read more about data parallelism [here](data_parallel_deployment.md).
## Prerequisites
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install UCX and NIXL following the [script](gh-file:tools/install_nixl.sh).
### Backend Selection Guide
vLLM provides three communication backends for EP:
| Backend | Use Case | Features | Best For |
|---------|----------|----------|----------|
| `pplx` | Single node | Chunked prefill support | Development, best for intra-node deployments |
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout | High-throughput scenarios, prefill-dominated workloads |
| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout | Low-latency scenarios, decode-dominated workloads |
## Single Node Deployment
!!! warning
EP is an experimental feature. Argument names and default values may change in the future.
### Configuration
Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as:
```
EP_SIZE = TP_SIZE × DP_SIZE
```
Where:
- `TP_SIZE`: Tensor parallel size (always 1 for now)
- `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically)
### Example Command
The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on an H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
```bash
# Single node EP deployment with pplx backend
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \ # Tensor parallelism across 1 GPU
--data-parallel-size 8 \ # Data parallelism across 8 processes
--enable-expert-parallel # Enable expert parallelism
```
## Multi-Node Deployment
For multi-node deployment, use the DeepEP communication kernel with one of two modes (see [Backend Selection Guide](#backend-selection-guide) above).
### Deployment Steps
1. **Run one command per node** - Each node requires its own launch command
2. **Configure networking** - Ensure proper IP addresses and port configurations
3. **Set node roles** - First node handles requests, additional nodes run in headless mode
### Example: 2-Node Deployment
The following example deploys `DeepSeek-V3-0324` across 2 nodes using `deepep_low_latency` mode:
```bash
# Node 1 (Primary - handles incoming requests)
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \ # TP size per node
--enable-expert-parallel \ # Enable EP
--data-parallel-size 16 \ # Total DP size across all nodes
--data-parallel-size-local 8 \ # Local DP size on this node (8 GPUs per node)
--data-parallel-address 192.168.1.100 \ # Replace with actual IP of Node 1
--data-parallel-rpc-port 13345 \ # RPC communication port; any port works as long as it is reachable by all nodes
--api-server-count=8 # Number of API servers for load handling (scaling this out to the total number of ranks is recommended)
# Node 2 (Secondary - headless mode, no API server)
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \ # TP size per node
--enable-expert-parallel \ # Enable EP
--data-parallel-size 16 \ # Total DP size across all nodes
--data-parallel-size-local 8 \ # Local DP size on this node
--data-parallel-start-rank 8 \ # Starting rank offset for this node
--data-parallel-address 192.168.1.100 \ # IP of primary node (Node 1)
--data-parallel-rpc-port 13345 \ # Same RPC port as primary
--headless # No API server, worker only
```
### Key Configuration Notes
- **Headless mode**: Secondary nodes run with `--headless` flag, meaning all client requests are handled by the primary node
- **Rank calculation**: `--data-parallel-start-rank` should equal the cumulative local DP size of previous nodes
- **Load scaling**: Adjust `--api-server-count` on the primary node to handle higher request loads
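As a quick illustration of the rank calculation above, for a hypothetical three-node cluster with 8 GPUs per node:
```python
# --data-parallel-start-rank for each node is the cumulative local DP size
# of all nodes that come before it.
local_dp_sizes = [8, 8, 8]  # --data-parallel-size-local on each node

start_rank = 0
for node, local_size in enumerate(local_dp_sizes):
    print(f"node {node}: --data-parallel-start-rank {start_rank}")
    start_rank += local_size
# node 0: 0  (the primary node, where the flag can be omitted)
# node 1: 8
# node 2: 16
```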
### Network Configuration
!!! important "InfiniBand Clusters"
On InfiniBand networked clusters, set this environment variable to prevent initialization hangs:
```bash
export GLOO_SOCKET_IFNAME=eth0
```
This ensures torch distributed group discovery uses Ethernet instead of InfiniBand for initial setup.
## Expert Parallel Load Balancer (EPLB)
While MoE models are typically trained so that each expert receives a similar number of tokens, in practice the distribution of tokens across experts can be highly skewed. vLLM provides an Expert Parallel Load Balancer (EPLB) to redistribute expert mappings across EP ranks, evening the load across experts.
### Configuration
Enable EPLB with the `--enable-eplb` flag.
!!! note "Model Support"
Currently only DeepSeek V3 architecture is supported.
When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
### EPLB Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--eplb-window-size` | Number of engine steps to track for rebalancing decisions | - |
| `--eplb-step-interval` | Frequency of rebalancing (every N engine steps) | - |
| `--eplb-log-balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `--num-redundant-experts` | Additional global experts per EP rank beyond equal distribution | `0` |
### Expert Distribution Formula
- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
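For example, with illustrative numbers (256 routed experts spread over 16 EP ranks, 32 redundant experts):
```python
# Per-rank expert counts from the formulas above (numbers are illustrative).
num_total_experts = 256      # routed experts in the model
num_ep_ranks = 16            # EP_SIZE = TP_SIZE x DP_SIZE
num_redundant_experts = 32   # --num-redundant-experts

default_per_rank = num_total_experts // num_ep_ranks
redundant_per_rank = (num_total_experts + num_redundant_experts) // num_ep_ranks
print(default_per_rank)      # 16 experts per EP rank
print(redundant_per_rank)    # 18 experts per EP rank
```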
### Example Command
Single node deployment with EPLB enabled:
```bash
# Single node with EPLB load balancing
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \ # Tensor parallelism
--data-parallel-size 8 \ # Data parallelism
--enable-expert-parallel \ # Enable EP
--enable-eplb \ # Enable load balancer
--eplb-log-balancedness \ # Log balancing metrics
--eplb-window-size 1000 \ # Track last 1000 engine steps
--eplb-step-interval 3000 # Rebalance every 3000 steps
```
For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `--num-redundant-experts` to 32 in large scale use cases so the most popular experts are always available.
## Disaggregated Serving (Prefill/Decode Split)
For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.
### Architecture Overview
- **Prefill Instance**: Uses `deepep_high_throughput` backend for optimal prefill performance
- **Decode Instance**: Uses `deepep_low_latency` backend for minimal decode latency
- **KV Cache Transfer**: Connects instances via NIXL or other KV connectors
### Setup Steps
1. **Install KV Connector**: Install NIXL using the [installation script](gh-file:tools/install_nixl.sh)
2. **Configure Both Instances**: Add this flag to both prefill and decode instances: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`
3. **Client Orchestration**: Use the client-side script below to coordinate prefill/decode operations. We are actively working on routing solutions.
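For orientation, a minimal launch sketch for the two instances could look like the following; the model, ports, and hosts are illustrative and match the client script below:

```bash
# Prefill instance (throughput-optimized all2all backend), served on port 8000
VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --port 8000 \
    --enable-expert-parallel \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# Decode instance (latency-optimized all2all backend), served on port 8001
VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --port 8001 \
    --enable-expert-parallel \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```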
### Client Orchestration Example
```python
from openai import OpenAI
import uuid

try:
    # 1: Set up clients for prefill and decode instances
    openai_api_key = "EMPTY"  # vLLM doesn't require a real API key

    # Replace these IP addresses with your actual instance addresses
    prefill_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.100:8000/v1",  # Prefill instance URL
    )
    decode_client = OpenAI(
        api_key=openai_api_key,
        base_url="http://192.168.1.101:8001/v1",  # Decode instance URL
    )

    # Get model name from prefill instance
    models = prefill_client.models.list()
    model = models.data[0].id
    print(f"Using model: {model}")

    # 2: Prefill Phase
    # Generate a unique request ID to link prefill and decode operations
    request_id = str(uuid.uuid4())
    print(f"Request ID: {request_id}")

    prefill_response = prefill_client.completions.create(
        model=model,
        # Prompt must exceed vLLM's block size (16 tokens) for PD to work
        prompt="Write a detailed explanation of how Paged Attention for Transformers works, including the management of KV cache for multi-turn conversations",
        max_tokens=1,  # Force prefill-only operation
        extra_body={
            "kv_transfer_params": {
                "do_remote_decode": True,    # Enable remote decode
                "do_remote_prefill": False,  # This is the prefill instance
                "remote_engine_id": None,    # Will be populated by vLLM
                "remote_block_ids": None,    # Will be populated by vLLM
                "remote_host": None,         # Will be populated by vLLM
                "remote_port": None,         # Will be populated by vLLM
            }
        },
        extra_headers={"X-Request-Id": request_id},
    )
    print("-" * 50)
    print("✓ Prefill completed successfully")
    print(f"Prefill response: {prefill_response.choices[0].text}")

    # 3: Decode Phase
    # Transfer KV cache parameters from prefill to decode instance
    decode_response = decode_client.completions.create(
        model=model,
        prompt="This prompt is ignored during decode",  # Original prompt not needed
        max_tokens=150,  # Generate up to 150 tokens
        extra_body={
            "kv_transfer_params": prefill_response.kv_transfer_params  # Pass KV cache info
        },
        extra_headers={"X-Request-Id": request_id},  # Same request ID
    )
    print("-" * 50)
    print("✓ Decode completed successfully")
    print(f"Final response: {decode_response.choices[0].text}")

except Exception as e:
    print(f"❌ Error during disaggregated serving: {e}")
    print("Check that both prefill and decode instances are running and accessible")
```


@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
We currently support the following OpenAI APIs:
- [Completions API][completions-api] (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
- Only applicable to [text generation models](../models/generative_models.md).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- Only applicable to [embedding models](../models/pooling_models.md).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translation API][translations-api] (`/v1/audio/translations`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
In addition, we have the following custom APIs:
@ -64,14 +64,14 @@ In addition, we have the following custom APIs:
- [Pooling API][pooling-api] (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API][score-api] (`/score`)
- Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- Only applicable to [cross-encoder models](../models/pooling_models.md).
[](){ #chat-template }
@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
To serve the model:
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec.jinja
```
!!! important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
To serve the model:
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
```
!!! important
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>


@ -2,10 +2,14 @@
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.
vLLM can be used to generate the completions for RLHF. The best way to do this is with libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl).
vLLM can be used to generate the completions for RLHF. Some ways to do this include using libraries like [TRL](https://github.com/huggingface/trl), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [verl](https://github.com/volcengine/verl) and [unsloth](https://github.com/unslothai/unsloth).
See the following basic examples to get started if you don't want to use an existing library:
- [Training and inference processes are located on separate GPUs (inspired by OpenRLHF)](../examples/offline_inference/rlhf.md)
- [Training and inference processes are colocated on the same GPUs using Ray](../examples/offline_inference/rlhf_colocate.md)
- [Utilities for performing RLHF with vLLM](../examples/offline_inference/rlhf_utils.md)
See the following notebooks showing how to use vLLM for GRPO:
- [Qwen-3 4B GRPO using Unsloth + vLLM](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)


@ -190,6 +190,37 @@ def run_phi4mm(question: str, audio_count: int) -> ModelRequestData:
    )
def run_phi4_multimodal(question: str, audio_count: int) -> ModelRequestData:
    """
    Phi-4-multimodal-instruct supports both image and audio inputs. Here, we
    show how to process audio inputs.
    """
    model_path = snapshot_download(
        "microsoft/Phi-4-multimodal-instruct", revision="refs/pr/70"
    )
    # Since the vision-lora and speech-lora co-exist with the base model,
    # we have to manually specify the path of the lora weights.
    speech_lora_path = os.path.join(model_path, "speech-lora")
    placeholders = "<|audio|>" * audio_count
    prompts = f"<|user|>{placeholders}{question}<|end|><|assistant|>"
    engine_args = EngineArgs(
        model=model_path,
        max_model_len=12800,
        max_num_seqs=2,
        enable_lora=True,
        max_lora_rank=320,
        limit_mm_per_prompt={"audio": audio_count},
    )
    return ModelRequestData(
        engine_args=engine_args,
        prompt=prompts,
        lora_requests=[LoRARequest("speech", 1, speech_lora_path)],
    )
# Qwen2-Audio
def run_qwen2_audio(question: str, audio_count: int) -> ModelRequestData:
    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
@ -303,6 +334,7 @@ model_example_map = {
"granite_speech": run_granite_speech,
"minicpmo": run_minicpmo,
"phi4_mm": run_phi4mm,
"phi4_multimodal": run_phi4_multimodal,
"qwen2_audio": run_qwen2_audio,
"qwen2_5_omni": run_qwen2_5_omni,
"ultravox": run_ultravox,


@ -12,7 +12,9 @@ def parse_args():
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True
        model="jason9693/Qwen2.5-1.5B-apeach",
        runner="pooling",
        enforce_eager=True,
    )
    return parser.parse_args()
@ -27,7 +29,7 @@ def main(args: Namespace):
    ]
    # Create an LLM.
    # You should pass task="classify" for classification models
    # You should pass runner="pooling" for classification models
    llm = LLM(**vars(args))
    # Generate logits. The output is a list of ClassificationRequestOutputs.


@ -13,7 +13,7 @@ def parse_args():
    # Set example specific arguments
    parser.set_defaults(
        model="intfloat/e5-mistral-7b-instruct",
        task="embed",
        runner="pooling",
        enforce_eager=True,
        max_model_len=1024,
    )
@ -30,7 +30,7 @@ def main(args: Namespace):
    ]
    # Create an LLM.
    # You should pass task="embed" for embedding models
    # You should pass runner="pooling" for embedding models
    llm = LLM(**vars(args))
    # Generate embedding. The output is a list of EmbeddingRequestOutputs.


@ -12,7 +12,9 @@ def parse_args():
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
        model="BAAI/bge-reranker-v2-m3",
        runner="pooling",
        enforce_eager=True,
    )
    return parser.parse_args()
@ -26,7 +28,7 @@ def main(args: Namespace):
    ]
    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    # You should pass runner="pooling" for cross-encoder models
    llm = LLM(**vars(args))
    # Generate scores. The output is a list of ScoringRequestOutputs.

Some files were not shown because too many files have changed in this diff.