Compare commits


79 Commits

Author SHA1 Message Date
bfff9bcd1d [V1] TPU - Remove self.kv_caches 2025-03-05 20:42:05 +00:00
257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-05 14:18:55 +00:00
47d4a7e004 Small update for external_launcher backend docs (#14288) 2025-03-05 21:30:00 +08:00
7f89a594dd [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-05 12:29:50 +00:00
961644e6a8 [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217)
Signed-off-by: Iacopo Poli <iacopo@lighton.ai>
2025-03-05 11:44:10 +00:00
8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 08:49:44 +00:00
ec79b67c77 [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-05 07:37:16 +00:00
32985bed7c [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-03-05 06:30:40 +00:00
dae9ec464c Temporarily disable test_awq_gemm_opcheck (#14251)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 06:10:35 +00:00
6eaf93020d [platforms] improve rocm debugging info (#14257) 2025-03-04 21:32:18 -08:00
72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) 2025-03-04 21:27:26 -08:00
0a995d5434 [Model] New model support for Phi-4-multimodal-instruct (#14119) 2025-03-04 20:57:01 -08:00
ade3f7d988 [V1][Bugfix] Do not reset prefix caching metrics (#14235) 2025-03-05 04:39:13 +00:00
0df25101d6 [Bugfix] Fix gptq_marlin for deepseek-v3 (#13750)
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com>
Co-authored-by: dangshunya <dangshunya@baichuan-inc.com>
2025-03-05 12:25:53 +08:00
e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 12:25:24 +08:00
5b143d33be Moved numba from common requirements to cuda/rocm specific requirements (#14199)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-03-05 12:25:00 +08:00
eb59b5a6cb [misc] announce china meetup (#14248)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-05 10:33:50 +08:00
fbfc3ee37e [V1][TPU] TPU multimodal model support for ragged attention (#14158)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-03-04 19:58:48 -05:00
3e1d223626 [ROCm] Disable a few more kernel tests that are broken on ROCm (#14145)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-04 23:37:55 +00:00
4f5b059f14 Clean up unused padding_idx variables across many model definitions (#13240)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-04 21:27:00 +00:00
288ca110f6 [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (#14228)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-03-04 21:10:32 +00:00
c2bd2196fc [v1][Metrics] Add design doc (#12745)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-04 20:36:55 +00:00
550c7ba3dc [Docs] Update Dockerfile dependency image (#14215)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-04 20:22:11 +00:00
e5b2f1601a [Frontend] Do prompt_logprobs clamping for chat as well as completions (#14225)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-04 20:13:06 +00:00
9badee53de Fix performance when --generation-config is not None (#14223)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-04 20:59:22 +01:00
beebf4742a [TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-04 14:40:06 -05:00
f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) 2025-03-04 07:55:07 -08:00
b3cf368d79 [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (#14161) 2025-03-04 15:43:59 +00:00
c8525f06fc [V0][Metrics] Deprecate some questionable request time metrics (#14135)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-04 15:11:33 +00:00
5db6b2c961 [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-04 15:06:47 +00:00
6247bae6c6 [Bugfix] Restrict MacOS CPU detection (#14210)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-04 22:25:27 +08:00
3610fb4930 [doc] add "Failed to infer device type" to faq (#14200)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 20:47:06 +08:00
71c4b40562 [sleep mode] error out with expandable_segments (#14189)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 18:54:19 +08:00
ac65bc92df [platform] add debug logging during inferring the device type (#14195)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 18:39:16 +08:00
f78c0be80a Fix benchmark_moe.py tuning for CUDA devices (#14164) 2025-03-03 21:11:03 -08:00
66233af7b6 Use math.prod instead of np.prod for trivial ops (#14142) 2025-03-03 21:09:22 -08:00
bf13d40972 [core] Pass all driver env vars to ray workers unless excluded (#14099)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-04 11:44:17 +08:00
989f4f430c [Misc] Remove lru_cache in NvmlCudaPlatform (#14156)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-04 11:09:34 +08:00
bb5b640359 [core] moe fp8 block quant tuning support (#14068)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-03-04 01:30:23 +00:00
c060b71408 [Model] Add support for GraniteMoeShared models (#13313)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-04 08:04:52 +08:00
79e4937c65 [v1] Add comments to the new ragged paged attention Pallas kernel (#14155)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-03-03 23:00:55 +00:00
cd1d3c3df8 [Docs] Add GPTQModel (#14056)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-03 21:59:09 +00:00
19d98e0c7d [Kernel] Optimize moe intermediate_cache usage (#13625)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-03 16:29:53 -05:00
2b04c209ee [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-03 14:20:24 -07:00
ae122b1cbd [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 19:04:45 +00:00
872db2be0e [V1] Simplify stats logging (#14082)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-03 10:34:14 -08:00
2dfdfed8a0 [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 18:25:46 +00:00
c41d27156b [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 17:50:22 +00:00
91373a0d15 Fix head_dim not existing in all model configs (Transformers backend) (#14141)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-03 17:48:11 +00:00
848a6438ae [ROCm] Faster Custom Paged Attention kernels (#12348) 2025-03-03 09:24:45 -08:00
98175b2816 Improve the docs for TransformersModel (#14147)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-03 17:03:05 +00:00
4167252eaf [V1] Refactor parallel sampling support (#13774)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 08:15:27 -08:00
f35f8e2242 [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-03 16:43:14 +08:00
b87c21fc89 [Misc][Platform] Move use allgather to platform (#14010)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-03 15:40:04 +08:00
e584b85afd [Misc] duplicate code in deepseek_v2 (#14106) 2025-03-03 14:10:11 +08:00
09e56f9262 [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051) 2025-03-02 17:35:01 -08:00
cf069aa8aa Update deprecated Python 3.8 typing (#13971) 2025-03-02 17:34:51 -08:00
bf33700ecd [v0][structured output] Support reasoning output (#12955)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-02 14:49:42 -05:00
bc6ccb9878 [Doc] Source building add clone step (#14086)
Signed-off-by: qux-bbb <1147635419@qq.com>
2025-03-02 10:59:50 +00:00
82fbeae92b [Misc] Accurately capture the time of loading weights (#14063)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-01 17:20:30 -08:00
cc5e8f6db8 [Model] Add LoRA support for TransformersModel (#13770)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-02 09:17:34 +08:00
d54990da47 [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081) 2025-03-01 20:46:02 +00:00
b9f1d4294e [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073) 2025-03-01 08:25:54 +00:00
b28246f6ff [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-01 07:18:32 +00:00
3b5567a209 [V1][Minor] Do not print attn backend twice (#13985)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-01 07:09:14 +00:00
fdcc405346 [Doc] Consolidate whisper and florence2 examples (#14050) 2025-02-28 22:49:15 -08:00
8994dabc22 [Documentation] Add more deployment guide for Kubernetes deployment (#13841)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-03-01 06:44:24 +00:00
02296f420d [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053) 2025-02-28 22:31:01 -08:00
6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) 2025-02-28 22:30:59 -08:00
6a84164add [Bugfix] Add file lock for ModelScope download (#14060)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-01 06:10:28 +00:00
f64ffa8c25 [Docs] Add pipeline_parallel_size to optimization docs (#14059)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-01 05:43:54 +00:00
bd56c983d6 [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
Signed-off-by: luka <luka@neuralmagic.com>
2025-02-28 16:20:11 -07:00
084bbac8cc [core] Bump ray to 2.43 (#13994)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-28 21:47:44 +00:00
28943d36ce [v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-28 20:53:31 +00:00
b526ca6726 Add RELEASE.md (#13926)
Signed-off-by: atalman <atalman@fb.com>
2025-02-28 12:25:50 -08:00
e7bd944e08 [v1] Cleanup the BlockTable in InputBatch (#13977)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-02-28 19:03:16 +00:00
c3b6559a10 [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-28 11:01:36 -07:00
4be4b26cb7 Fix entrypoint tests for embedding models (#14052) 2025-02-28 08:56:44 -08:00
2aed2c9fa7 [Doc] Fix ROCm documentation (#14041)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-28 16:42:07 +00:00
474 changed files with 18084 additions and 5715 deletions

View File

@ -77,7 +77,6 @@ echo "Commands:$commands"
#ignore certain kernels tests
if [[ $commands == *" kernels "* ]]; then
commands="${commands} \
--ignore=kernels/test_attention.py \
--ignore=kernels/test_attention_selector.py \
--ignore=kernels/test_blocksparse_attention.py \
--ignore=kernels/test_causal_conv1d.py \
@ -94,7 +93,12 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_rand.py \
--ignore=kernels/test_sampler.py \
--ignore=kernels/test_cascade_flash_attn.py \
--ignore=kernels/test_mamba_mixer2.py"
--ignore=kernels/test_mamba_mixer2.py \
--ignore=kernels/test_aqlm.py \
--ignore=kernels/test_machete_mm.py \
--ignore=kernels/test_mha_attn.py \
--ignore=kernels/test_block_fp8.py \
--ignore=kernels/test_permute_cols.py"
fi
#ignore certain Entrypoints tests

View File

@ -275,7 +275,7 @@ steps:
source_file_dependencies:
- vllm/lora
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py --ignore=lora/test_transfomers_model.py
parallelism: 4
- label: PyTorch Fullgraph Smoke Test # 9min
@ -589,6 +589,7 @@ steps:
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_minicpmv_tp.py
- pytest -v -s -x lora/test_transfomers_model.py
- label: Weight Loading Multiple GPU Test # 33min

View File

@ -23,7 +23,7 @@ updates:
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "ray[cgraph]" # Ray Compiled Graph
- dependency-name: "lm-eval"
groups:
minor-update:

48
CMakeLists.txt Executable file → Normal file
View File

@ -31,7 +31,7 @@ set(ignoreMe "${VLLM_PYTHON_PATH}")
set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
# Supported NVIDIA architectures.
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0")
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101")
@ -297,7 +297,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Only build Marlin kernels if we are building for at least some compatible archs.
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)
set(MARLIN_SRCS
"csrc/quantization/fp8/fp8_marlin.cu"
@ -317,9 +317,25 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
" in CUDA target architectures")
endif()
# Only build AllSpark kernels if we are building for at least some compatible archs.
cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9" "${CUDA_ARCHS}")
if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND ALLSPARK_ARCHS)
set(ALLSPARK_SRCS
"csrc/quantization/gptq_allspark/allspark_repack.cu"
"csrc/quantization/gptq_allspark/allspark_qgemm_w8a16.cu")
set_gencode_flags_for_srcs(
SRCS "${ALLSPARK_SRCS}"
CUDA_ARCHS "${ALLSPARK_ARCHS}")
list(APPEND VLLM_EXT_SRC "${ALLSPARK_SRCS}")
message(STATUS "Building AllSpark kernels for archs: ${ALLSPARK_ARCHS}")
else()
message(STATUS "Not building AllSpark kernels as no compatible archs found"
" in CUDA target architectures, or CUDA not >= 12.0")
endif()
# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a;10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
@ -353,7 +369,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
# kernels for the remaining archs that are not already built for 3x.
cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
"7.5;8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
"7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
# subtract out the archs that are already built for 3x
list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
if (SCALED_MM_2X_ARCHS)
@ -378,7 +394,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# 2:4 Sparse Kernels
# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
# require CUDA 12.2 or later (and only work on Hopper and Blackwell).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
@ -403,8 +419,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND FP4_ARCHS)
set(SRCS
"csrc/quantization/fp4/nvfp4_quant_kernels.cu"
"csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu"
)
"csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${FP4_ARCHS}")
@ -417,6 +432,22 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
set(FP4_ARCHS)
endif()
# FP8 Blackwell Archs
cuda_archs_loose_intersection(BLACKWELL_ARCHS "10.0;10.1;12.0" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND BLACKWELL_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu"
)
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${BLACKWELL_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
message(STATUS "Building FP8 for archs: ${BLACKWELL_ARCHS}")
else()
# clear BLACKWELL_ARCHS
set(BLACKWELL_ARCHS)
endif()
#
# Machete kernels
@ -498,6 +529,7 @@ define_gpu_extension_target(
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
INCLUDE_DIRECTORIES ${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
USE_SABI 3
WITH_SOABI)
@ -521,7 +553,7 @@ set_gencode_flags_for_srcs(
CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
if (MARLIN_MOE_ARCHS)
set(MARLIN_MOE_SRC
"csrc/moe/marlin_kernels/marlin_moe_kernel.h"

View File

@ -15,7 +15,11 @@ Easy, fast, and cheap LLM serving for everyone
---
We are excited to invite you to our Menlo Park meetup with Meta, evening of Thursday, February 27! Meta engineers will discuss the improvements on top of vLLM, and vLLM contributors will share updates from the v0.7.x series of releases. [Register Now](https://lu.ma/h7g3kuj9)
We're excited to invite you to the first **vLLM China Meetup** on **March 16** in **Beijing**!
Join us to connect with the **vLLM team** and explore how vLLM is leveraged in **post-training, fine-tuning, and deployment**, including [verl](https://github.com/volcengine/verl), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), and [vllm-ascend](https://github.com/vllm-project/vllm-ascend).
👉 **[Register Now](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)** to be part of the discussion!
---

54
RELEASE.md Normal file
View File

@ -0,0 +1,54 @@
# Releasing vLLM
vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
## Release Versioning
vLLM uses a “right-shifted” versioning scheme in which a new patch release ships every two weeks, and patch releases contain features as well as bug fixes (unlike semver, where a patch release contains only backwards-compatible bug fixes). When critical fixes are needed, a special post1 release is published.
* _major_: major architectural milestones or incompatible API changes, similar to PyTorch 2.0.
* _minor_: major features.
* _patch_: features and backwards-compatible bug fixes.
* _post1_ or _patch-1_: backwards-compatible bug fixes, published as an explicit or implicit post release.
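For illustration only, a minimal sketch of how a version string decomposes under this right-shifted scheme, using the third-party `packaging` library (the library and the sample version are assumptions, not something RELEASE.md prescribes):

```python
from packaging.version import Version

v = Version("0.7.3.post1")
# Right-shifted scheme: 0 = major, 7 = minor, 3 = patch, post1 = post release.
print(v.major, v.minor, v.micro, v.post)  # -> 0 7 3 1
```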
## Release Cadence
Patch releases ship on a bi-weekly basis. A post release follows 1-3 days after the patch release and uses the same branch as the patch release.
The following is the release cadence for 2025. All future release dates below are tentative. Please note: post releases are optional.
| Release Date | Patch release versions | Post Release versions |
| --- | --- | --- |
| Jan 2025 | 0.7.0 | --- |
| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- |
| Mar 2025 | 0.7.4, 0.7.5 | --- |
| Apr 2025 | 0.7.6, 0.7.7 | --- |
| May 2025 | 0.7.8, 0.7.9 | --- |
| Jun 2025 | 0.7.10, 0.7.11 | --- |
| Jul 2025 | 0.7.12, 0.7.13 | --- |
| Aug 2025 | 0.7.14, 0.7.15 | --- |
| Sep 2025 | 0.7.16, 0.7.17 | --- |
| Oct 2025 | 0.7.18, 0.7.19 | --- |
| Nov 2025 | 0.7.20, 0.7.21 | --- |
| Dec 2025 | 0.7.22, 0.7.23 | --- |
## Release branch
Each release is built from a dedicated release branch.
* For _major_, _minor_, and _patch_ releases, the release branch cut is performed 1-2 days before the release goes live.
* For post releases, the previously cut release branch is reused.
* Release builds are triggered by a push to an RC tag such as vX.Y.Z-rc1. This enables us to build and test multiple RCs for each release.
* The final tag vX.Y.Z does not trigger a build; it is used for release notes and assets.
* After the branch cut is created, we monitor the main branch for any reverts and apply them to the release branch.
## Release Cherry-Pick Criteria
After the branch cut, we finalize the release branch with clear criteria on which cherry-picks are allowed. Note: a cherry-pick is the process of landing a PR in the release branch after the branch cut. Cherry-picks are typically limited to ensure the team has sufficient time to complete a thorough round of testing on a stable code base.
* Regression fixes that address functional or performance regressions against the most recent release (e.g. 0.7.0 for the 0.7.1 release)
* Critical fixes for severe issues such as silent incorrectness, backwards incompatibility, crashes, deadlocks, or (large) memory leaks
* Fixes to new features introduced in the most recent release (e.g. 0.7.0 for the 0.7.1 release)
* Documentation improvements
* Release-branch-specific changes (e.g. changing version identifiers or CI fixes)
Please note: **No feature work is allowed in cherry-picks**. All PRs considered for cherry-picking need to be merged on trunk; the only exception is release-branch-specific changes.

View File

@ -6,7 +6,7 @@ import sys
import time
import traceback
from dataclasses import dataclass, field
from typing import List, Optional, Union
from typing import Optional, Union
import aiohttp
import huggingface_hub.constants
@ -14,6 +14,8 @@ from tqdm.asyncio import tqdm
from transformers import (AutoTokenizer, PreTrainedTokenizer,
PreTrainedTokenizerFast)
from vllm.model_executor.model_loader.weight_utils import get_lock
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
@ -39,8 +41,8 @@ class RequestFuncOutput:
latency: float = 0.0
output_tokens: int = 0
ttft: float = 0.0 # Time to first token
itl: List[float] = field(
default_factory=list) # List of inter-token latencies
itl: list[float] = field(
default_factory=list) # list of inter-token latencies
tpot: float = 0.0 # avg next-token latencies
prompt_len: int = 0
error: str = ""
@ -430,12 +432,15 @@ def get_model(pretrained_model_name_or_path: str) -> str:
if os.getenv('VLLM_USE_MODELSCOPE', 'False').lower() == 'true':
from modelscope import snapshot_download
model_path = snapshot_download(
model_id=pretrained_model_name_or_path,
local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
ignore_file_pattern=[".*.pt", ".*.safetensors", ".*.bin"])
# Use file lock to prevent multiple processes from
# downloading the same model weights at the same time.
with get_lock(pretrained_model_name_or_path):
model_path = snapshot_download(
model_id=pretrained_model_name_or_path,
local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
ignore_file_pattern=[".*.pt", ".*.safetensors", ".*.bin"])
return model_path
return model_path
return pretrained_model_name_or_path
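The change above serializes ModelScope downloads behind vLLM's `get_lock()` helper. A minimal standalone sketch of the same idea using the `filelock` package (the function name, cache directory, and the stand-in directory creation are illustrative, not vLLM's actual logic):

```python
import os
from filelock import FileLock  # assumed available; vLLM's get_lock() builds on it

def locked_snapshot_dir(model_id: str, cache_dir: str = "/tmp/model-cache") -> str:
    """Ensure only one process materializes a given model directory at a time."""
    os.makedirs(cache_dir, exist_ok=True)
    safe_name = model_id.replace("/", "--")
    lock_path = os.path.join(cache_dir, safe_name + ".lock")
    target = os.path.join(cache_dir, safe_name)
    with FileLock(lock_path):
        # The real code calls modelscope's snapshot_download() while holding
        # the lock; this stand-in just creates the directory once.
        os.makedirs(target, exist_ok=True)
    return target
```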

View File

@ -6,7 +6,6 @@ import json
import os
import random
import time
from typing import List
import datasets
import pandas as pd
@ -39,7 +38,7 @@ class SampleRequest:
completion: str = None
def run_vllm(requests: List[SampleRequest],
def run_vllm(requests: list[SampleRequest],
engine_args: EngineArgs,
n: int,
guided_decoding_rate: float = 1.0,
@ -54,8 +53,8 @@ def run_vllm(requests: List[SampleRequest],
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[str] = []
sampling_params: List[SamplingParams] = []
prompts: list[str] = []
sampling_params: list[SamplingParams] = []
# create a list containing random selected true or false
guided_decoding_req_idx = random.sample(
range(len(requests)), int(len(requests) * guided_decoding_rate))
@ -110,7 +109,7 @@ def run_vllm(requests: List[SampleRequest],
async def run_vllm_async(
requests: List[SampleRequest],
requests: list[SampleRequest],
engine_args: AsyncEngineArgs,
n: int,
guided_decoding_rate: float = 1.0,
@ -129,8 +128,8 @@ async def run_vllm_async(
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[str] = []
sampling_params: List[SamplingParams] = []
prompts: list[str] = []
sampling_params: list[SamplingParams] = []
guided_decoding_req_idx = random.sample(
range(len(requests)), int(len(requests) * guided_decoding_rate))
@ -203,7 +202,7 @@ async def run_vllm_async(
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
args: argparse.Namespace) -> list[SampleRequest]:
if args.dataset == 'json':
if args.json_schema_path is None:
dir_path = os.path.dirname(os.path.realpath(__file__))
@ -287,7 +286,7 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
elif args.dataset == "xgrammar_bench":
args.warmup = False
requests: List[SampleRequest] = []
requests: list[SampleRequest] = []
dataset = datasets.load_dataset("NousResearch/json-mode-eval",
split="train")
print(f"dataset has {len(dataset)} entries")

View File

@ -7,7 +7,7 @@ import json
import os
import time
from pathlib import Path
from typing import Any, Dict, List, Optional
from typing import Any, Optional
import numpy as np
import torch
@ -22,7 +22,7 @@ from vllm.utils import FlexibleArgumentParser
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: Dict[str, Any]) -> None:
results: dict[str, Any]) -> None:
pt_records = convert_to_pytorch_benchmark_format(
args=args,
metrics={"latency": results["latencies"]},
@ -57,7 +57,7 @@ def main(args: argparse.Namespace):
dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size,
args.input_len))
dummy_prompts: List[PromptType] = [{
dummy_prompts: list[PromptType] = [{
"prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()]

View File

@ -31,7 +31,7 @@ import dataclasses
import json
import random
import time
from typing import List, Optional, Tuple
from typing import Optional
from transformers import PreTrainedTokenizerBase
@ -77,9 +77,9 @@ def sample_requests_from_dataset(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
input_length_range: Tuple[int, int],
input_length_range: tuple[int, int],
fixed_output_len: Optional[int],
) -> List[Request]:
) -> list[Request]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
@ -99,7 +99,7 @@ def sample_requests_from_dataset(
assert min_len >= 0 and max_len >= min_len, "input_length_range too small"
# Filter out sequences that are too long or too short
filtered_requests: List[Request] = []
filtered_requests: list[Request] = []
for i in range(len(dataset)):
if len(filtered_requests) == num_requests:
@ -122,10 +122,10 @@ def sample_requests_from_dataset(
def sample_requests_from_random(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
input_length_range: Tuple[int, int],
input_length_range: tuple[int, int],
fixed_output_len: Optional[int],
prefix_len: int,
) -> List[Request]:
) -> list[Request]:
requests = []
prefix_token_ids = sample_tokens(tokenizer, prefix_len)
@ -144,9 +144,9 @@ def sample_requests_from_random(
return requests
def repeat_and_sort_requests(requests: List[Request],
def repeat_and_sort_requests(requests: list[Request],
repeat_count: int,
sort: bool = False) -> List[str]:
sort: bool = False) -> list[str]:
repeated_requests = requests * repeat_count
if sort:
repeated_requests.sort(key=lambda x: x[1])

View File

@ -5,7 +5,7 @@ import dataclasses
import json
import random
import time
from typing import List, Optional, Tuple
from typing import Optional
from transformers import AutoTokenizer, PreTrainedTokenizerBase
@ -23,7 +23,7 @@ def sample_requests(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
) -> list[tuple[str, int, int]]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
@ -40,7 +40,7 @@ def sample_requests(
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
filtered_dataset: list[tuple[str, int, int]] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
break
@ -68,7 +68,7 @@ def sample_requests(
def run_vllm(
requests: List[Tuple[str, int, int]],
requests: list[tuple[str, int, int]],
n: int,
engine_args: EngineArgs,
) -> float:

View File

@ -33,9 +33,10 @@ import os
import random
import time
import warnings
from collections.abc import AsyncGenerator, Collection
from dataclasses import dataclass
from datetime import datetime
from typing import Any, AsyncGenerator, Collection, Dict, List, Optional, Tuple
from typing import Any, Optional
import numpy as np
import pandas as pd
@ -73,22 +74,22 @@ class BenchmarkMetrics:
mean_ttft_ms: float
median_ttft_ms: float
std_ttft_ms: float
percentiles_ttft_ms: List[Tuple[float, float]]
percentiles_ttft_ms: list[tuple[float, float]]
mean_tpot_ms: float
median_tpot_ms: float
std_tpot_ms: float
percentiles_tpot_ms: List[Tuple[float, float]]
percentiles_tpot_ms: list[tuple[float, float]]
mean_itl_ms: float
median_itl_ms: float
std_itl_ms: float
percentiles_itl_ms: List[Tuple[float, float]]
percentiles_itl_ms: list[tuple[float, float]]
# E2EL stands for end-to-end latency per request.
# It is the time taken on the client side from sending
# a request to receiving a complete response.
mean_e2el_ms: float
median_e2el_ms: float
std_e2el_ms: float
percentiles_e2el_ms: List[Tuple[float, float]]
percentiles_e2el_ms: list[tuple[float, float]]
def sample_sharegpt_requests(
@ -96,7 +97,7 @@ def sample_sharegpt_requests(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, int, int, None]]:
) -> list[tuple[str, int, int, None]]:
# Load the dataset.
with open(dataset_path, encoding='utf-8') as f:
dataset = json.load(f)
@ -110,7 +111,7 @@ def sample_sharegpt_requests(
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
filtered_dataset: list[tuple[str, int, int]] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
break
@ -139,7 +140,7 @@ def sample_burstgpt_requests(
num_requests: int,
random_seed: int,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, int, int, None]]:
) -> list[tuple[str, int, int, None]]:
df = pd.read_csv(dataset_path)
gpt4_df = df[df["Model"] == "GPT-4"]
# Remove the failed requests (i.e., response length is 0)
@ -170,7 +171,7 @@ def sample_sonnet_requests(
output_len: int,
prefix_len: int,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, str, int, int, None]]:
) -> list[tuple[str, str, int, int, None]]:
assert (
input_len > prefix_len
), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'."
@ -211,7 +212,7 @@ def sample_sonnet_requests(
prefix_lines = poem_lines[:num_prefix_lines]
# Sample the rest of lines per request.
sampled_requests: List[Tuple[str, int, int]] = []
sampled_requests: list[tuple[str, int, int]] = []
for _ in range(num_requests):
num_lines_needed = num_input_lines - num_prefix_lines
sampled_lines = "".join(prefix_lines +
@ -238,8 +239,8 @@ def sample_vision_arena_requests(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
sampled_requests: List[Tuple[str, int, int, Dict[str,
) -> list[tuple[str, str, int, Optional[dict[str, Collection[str]]]]]:
sampled_requests: list[tuple[str, int, int, dict[str,
Collection[str]]]] = []
for data in dataset:
if len(sampled_requests) == num_requests:
@ -285,7 +286,7 @@ def sample_hf_requests(
tokenizer: PreTrainedTokenizerBase,
random_seed: int,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
) -> list[tuple[str, str, int, Optional[dict[str, Collection[str]]]]]:
# Special case for vision_arena dataset
if dataset_path == 'lmarena-ai/vision-arena-bench-v0.1' \
@ -307,7 +308,7 @@ def sample_hf_requests(
"HF Dataset must have 'conversations' column.")
filter_func = lambda x: len(x["conversations"]) >= 2
filtered_dataset = dataset.shuffle(seed=random_seed).filter(filter_func)
sampled_requests: List[Tuple[str, int, int, Dict[str,
sampled_requests: list[tuple[str, int, int, dict[str,
Collection[str]]]] = []
for data in filtered_dataset:
if len(sampled_requests) == num_requests:
@ -370,7 +371,7 @@ def sample_random_requests(
num_prompts: int,
range_ratio: float,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, int, int]]:
) -> list[tuple[str, int, int]]:
prefix_token_ids = np.random.randint(0,
tokenizer.vocab_size,
size=prefix_len).tolist()
@ -399,10 +400,10 @@ def sample_random_requests(
async def get_request(
input_requests: List[Tuple[str, int, int]],
input_requests: list[tuple[str, int, int]],
request_rate: float,
burstiness: float = 1.0,
) -> AsyncGenerator[Tuple[str, int, int], None]:
) -> AsyncGenerator[tuple[str, int, int], None]:
"""
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness.
@ -443,23 +444,23 @@ async def get_request(
def calculate_metrics(
input_requests: List[Tuple[str, int, int]],
outputs: List[RequestFuncOutput],
input_requests: list[tuple[str, int, int]],
outputs: list[RequestFuncOutput],
dur_s: float,
tokenizer: PreTrainedTokenizerBase,
selected_percentile_metrics: List[str],
selected_percentiles: List[float],
goodput_config_dict: Dict[str, float],
) -> Tuple[BenchmarkMetrics, List[int]]:
actual_output_lens: List[int] = []
selected_percentile_metrics: list[str],
selected_percentiles: list[float],
goodput_config_dict: dict[str, float],
) -> tuple[BenchmarkMetrics, list[int]]:
actual_output_lens: list[int] = []
total_input = 0
completed = 0
good_completed = 0
itls: List[float] = []
tpots: List[float] = []
all_tpots: List[float] = []
ttfts: List[float] = []
e2els: List[float] = []
itls: list[float] = []
tpots: list[float] = []
all_tpots: list[float] = []
ttfts: list[float] = []
e2els: list[float] = []
for i in range(len(outputs)):
if outputs[i].success:
output_len = outputs[i].output_tokens
@ -557,19 +558,19 @@ async def benchmark(
model_id: str,
model_name: str,
tokenizer: PreTrainedTokenizerBase,
input_requests: List[Tuple[str, int, int]],
input_requests: list[tuple[str, int, int]],
logprobs: Optional[int],
best_of: int,
request_rate: float,
burstiness: float,
disable_tqdm: bool,
profile: bool,
selected_percentile_metrics: List[str],
selected_percentiles: List[str],
selected_percentile_metrics: list[str],
selected_percentiles: list[str],
ignore_eos: bool,
goodput_config_dict: Dict[str, float],
goodput_config_dict: dict[str, float],
max_concurrency: Optional[int],
lora_modules: Optional[List[str]],
lora_modules: Optional[list[str]],
):
if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend]
@ -652,7 +653,7 @@ async def benchmark(
pbar=pbar)
benchmark_start_time = time.perf_counter()
tasks: List[asyncio.Task] = []
tasks: list[asyncio.Task] = []
async for request in get_request(input_requests, request_rate, burstiness):
prompt, prompt_len, output_len, mm_content = request
req_model_id, req_model_name = model_id, model_name
@ -674,7 +675,7 @@ async def benchmark(
asyncio.create_task(
limited_request_func(request_func_input=request_func_input,
pbar=pbar)))
outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")
@ -820,7 +821,7 @@ def parse_goodput(slo_pairs):
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: Dict[str, Any],
results: dict[str, Any],
file_name: str) -> None:
metrics = [
"median_ttft_ms", "mean_ttft_ms", "std_ttft_ms", "p99_ttft_ms",
@ -974,7 +975,7 @@ def main(args: argparse.Namespace):
# Save config and results to json
if args.save_result:
result_json: Dict[str, Any] = {}
result_json: dict[str, Any] = {}
# Setup
current_dt = datetime.now().strftime("%Y%m%d-%H%M%S")

View File

@ -30,8 +30,9 @@ import os
import random
import time
import warnings
from collections.abc import AsyncGenerator
from dataclasses import dataclass
from typing import AsyncGenerator, Dict, List, Optional, Tuple
from typing import Optional
import datasets
import numpy as np
@ -66,22 +67,22 @@ class BenchmarkMetrics:
mean_ttft_ms: float
median_ttft_ms: float
std_ttft_ms: float
percentiles_ttft_ms: List[Tuple[float, float]]
percentiles_ttft_ms: list[tuple[float, float]]
mean_tpot_ms: float
median_tpot_ms: float
std_tpot_ms: float
percentiles_tpot_ms: List[Tuple[float, float]]
percentiles_tpot_ms: list[tuple[float, float]]
mean_itl_ms: float
median_itl_ms: float
std_itl_ms: float
percentiles_itl_ms: List[Tuple[float, float]]
percentiles_itl_ms: list[tuple[float, float]]
# E2EL stands for end-to-end latency per request.
# It is the time taken on the client side from sending
# a request to receiving a complete response.
mean_e2el_ms: float
median_e2el_ms: float
std_e2el_ms: float
percentiles_e2el_ms: List[Tuple[float, float]]
percentiles_e2el_ms: list[tuple[float, float]]
@dataclasses.dataclass
@ -104,7 +105,7 @@ class SampleRequest:
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
args: argparse.Namespace) -> list[SampleRequest]:
if args.dataset == 'json':
if args.json_schema_path is None:
dir_path = os.path.dirname(os.path.realpath(__file__))
@ -187,7 +188,7 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
]
elif args.dataset == "xgrammar_bench":
requests: List[SampleRequest] = []
requests: list[SampleRequest] = []
dataset = datasets.load_dataset("NousResearch/json-mode-eval",
split="train")
print(f"dataset has {len(dataset)} entries")
@ -214,10 +215,10 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
async def get_request(
input_requests: List[SampleRequest],
input_requests: list[SampleRequest],
request_rate: float,
burstiness: float = 1.0,
) -> AsyncGenerator[Tuple[int, SampleRequest], None]:
) -> AsyncGenerator[tuple[int, SampleRequest], None]:
"""
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness.
@ -258,23 +259,23 @@ async def get_request(
def calculate_metrics(
input_requests: List[Tuple[str, int, int]],
outputs: List[RequestFuncOutput],
input_requests: list[tuple[str, int, int]],
outputs: list[RequestFuncOutput],
dur_s: float,
tokenizer: PreTrainedTokenizerBase,
selected_percentile_metrics: List[str],
selected_percentiles: List[float],
goodput_config_dict: Optional[Dict[str, float]] = None,
) -> Tuple[BenchmarkMetrics, List[int]]:
actual_output_lens: List[int] = []
selected_percentile_metrics: list[str],
selected_percentiles: list[float],
goodput_config_dict: Optional[dict[str, float]] = None,
) -> tuple[BenchmarkMetrics, list[int]]:
actual_output_lens: list[int] = []
total_input = 0
completed = 0
good_completed = 0
itls: List[float] = []
tpots: List[float] = []
all_tpots: List[float] = []
ttfts: List[float] = []
e2els: List[float] = []
itls: list[float] = []
tpots: list[float] = []
all_tpots: list[float] = []
ttfts: list[float] = []
e2els: list[float] = []
for i in range(len(outputs)):
if outputs[i].success:
# We use the tokenizer to count the number of output tokens for all
@ -368,18 +369,18 @@ async def benchmark(
base_url: str,
model_id: str,
tokenizer: PreTrainedTokenizerBase,
input_requests: List[SampleRequest],
input_requests: list[SampleRequest],
request_rate: float,
burstiness: float,
disable_tqdm: bool,
profile: bool,
selected_percentile_metrics: List[str],
selected_percentiles: List[str],
selected_percentile_metrics: list[str],
selected_percentiles: list[str],
ignore_eos: bool,
max_concurrency: Optional[int],
guided_decoding_ratio: float,
guided_decoding_backend: str,
goodput_config_dict: Optional[Dict[str, float]] = None,
goodput_config_dict: Optional[dict[str, float]] = None,
):
if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend]
@ -459,8 +460,8 @@ async def benchmark(
pbar=pbar)
benchmark_start_time = time.perf_counter()
tasks: List[asyncio.Task] = []
expected: List[str] = []
tasks: list[asyncio.Task] = []
expected: list[str] = []
async for i, request in get_request(input_requests, request_rate,
burstiness):
extra_body = prepare_extra_body(
@ -479,7 +480,7 @@ async def benchmark(
asyncio.create_task(
limited_request_func(request_func_input=request_func_input,
pbar=pbar)))
outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")

View File

@ -7,7 +7,7 @@ import os
import random
import time
from functools import cache
from typing import Any, Dict, List, Optional, Tuple
from typing import Any, Optional
import torch
import uvloop
@ -74,12 +74,12 @@ def lora_path_on_disk(lora_path: str) -> str:
return get_adapter_absolute_path(lora_path)
lora_tokenizer_cache: Dict[int, AnyTokenizer] = {}
lora_tokenizer_cache: dict[int, AnyTokenizer] = {}
def get_random_lora_request(
args: argparse.Namespace
) -> Tuple[LoRARequest, Optional[AnyTokenizer]]:
) -> tuple[LoRARequest, Optional[AnyTokenizer]]:
global lora_tokenizer_cache
lora_id = random.randint(1, args.max_loras)
lora_request = LoRARequest(lora_name=str(lora_id),
@ -91,7 +91,7 @@ def get_random_lora_request(
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
args: argparse.Namespace) -> list[SampleRequest]:
dataset_path: str = args.dataset
num_requests: int = args.num_prompts
@ -109,7 +109,7 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[SampleRequest] = []
filtered_dataset: list[SampleRequest] = []
for data in tqdm(dataset,
total=len(filtered_dataset),
desc="sampling requests"):
@ -165,7 +165,7 @@ def sample_requests(tokenizer: PreTrainedTokenizerBase,
def run_vllm(
requests: List[SampleRequest],
requests: list[SampleRequest],
n: int,
engine_args: EngineArgs,
) -> float:
@ -178,8 +178,8 @@ def run_vllm(
"Please ensure that max_model_len is greater than the sum of"
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
prompts: list[TextPrompt] = []
sampling_params: list[SamplingParams] = []
for request in requests:
prompts.append(
TextPrompt(prompt=request.prompt,
@ -192,7 +192,7 @@ def run_vllm(
ignore_eos=True,
max_tokens=request.expected_output_len,
))
lora_requests: Optional[List[LoRARequest]] = None
lora_requests: Optional[list[LoRARequest]] = None
if engine_args.enable_lora:
lora_requests = [request.lora_request for request in requests]
@ -225,7 +225,7 @@ def run_vllm(
async def run_vllm_async(
requests: List[SampleRequest],
requests: list[SampleRequest],
n: int,
engine_args: AsyncEngineArgs,
disable_frontend_multiprocessing: bool = False,
@ -242,9 +242,9 @@ async def run_vllm_async(
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
lora_requests: List[Optional[LoRARequest]] = []
prompts: list[TextPrompt] = []
sampling_params: list[SamplingParams] = []
lora_requests: list[Optional[LoRARequest]] = []
for request in requests:
prompts.append(
TextPrompt(prompt=request.prompt,
@ -276,7 +276,7 @@ async def run_vllm_async(
def run_hf(
requests: List[SampleRequest],
requests: list[SampleRequest],
model: str,
tokenizer: PreTrainedTokenizerBase,
n: int,
@ -292,7 +292,7 @@ def run_hf(
pbar = tqdm(total=len(requests))
start = time.perf_counter()
batch: List[str] = []
batch: list[str] = []
max_prompt_len = 0
max_output_len = 0
for i in range(len(requests)):
@ -334,7 +334,7 @@ def run_hf(
def run_mii(
requests: List[SampleRequest],
requests: list[SampleRequest],
model: str,
tensor_parallel_size: int,
output_len: int,
@ -352,7 +352,7 @@ def run_mii(
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: Dict[str, Any]) -> None:
results: dict[str, Any]) -> None:
pt_records = convert_to_pytorch_benchmark_format(
args=args,
metrics={
@ -479,8 +479,8 @@ if __name__ == "__main__":
type=str,
default=None,
help="Path to the dataset. The dataset is expected to "
"be a json in form of List[Dict[..., conversations: "
"List[Dict[..., value: <prompt_or_response>]]]]")
"be a json in form of list[dict[..., conversations: "
"list[dict[..., value: <prompt_or_response>]]]]")
parser.add_argument("--input-len",
type=int,
default=None,

View File

@ -4,12 +4,12 @@ import argparse
import json
import math
import os
from typing import Any, Dict, List
from typing import Any
def convert_to_pytorch_benchmark_format(args: argparse.Namespace,
metrics: Dict[str, List],
extra_info: Dict[str, Any]) -> List:
metrics: dict[str, list],
extra_info: dict[str, Any]) -> list:
"""
Save the benchmark results in the format used by PyTorch OSS benchmark with
one metric per record
@ -64,6 +64,6 @@ class InfEncoder(json.JSONEncoder):
return super().iterencode(self.clear_inf(o), *args, **kwargs)
def write_to_json(filename: str, records: List) -> None:
def write_to_json(filename: str, records: list) -> None:
with open(filename, "w") as f:
json.dump(records, f, cls=InfEncoder)
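These files all follow the same typing cleanup applied across the benchmarks: `typing.List`/`Dict`/`Tuple` become the built-in generics available since Python 3.9, and abstract types such as `Iterable` move to `collections.abc`. A small before/after sketch (the function itself is a made-up example):

```python
from collections.abc import Iterable

# Before (Python 3.8 style):
#   from typing import Dict, Iterable, List
#   def bucket(latencies: Iterable[float]) -> Dict[str, List[float]]: ...

# After (Python 3.9+ built-in generics, abstract types from collections.abc):
def bucket(latencies: Iterable[float]) -> dict[str, list[float]]:
    out: dict[str, list[float]] = {"fast": [], "slow": []}
    for lat in latencies:
        out["fast" if lat < 0.1 else "slow"].append(lat)
    return out
```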

View File

@ -5,7 +5,8 @@ import copy
import itertools
import pickle as pkl
import time
from typing import Callable, Iterable, List, Tuple
from collections.abc import Iterable
from typing import Callable
import torch
import torch.utils.benchmark as TBenchmark
@ -228,7 +229,7 @@ def print_timers(timers: Iterable[TMeasurement]):
def run(dtype: torch.dtype,
MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
MKNs: Iterable[tuple[int, int, int]]) -> Iterable[TMeasurement]:
results = []
for m, k, n in MKNs:
timers = bench(dtype, m, k, n, f"scaled-{dtype}-gemm",
@ -241,7 +242,7 @@ def run(dtype: torch.dtype,
# output makers
def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]],
MKNs: Iterable[tuple[int, int, int]],
base_description: str,
timestamp=None):
print(f"== All Results {base_description} ====")
@ -282,7 +283,7 @@ def run_model_bench(args):
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> List[Tuple[int, int]]:
def model_shapes(model_name: str, tp_size: int) -> list[tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
# Cutlass bench utils
from typing import Iterable, Tuple
from collections.abc import Iterable
import torch
@ -27,7 +27,7 @@ def to_fp16(tensor: torch.Tensor) -> torch.Tensor:
def make_rand_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
k: int) -> tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
@ -63,7 +63,7 @@ def prune_to_2_4(tensor):
def make_rand_sparse_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
k: int) -> tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
@ -88,7 +88,7 @@ def make_rand_sparse_tensors(dtype: torch.dtype, m: int, n: int,
def make_n_rand_sparse_tensors(num_tensors: int, dtype: torch.dtype,
m: int, n: int, k: int) -> \
Tuple[Iterable[torch.Tensor], Iterable[torch.Tensor]]:
tuple[Iterable[torch.Tensor], Iterable[torch.Tensor]]:
ABs = []
for _ in range(num_tensors):
b_comp, e, a, b = make_rand_sparse_tensors(dtype, m, n, k)

View File

@ -5,7 +5,8 @@ import copy
import itertools
import pickle as pkl
import time
from typing import Callable, Iterable, List, Optional, Tuple
from collections.abc import Iterable
from typing import Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark
@ -49,7 +50,7 @@ def bench_int8(
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark INT8-based kernels."""
assert dtype == torch.int8
a, b = make_rand_tensors(torch.int8, m, n, k)
@ -101,7 +102,7 @@ def bench_fp8(
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark FP8-based kernels."""
assert dtype == torch.float8_e4m3fn
a, b = make_rand_tensors(torch.float8_e4m3fn, m, n, k)
@ -180,7 +181,7 @@ def bench(dtype: torch.dtype,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
if dtype == torch.int8:
return bench_int8(dtype, m, k, n, label, sub_label, bench_kernels)
if dtype == torch.float8_e4m3fn:
@ -195,8 +196,8 @@ def print_timers(timers: Iterable[TMeasurement]):
def run(dtype: torch.dtype,
MKNs: Iterable[Tuple[int, int, int]],
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
MKNs: Iterable[tuple[int, int, int]],
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
results = []
for m, k, n in MKNs:
timers = bench(dtype,
@ -212,7 +213,7 @@ def run(dtype: torch.dtype,
def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]],
MKNs: Iterable[tuple[int, int, int]],
base_description: str,
timestamp=None):
print(f"== All Results {base_description} ====")
@ -248,7 +249,7 @@ def run_model_bench(args):
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> List[Tuple[int, int]]:
def model_shapes(model_name: str, tp_size: int) -> list[tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size

View File

@ -2,9 +2,10 @@
import pickle as pkl
import time
from collections.abc import Iterable
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Optional
from typing import Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark
@ -29,7 +30,7 @@ class bench_params_t:
f'x DT {self.dtype}')
def get_bench_params() -> List[bench_params_t]:
def get_bench_params() -> list[bench_params_t]:
## Test Fixtures
NUM_TOKENS = [2**x for x in range(11)]
HIDDEN_SIZES = list(range(1024, 8129, 1024))

View File

@ -9,7 +9,7 @@ from dataclasses import dataclass
from enum import Enum, auto
from itertools import product
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple
from typing import Any, Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark
@ -61,15 +61,15 @@ def make_rand_lora_weight_tensor(k: int,
def make_rand_tensors(
a_shape: Tuple[int],
b_shape: Tuple[int],
c_shape: Tuple[int],
a_shape: tuple[int],
b_shape: tuple[int],
c_shape: tuple[int],
a_dtype: torch.dtype,
b_dtype: torch.dtype,
c_dtype: torch.dtype,
num_slices: int,
device: str = "cuda",
) -> Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor]:
) -> tuple[torch.Tensor, list[torch.Tensor], torch.Tensor]:
"""
Make LoRA input/output matrices.
"""
@ -135,7 +135,7 @@ def make_token_lora_mapping(num_tokens: int, num_prompts: int,
def ref_group_gemm(ref_out: torch.Tensor, input: torch.Tensor,
lora_weights: List[torch.Tensor],
lora_weights: list[torch.Tensor],
seq_lens_cpu: torch.Tensor,
prompt_lora_mapping_cpu: torch.Tensor, scaling: float,
add_inputs: Optional[bool]):
@ -204,7 +204,7 @@ class OpType(Enum):
def is_expand_slice_fn(self) -> bool:
return self in [OpType.BGMV_EXPAND_SLICE]
def num_slices(self) -> List[int]:
def num_slices(self) -> list[int]:
if self in [OpType.SGMV_EXPAND, OpType.SGMV_SHRINK]:
# SGMV kernels supports slices
return [1, 2, 3]
@ -215,7 +215,7 @@ class OpType(Enum):
raise ValueError(f"Unrecognized OpType {self}")
def mkn(self, batch_size: int, seq_length: int, hidden_size: int,
lora_rank: int) -> Tuple[int, int, int]:
lora_rank: int) -> tuple[int, int, int]:
num_tokens = batch_size * seq_length
if self.is_shrink_fn():
m = num_tokens
@ -230,7 +230,7 @@ class OpType(Enum):
def matmul_dtypes(
self, op_dtype: torch.dtype
) -> Tuple[torch.dtype, torch.dtype, torch.dtype]:
) -> tuple[torch.dtype, torch.dtype, torch.dtype]:
"""
return a type, b type and c type for A x B = C
"""
@ -243,7 +243,7 @@ class OpType(Enum):
def matmul_shapes(
self, batch_size: int, seq_length: int, hidden_size: int,
lora_rank: int, num_loras: int,
num_slices: int) -> Tuple[Tuple[int], Tuple[int], Tuple[int]]:
num_slices: int) -> tuple[tuple[int], tuple[int], tuple[int]]:
"""
Given num_slices, return the shapes of the A, B, and C matrices
in A x B = C, for the op_type
@ -268,7 +268,7 @@ class OpType(Enum):
def bench_fn(self) -> Callable:
def emulate_bgmv_expand_slice(kwargs_list: List[Dict[str, Any]]):
def emulate_bgmv_expand_slice(kwargs_list: list[dict[str, Any]]):
for x in kwargs_list:
bgmv_expand_slice(**x)
@ -285,7 +285,7 @@ class OpType(Enum):
raise ValueError(f"Unrecognized optype {self}")
def run_ref_group_gemm(self, output: torch.Tensor, input: torch.Tensor,
lora_weights: List[torch.Tensor],
lora_weights: list[torch.Tensor],
**kwargs) -> Callable:
"""Each benchmark operation expected the input, lora_weights and outputs
in a slightly different format. Refer to self.matmul_shapes().
@ -384,7 +384,7 @@ class BenchmarkTensors:
"""
# matmul tensors
input: torch.Tensor
lora_weights_lst: List[torch.Tensor]
lora_weights_lst: list[torch.Tensor]
output: torch.Tensor
# metadata tensors
seq_lens: torch.Tensor
@ -469,7 +469,7 @@ class BenchmarkTensors:
for i in range(len(self.lora_weights_lst)):
self.lora_weights_lst[i] = to_device(self.lora_weights_lst[i])
def metadata(self) -> Tuple[int, int, int]:
def metadata(self) -> tuple[int, int, int]:
"""
Return num_seqs, num_tokens and max_seq_len
"""
@ -505,7 +505,7 @@ class BenchmarkTensors:
self.seq_lens = seq_lens.to(dtype=self.seq_lens.dtype)
self.seq_start_loc = seq_start_loc.to(dtype=self.seq_start_loc.dtype)
def as_sgmv_shrink_kwargs(self) -> Dict[str, Any]:
def as_sgmv_shrink_kwargs(self) -> dict[str, Any]:
self.convert_to_sgmv_benchmark_tensors()
self.sanity_check()
self.to_device(self.input.device)
@ -540,7 +540,7 @@ class BenchmarkTensors:
'scaling': 1.0,
}
def as_sgmv_expand_kwargs(self, add_inputs: bool) -> Dict[str, Any]:
def as_sgmv_expand_kwargs(self, add_inputs: bool) -> dict[str, Any]:
self.convert_to_sgmv_benchmark_tensors()
self.sanity_check()
@ -578,7 +578,7 @@ class BenchmarkTensors:
'add_inputs': add_inputs,
}
def as_bgmv_shrink_kwargs(self) -> Dict[str, Any]:
def as_bgmv_shrink_kwargs(self) -> dict[str, Any]:
assert len(self.lora_weights_lst) == 1
self.to_device(self.input.device)
@ -634,7 +634,7 @@ class BenchmarkTensors:
'add_inputs': add_inputs
}
def as_bgmv_expand_slice_kwargs(self, add_inputs: bool) -> Dict[str, Any]:
def as_bgmv_expand_slice_kwargs(self, add_inputs: bool) -> dict[str, Any]:
_, num_tokens, _, num_slices = self.metadata()
# Sanity check shapes
@ -670,7 +670,7 @@ class BenchmarkTensors:
def bench_fn_kwargs(self,
op_type: OpType,
add_inputs: Optional[bool] = None) -> Dict[str, Any]:
add_inputs: Optional[bool] = None) -> dict[str, Any]:
if op_type.is_shrink_fn():
assert add_inputs is None
else:
@ -734,7 +734,7 @@ def bench_optype(ctx: BenchmarkContext,
assert expand_fn_add_inputs is not None
# BenchmarkContext -> BenchmarkTensors
bench_tensors : List[BenchmarkTensors] = \
bench_tensors : list[BenchmarkTensors] = \
[BenchmarkTensors.make(ctx, op_type) for _ in range(arg_pool_size)]
for bt in bench_tensors:
bt.sanity_check()
@ -746,7 +746,7 @@ def bench_optype(ctx: BenchmarkContext,
for bt in bench_tensors
])
# BenchmarkTensors -> Dict (kwargs)
# BenchmarkTensors -> dict (kwargs)
kwargs_list = [
bt.bench_fn_kwargs(op_type, add_inputs=expand_fn_add_inputs)
for bt in bench_tensors
@ -841,7 +841,7 @@ def use_cuda_graph_recommendation() -> str:
"""
def print_timers(timers: List[TMeasurement],
def print_timers(timers: list[TMeasurement],
args: Optional[argparse.Namespace] = None):
compare = TBenchmark.Compare(timers)
compare.print()
@ -861,7 +861,7 @@ def print_timers(timers: List[TMeasurement],
"small num_loras the goal should be to match the torch.mm numbers.")
def run(args: argparse.Namespace, bench_ctxs: List[BenchmarkContext]):
def run(args: argparse.Namespace, bench_ctxs: list[BenchmarkContext]):
if args.cuda_graph_nops is not None:
assert args.cuda_graph_nops > 0
@ -873,7 +873,7 @@ def run(args: argparse.Namespace, bench_ctxs: List[BenchmarkContext]):
timers = []
for bench_ctx in bench_ctxs:
for seq_len in args.seq_lengths:
bench_ops: List[OpType] = []
bench_ops: list[OpType] = []
if seq_len == 1:
# bench all decode ops
bench_ops = [op for op in args.op_types if op.is_decode_op()]
@ -921,10 +921,10 @@ def run(args: argparse.Namespace, bench_ctxs: List[BenchmarkContext]):
pickle.dump(timers, f)
def as_benchmark_contexts(hidden_sizes: List[int], lora_ranks: List[int],
args: argparse.Namespace) -> List[BenchmarkContext]:
def as_benchmark_contexts(hidden_sizes: list[int], lora_ranks: list[int],
args: argparse.Namespace) -> list[BenchmarkContext]:
ctxs: List[BenchmarkContext] = []
ctxs: list[BenchmarkContext] = []
for batch_size, hidden_size, lora_rank, num_loras, sort_by_lora_id in product( # noqa
args.batch_sizes, list(hidden_sizes), lora_ranks, args.num_loras,
args.sort_by_lora_id):
@ -954,7 +954,7 @@ def run_list_bench(args: argparse.Namespace):
f" LoRA Ranks {args.lora_ranks}")
# Get all benchmarking contexts
bench_contexts: List[BenchmarkContext] = as_benchmark_contexts(
bench_contexts: list[BenchmarkContext] = as_benchmark_contexts(
hidden_sizes=args.hidden_sizes, lora_ranks=args.lora_ranks, args=args)
run(args, bench_contexts)
@ -975,7 +975,7 @@ def run_range_bench(args: argparse.Namespace):
f" LoRA Ranks {lora_ranks}")
# Get all benchmarking contexts
bench_contexts: List[BenchmarkContext] = as_benchmark_contexts(
bench_contexts: list[BenchmarkContext] = as_benchmark_contexts(
hidden_sizes=hidden_sizes, lora_ranks=lora_ranks, args=args)
run(args, bench_contexts)
@ -1002,7 +1002,7 @@ def run_model_bench(args: argparse.Namespace):
f" LoRA Ranks {args.lora_ranks}")
# Get all benchmarking contexts
bench_contexts: List[BenchmarkContext] = as_benchmark_contexts(
bench_contexts: list[BenchmarkContext] = as_benchmark_contexts(
hidden_sizes=hidden_sizes, lora_ranks=args.lora_ranks, args=args)
run(args, bench_contexts)

View File

@ -7,9 +7,10 @@ import math
import os
import pickle as pkl
import time
from collections.abc import Iterable
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Optional, Tuple
from typing import Callable, Optional
import pandas as pd
import torch
@ -102,8 +103,8 @@ def quantize_and_pack(atype: torch.dtype,
return w_ref, w_q, w_s, w_zp
def create_bench_tensors(shape: Tuple[int, int, int], types: TypeConfig,
group_size: Optional[int]) -> List[BenchmarkTensors]:
def create_bench_tensors(shape: tuple[int, int, int], types: TypeConfig,
group_size: Optional[int]) -> list[BenchmarkTensors]:
m, n, k = shape
# we want to make sure that weights don't fit into L2 cache between runs so
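The comment above motivates allocating several rotating weight tensors: if all weights stayed resident in L2 between timed runs, the benchmark would under-report memory traffic. A hedged sketch of that sizing logic (the cache size and helper name are illustrative assumptions, not the file's actual code):

    import math
    import torch

    def num_weight_buffers(k: int, n: int, weight_dtype: torch.dtype,
                           l2_cache_bytes: int = 50 * 1024 * 1024) -> int:
        # Allocate enough distinct weight copies that they cannot all fit in L2,
        # so each timed iteration reads weights from device memory.
        weight_bytes = k * n * torch.tensor([], dtype=weight_dtype).element_size()
        return max(1, math.ceil(l2_cache_bytes / weight_bytes))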
@ -114,7 +115,7 @@ def create_bench_tensors(shape: Tuple[int, int, int], types: TypeConfig,
a = rand_data((m, k), types.act_type, scale=5)
benchmark_tensors: List[BenchmarkTensors] = []
benchmark_tensors: list[BenchmarkTensors] = []
for _ in range(num_weights):
w = rand_data((k, n), types.act_type, scale=5)
@ -276,7 +277,7 @@ def machete_create_bench_fn(bt: BenchmarkTensors,
def bench_fns(label: str, sub_label: str, description: str,
fns: List[Callable]):
fns: list[Callable]):
min_run_time = 1 if not NVTX_PROFILE else 0.1
res = TBenchmark.Timer(
@ -311,7 +312,7 @@ def bench(types: TypeConfig,
n: int,
label: str,
sub_label: str,
sweep_schedules: bool = True) -> List[TMeasurement]:
sweep_schedules: bool = True) -> list[TMeasurement]:
benchmark_tensors = create_bench_tensors((m, n, k), types, group_size)
sub_label += f", L={len(benchmark_tensors)}"
@ -414,12 +415,12 @@ def bench(types: TypeConfig,
# runner
def print_timers(timers: List[TMeasurement]):
def print_timers(timers: list[TMeasurement]):
compare = TBenchmark.Compare(timers)
compare.print()
def run(args, MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
def run(args, MKNs: Iterable[tuple[int, int, int]]) -> Iterable[TMeasurement]:
types = TypeConfig(
act_type=args.act_type,
weight_type=scalar_types.uint4b8 if args.group_zero_type is None \
@ -431,7 +432,7 @@ def run(args, MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
token_scale_type=args.token_scale_type,
)
results: List[TMeasurement] = []
results: list[TMeasurement] = []
for m, k, n in MKNs:
timers = bench(types,
args.group_size,
@ -449,8 +450,8 @@ def run(args, MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
# output makers
def make_output(
data: List[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]],
data: list[TMeasurement],
MKNs: Iterable[tuple[int, int, int]],
base_description: str,
timestamp=None,
):
@ -497,7 +498,7 @@ def run_model_bench(args):
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> List[Tuple[int, int]]:
def model_shapes(model_name: str, tp_size: int) -> list[tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size

View File

@ -1,7 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from typing import List
import torch
import torch.utils.benchmark as benchmark
from benchmark_shapes import WEIGHT_SHAPES
@ -10,6 +8,8 @@ from vllm import _custom_ops as ops
from vllm.model_executor.layers.quantization.gptq_marlin_24 import (
GPTQ_MARLIN_24_MAX_PARALLEL, GPTQ_MARLIN_24_MIN_THREAD_N,
GPTQ_MARLIN_24_SUPPORTED_GROUP_SIZES, GPTQ_MARLIN_24_SUPPORTED_QUANT_TYPES)
from vllm.model_executor.layers.quantization.utils.allspark_utils import (
ALLSPARK_AMPERE_M_CUBLAS_THRESHOLD, ALLSPARK_SUPPORTED_QUANT_TYPES)
from vllm.model_executor.layers.quantization.utils.marlin_utils import (
GPTQ_MARLIN_MAX_PARALLEL, GPTQ_MARLIN_MIN_THREAD_N,
MARLIN_SUPPORTED_GROUP_SIZES, query_marlin_supported_quant_types)
@ -18,18 +18,18 @@ from vllm.model_executor.layers.quantization.utils.marlin_utils_test import (
from vllm.model_executor.layers.quantization.utils.marlin_utils_test_24 import (
marlin_24_quantize)
from vllm.model_executor.layers.quantization.utils.quant_utils import (
gptq_pack, gptq_quantize_weights, sort_weights)
gptq_pack, gptq_quantize_weights, quantize_weights, sort_weights)
from vllm.scalar_type import ScalarType
from vllm.utils import FlexibleArgumentParser
DEFAULT_MODELS = ["meta-llama/Llama-2-7b-hf/TP1"]
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
ACT_ORDER_OPTS = [False, True]
K_FULL_OPTS = [False, True]
def bench_run(results: List[benchmark.Measurement], model: str,
def bench_run(results: list[benchmark.Measurement], model: str,
act_order: bool, is_k_full: bool, quant_type: ScalarType,
group_size: int, size_m: int, size_k: int, size_n: int):
label = "Quant Matmul"
@ -81,6 +81,27 @@ def bench_run(results: List[benchmark.Measurement], model: str,
GPTQ_MARLIN_24_MAX_PARALLEL)
marlin_zp = torch.zeros_like(marlin_s, dtype=torch.int)
# AllSpark W8A16 quant
as_supported_case = (quant_type in ALLSPARK_SUPPORTED_QUANT_TYPES
and group_size == -1 and not act_order and is_k_full)
if as_supported_case:
properties = torch.cuda.get_device_properties(b.device.index)
sm_count = properties.multi_processor_count
sm_version = properties.major * 10 + properties.minor
supported_arch = (sm_version >= 80 and sm_version < 90)
as_supported_case = as_supported_case and supported_arch
if supported_arch:
has_zp = False
w_ref, qw, s, zp = quantize_weights(b, quant_type, group_size,
has_zp)
qw = qw.to(torch.uint8)
qw_reorder, s_reorder, zp_reorder = \
ops.allspark_repack_weight(
qw, s, zp, has_zp)
CUBLAS_M_THRESHOLD = ALLSPARK_AMPERE_M_CUBLAS_THRESHOLD
globals = {
# Gen params
"quant_type": quant_type,
@ -109,10 +130,19 @@ def bench_run(results: List[benchmark.Measurement], model: str,
# GPTQ params
"q_w_gptq": q_w_gptq,
"repack_sort_indices": repack_sort_indices,
# AllSpark W8A16 params
"qw_reorder": qw_reorder if as_supported_case else None,
"s_reorder": s_reorder if as_supported_case else None,
"zp_reorder": zp_reorder if as_supported_case else None,
"sm_count": sm_count if as_supported_case else None,
"sm_version": sm_version if as_supported_case else None,
"CUBLAS_M_THRESHOLD":
CUBLAS_M_THRESHOLD if as_supported_case else None,
# Kernels
"gptq_marlin_gemm": ops.gptq_marlin_gemm,
"gptq_marlin_24_gemm": ops.gptq_marlin_24_gemm,
"gptq_marlin_repack": ops.gptq_marlin_repack,
"allspark_w8a16_gemm": ops.allspark_w8a16_gemm,
}
min_run_time = 1
@ -172,13 +202,24 @@ def bench_run(results: List[benchmark.Measurement], model: str,
description="gptq_marlin_repack",
).blocked_autorange(min_run_time=min_run_time))
if as_supported_case:
results.append(
benchmark.Timer(
stmt=
"output = allspark_w8a16_gemm(a, qw_reorder, s_reorder, zp_reorder, size_n, group_size, sm_count, sm_version, CUBLAS_M_THRESHOLD, False, True)", # noqa: E501
globals=globals,
label=label,
sub_label=sub_label,
description="allspark_w8a16_gemm_fp32",
).blocked_autorange(min_run_time=min_run_time))
def main(args):
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
results: List[benchmark.Measurement] = []
results: list[benchmark.Measurement] = []
for model in args.models:
for layer in WEIGHT_SHAPES[model]:

View File

@ -2,9 +2,10 @@
import argparse
import time
from contextlib import nullcontext
from datetime import datetime
from itertools import product
from typing import Any, Dict, List, Tuple, TypedDict
from typing import Any, TypedDict
import ray
import torch
@ -40,6 +41,7 @@ def benchmark_config(
use_fp8_w8a8: bool,
use_int8_w8a16: bool,
num_iters: int = 100,
block_quant_shape: list[int] = None,
) -> float:
init_dtype = torch.float16 if use_fp8_w8a8 else dtype
x = torch.randn(num_tokens, hidden_size, dtype=dtype)
@ -81,8 +83,24 @@ def benchmark_config(
dtype=torch.float32)
w2_scale = torch.randn((hidden_size, num_experts), dtype=torch.float32)
if use_fp8_w8a8:
w1_scale = torch.randn(num_experts, dtype=torch.float32)
w2_scale = torch.randn(num_experts, dtype=torch.float32)
if block_quant_shape:
block_n, block_k = block_quant_shape[0], block_quant_shape[1]
E = num_experts
N = shard_intermediate_size // 2
K = hidden_size
factor_for_scale = 1e-2
n_tiles_w1 = (2 * N + block_n - 1) // block_n
n_tiles_w2 = (K + block_n - 1) // block_n
k_tiles_w1 = (K + block_k - 1) // block_k
k_tiles_w2 = (N + block_k - 1) // block_k
w1_scale = torch.rand((E, n_tiles_w1, k_tiles_w1),
dtype=torch.float32) * factor_for_scale
w2_scale = torch.rand((E, n_tiles_w2, k_tiles_w2),
dtype=torch.float32) * factor_for_scale
else:
w1_scale = torch.randn(num_experts, dtype=torch.float32)
w2_scale = torch.randn(num_experts, dtype=torch.float32)
a1_scale = torch.randn(1, dtype=torch.float32)
a2_scale = torch.randn(1, dtype=torch.float32)
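A worked example of the ceil-division tile counts above (the sizes are illustrative, not taken from the script): with K = hidden_size = 7168, N = shard_intermediate_size // 2 = 2048 and a [block_n, block_k] = [128, 128] quantization block,

    block_n, block_k = 128, 128
    N, K = 2048, 7168
    n_tiles_w1 = (2 * N + block_n - 1) // block_n   # 32
    n_tiles_w2 = (K + block_n - 1) // block_n       # 56
    k_tiles_w1 = (K + block_k - 1) // block_k       # 56
    k_tiles_w2 = (N + block_k - 1) // block_k       # 16
    # => w1_scale has shape (E, 32, 56) and w2_scale has shape (E, 56, 16)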
@ -111,6 +129,7 @@ def benchmark_config(
w2_scale=w2_scale,
a1_scale=a1_scale,
a2_scale=a2_scale,
block_shape=block_quant_shape,
)
# JIT compilation & warmup
@ -132,7 +151,7 @@ def benchmark_config(
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
latencies: List[float] = []
latencies: list[float] = []
for i in range(num_iters):
prepare(i)
torch.cuda.synchronize()
@ -175,8 +194,9 @@ def get_rocm_tuning_space(use_fp16):
return param_ranges
def get_configs_compute_bound(use_fp16) -> List[Dict[str, int]]:
configs: List[BenchmarkConfig] = []
def get_configs_compute_bound(use_fp16,
block_quant_shape) -> list[dict[str, int]]:
configs: list[BenchmarkConfig] = []
if current_platform.is_rocm():
param_ranges = get_rocm_tuning_space(use_fp16)
@ -204,17 +224,27 @@ def get_configs_compute_bound(use_fp16) -> List[Dict[str, int]]:
for config_values in product(*values):
config = dict(zip(keys, config_values))
configs.append(config)
# Remove configs that are not compatible with fp8 block quantization
# BLOCK_SIZE_K must be a multiple of block_k
# BLOCK_SIZE_N must be a multiple of block_n
if block_quant_shape is not None and not use_fp16:
block_n, block_k = block_quant_shape[0], block_quant_shape[1]
for config in configs[:]:
if config["BLOCK_SIZE_K"] % block_k != 0 or config[
"BLOCK_SIZE_N"] % block_n != 0:
configs.remove(config)
return configs
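A quick illustration of the filter above (values made up): with block_quant_shape = [128, 128], a candidate {"BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128} is removed because 64 is not a multiple of block_n, while {"BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 256} is kept since both tile sizes are multiples of the quantization block.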
def prune_rocm_search_space(num_tokens, shard_intermediate_size, hidden_size,
search_space, is_fp16):
search_space, is_fp16, topk):
N1, K1 = shard_intermediate_size, hidden_size
N2, K2 = hidden_size, shard_intermediate_size // 2
pruned_space_1 = prune_rocm_configs(num_tokens * 2, N1, K1, search_space,
is_fp16)
pruned_space_2 = prune_rocm_configs(num_tokens * 2, N2, K2, search_space,
is_fp16)
pruned_space_1 = prune_rocm_configs(num_tokens * topk, N1, K1,
search_space, is_fp16)
pruned_space_2 = prune_rocm_configs(num_tokens * topk, N2, K2,
search_space, is_fp16)
search_space = merge_unique_dicts(pruned_space_1, pruned_space_2)
return search_space
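The change above sizes the pruning problem by routed token-expert pairs instead of a hard-coded factor of 2: each token is dispatched to topk experts, so both MoE GEMMs see an effective M of roughly num_tokens * topk. A short illustration with made-up sizes:

    num_tokens, topk = 256, 2
    hidden_size, shard_intermediate_size = 4096, 14336
    M = num_tokens * topk                                       # 512 rows for both GEMMs
    gemm1_mnk = (M, shard_intermediate_size, hidden_size)       # (M, N1, K1)
    gemm2_mnk = (M, hidden_size, shard_intermediate_size // 2)  # (M, N2, K2)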
@ -335,7 +365,7 @@ class BenchmarkWorker:
dtype: torch.dtype,
use_fp8_w8a8: bool,
use_int8_w8a16: bool,
) -> Tuple[Dict[str, int], float]:
) -> tuple[dict[str, int], float]:
current_platform.seed_everything(self.seed)
dtype_str = get_config_dtype_str(dtype,
use_int8_w8a16=use_int8_w8a16,
@ -371,8 +401,9 @@ class BenchmarkWorker:
dtype: torch.dtype,
use_fp8_w8a8: bool,
use_int8_w8a16: bool,
search_space: List[Dict[str, int]],
) -> Dict[str, int]:
search_space: list[dict[str, int]],
block_quant_shape: list[int],
) -> dict[str, int]:
best_config = None
best_time = float("inf")
if current_platform.is_rocm():
@ -380,21 +411,24 @@ class BenchmarkWorker:
search_space = prune_rocm_search_space(num_tokens,
shard_intermediate_size,
hidden_size, search_space,
is_fp16)
is_fp16, topk)
with torch.cuda.device(self.device_id):
with torch.cuda.device(self.device_id) if current_platform.is_rocm(
) else nullcontext():
for config in tqdm(search_space):
try:
kernel_time = benchmark_config(config,
num_tokens,
num_experts,
shard_intermediate_size,
hidden_size,
topk,
dtype,
use_fp8_w8a8,
use_int8_w8a16,
num_iters=20)
kernel_time = benchmark_config(
config,
num_tokens,
num_experts,
shard_intermediate_size,
hidden_size,
topk,
dtype,
use_fp8_w8a8,
use_int8_w8a16,
num_iters=20,
block_quant_shape=block_quant_shape)
except triton.runtime.autotuner.OutOfResources:
# Some configurations may be invalid and fail to compile.
continue
@ -434,10 +468,10 @@ def sort_config(config: BenchmarkConfig) -> BenchmarkConfig:
}
def save_configs(configs: Dict[int, BenchmarkConfig], num_experts: int,
def save_configs(configs: dict[int, BenchmarkConfig], num_experts: int,
shard_intermediate_size: int, hidden_size: int, topk: int,
dtype: torch.dtype, use_fp8_w8a8: bool,
use_int8_w8a16: bool) -> None:
dtype: torch.dtype, use_fp8_w8a8: bool, use_int8_w8a16: bool,
block_quant_shape: list[int]) -> None:
dtype_str = get_config_dtype_str(dtype,
use_int8_w8a16=use_int8_w8a16,
use_fp8_w8a8=use_fp8_w8a8)
@ -445,7 +479,7 @@ def save_configs(configs: Dict[int, BenchmarkConfig], num_experts: int,
# NOTE(woosuk): The current naming convention uses w2.shape[2], which
# is the intermediate size after silu_and_mul.
filename = get_config_file_name(num_experts, shard_intermediate_size // 2,
dtype_str)
dtype_str, block_quant_shape)
print(f"Writing best config to {filename}...")
with open(filename, "w") as f:
@ -455,7 +489,7 @@ def save_configs(configs: Dict[int, BenchmarkConfig], num_experts: int,
def main(args: argparse.Namespace):
print(args)
block_quant_shape = None
config = AutoConfig.from_pretrained(
args.model, trust_remote_code=args.trust_remote_code)
if config.architectures[0] == "DbrxForCausalLM":
@ -474,6 +508,7 @@ def main(args: argparse.Namespace):
topk = config.num_experts_per_tok
intermediate_size = config.moe_intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size
block_quant_shape = config.quantization_config['weight_block_size']
else:
# Default: Mixtral.
E = config.num_local_experts
@ -498,7 +533,7 @@ def main(args: argparse.Namespace):
num_gpus = int(ray.available_resources()["GPU"])
workers = [BenchmarkWorker.remote(args.seed) for _ in range(num_gpus)]
def _distribute(method: str, inputs: List[Any]) -> List[Any]:
def _distribute(method: str, inputs: list[Any]) -> list[Any]:
outputs = []
worker_idx = 0
for input_args in inputs:
@ -511,27 +546,30 @@ def main(args: argparse.Namespace):
if args.tune:
is_fp16 = not (use_fp8_w8a8 or use_int8_w8a16)
search_space = get_configs_compute_bound(is_fp16)
search_space = get_configs_compute_bound(is_fp16, block_quant_shape)
print(f"Start tuning over {len(search_space)} configurations...")
start = time.time()
configs = _distribute(
"tune", [(batch_size, E, shard_intermediate_size, hidden_size,
topk, dtype, use_fp8_w8a8, use_int8_w8a16, search_space)
for batch_size in batch_sizes])
"tune",
[(batch_size, E, shard_intermediate_size, hidden_size, topk, dtype,
use_fp8_w8a8, use_int8_w8a16, search_space, block_quant_shape)
for batch_size in batch_sizes])
best_configs = {
M: sort_config(config)
for M, config in zip(batch_sizes, configs)
}
save_configs(best_configs, E, shard_intermediate_size, hidden_size,
topk, dtype, use_fp8_w8a8, use_int8_w8a16)
topk, dtype, use_fp8_w8a8, use_int8_w8a16,
block_quant_shape)
end = time.time()
print(f"Tuning took {end - start:.2f} seconds")
else:
outputs = _distribute(
"benchmark", [(batch_size, E, shard_intermediate_size, hidden_size,
topk, dtype, use_fp8_w8a8, use_int8_w8a16)
for batch_size in batch_sizes])
"benchmark",
[(batch_size, E, shard_intermediate_size, hidden_size, topk, dtype,
use_fp8_w8a8, use_int8_w8a16, block_quant_shape)
for batch_size in batch_sizes])
for batch_size, (config, kernel_time) in zip(batch_sizes, outputs):
print(f"Batch size: {batch_size}, config: {config}")

View File

@ -2,7 +2,7 @@
import random
import time
from typing import List, Optional
from typing import Optional
import torch
@ -11,8 +11,9 @@ from vllm.platforms import current_platform
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
create_kv_caches_with_random)
NUM_BLOCKS = 1024
NUM_BLOCKS = 128 * 1024
PARTITION_SIZE = 512
PARTITION_SIZE_ROCM = 256
@torch.inference_mode()
@ -54,7 +55,7 @@ def main(
# Create the block tables.
max_num_blocks_per_seq = (max_seq_len + block_size - 1) // block_size
block_tables_lst: List[List[int]] = []
block_tables_lst: list[list[int]] = []
for _ in range(num_seqs):
block_table = [
random.randint(0, NUM_BLOCKS - 1)
@ -80,6 +81,12 @@ def main(
# Prepare for the paged attention kernel.
output = torch.empty_like(query)
if version == "v2":
if current_platform.is_rocm():
global PARTITION_SIZE
if not args.custom_paged_attn:
PARTITION_SIZE = 1024
else:
PARTITION_SIZE = PARTITION_SIZE_ROCM
num_partitions = ((max_seq_len + PARTITION_SIZE - 1) // PARTITION_SIZE)
tmp_output = torch.empty(
size=(num_seqs, num_query_heads, num_partitions, head_size),
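For intuition on the partition count above (sizes are illustrative): with max_seq_len = 4096, the default PARTITION_SIZE = 512 gives 8 partitions per sequence, the ROCm custom paged-attention path (PARTITION_SIZE_ROCM = 256) gives 16, and the non-custom ROCm path (PARTITION_SIZE = 1024) gives 4:

    for partition_size in (512, 256, 1024):
        print(partition_size, (4096 + partition_size - 1) // partition_size)  # 8, 16, 4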
@ -123,25 +130,46 @@ def main(
v_scale,
)
elif version == "v2":
ops.paged_attention_v2(
output,
exp_sums,
max_logits,
tmp_output,
query,
key_cache,
value_cache,
num_kv_heads,
scale,
block_tables,
seq_lens,
block_size,
max_seq_len,
alibi_slopes,
kv_cache_dtype,
k_scale,
v_scale,
)
if not args.custom_paged_attn:
ops.paged_attention_v2(
output,
exp_sums,
max_logits,
tmp_output,
query,
key_cache,
value_cache,
num_kv_heads,
scale,
block_tables,
seq_lens,
block_size,
max_seq_len,
alibi_slopes,
kv_cache_dtype,
k_scale,
v_scale,
)
else:
ops.paged_attention_rocm(
output,
exp_sums,
max_logits,
tmp_output,
query,
key_cache,
value_cache,
num_kv_heads,
scale,
block_tables,
seq_lens,
block_size,
max_seq_len,
alibi_slopes,
kv_cache_dtype,
k_scale,
v_scale,
)
else:
raise ValueError(f"Invalid version: {version}")
torch.cuda.synchronize()
@ -195,6 +223,9 @@ if __name__ == '__main__':
help="Data type for kv cache storage. If 'auto', will use model "
"data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. "
"ROCm (AMD GPU) supports fp8 (=fp8_e4m3)")
parser.add_argument("--custom-paged-attn",
action="store_true",
help="Use custom paged attention")
args = parser.parse_args()
print(args)

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
import itertools
from typing import Optional, Tuple, Union
from typing import Optional, Union
import torch
import triton
@ -22,7 +22,7 @@ class HuggingFaceRMSNorm(nn.Module):
self,
x: torch.Tensor,
residual: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
) -> Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
orig_dtype = x.dtype
x = x.to(torch.float32)
if residual is not None:

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
from itertools import accumulate
from typing import List, Optional
from typing import Optional
import nvtx
import torch
@ -39,7 +39,7 @@ def benchmark_rope_kernels_multi_lora(
})
# non-batched RoPE takes only one scaling factor, we create multiple
# instances to simulate the same behavior
non_batched_ropes: List[RotaryEmbedding] = []
non_batched_ropes: list[RotaryEmbedding] = []
for scaling_factor in scaling_factors:
non_batched_ropes.append(
get_rope(head_size, rotary_dim, max_position, base, is_neox_style,

View File

@ -4,7 +4,6 @@ import math
import pickle
import re
from collections import defaultdict
from typing import List
import matplotlib.pyplot as plt
import pandas as pd
@ -23,7 +22,7 @@ if __name__ == "__main__":
with open(args.filename, 'rb') as f:
data = pickle.load(f)
raw_results: List[TMeasurement] = data["results"]
raw_results: list[TMeasurement] = data["results"]
results = defaultdict(lambda: list())
for v in raw_results:

View File

@ -1,7 +1,8 @@
# SPDX-License-Identifier: Apache-2.0
import dataclasses
from typing import Any, Callable, Iterable, Optional
from collections.abc import Iterable
from typing import Any, Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark

View File

@ -2,6 +2,10 @@
#include <torch/all.h>
#include <cmath>
#if defined(__APPLE__)
#include "omp.h"
#endif
namespace vec_op {
#ifdef ARM_BF16_SUPPORT

View File

@ -22,7 +22,7 @@ struct identity {
T operator()(T lhs) const { return lhs; }
};
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct TrivialEpilogue {
private:
using Accum = cutlass::epilogue::fusion::Sm90AccFetch;
@ -44,32 +44,30 @@ struct TrivialEpilogue {
* This class provides the common load descriptors for the
* ScaledEpilogue[...] classes
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogueBase {
protected:
using Accum = cutlass::epilogue::fusion::Sm90AccFetch;
template <typename T>
using ColOrScalarLoad = cutlass::epilogue::fusion::Sm90ColOrScalarBroadcast<
0 /*Stages*/, typename EpilogueDescriptor::TileShape, T,
Stride<Int<1>, Int<0>, Int<0>>>;
0 /*Stages*/, TileShape, T, Stride<Int<1>, Int<0>, Int<0>>>;
template <typename T>
using RowOrScalarLoad = cutlass::epilogue::fusion::Sm90RowOrScalarBroadcast<
0 /*Stages*/, typename EpilogueDescriptor::TileShape, T,
Stride<Int<0>, Int<1>, Int<0>>>;
0 /*Stages*/, TileShape, T, Stride<Int<0>, Int<1>, Int<0>>>;
// Don't want to support nullptr by default
template <typename T, bool EnableNullPtr = false>
using ColLoad = cutlass::epilogue::fusion::Sm90ColBroadcast<
0 /*Stages*/, typename EpilogueDescriptor::TileShape, T, T,
Stride<Int<1>, Int<0>, Int<0>>, 128 / sizeof_bits_v<T>, EnableNullPtr>;
0 /*Stages*/, TileShape, T, T, Stride<Int<1>, Int<0>, Int<0>>,
128 / sizeof_bits_v<T>, EnableNullPtr>;
// Don't want to support nullptr by default
template <typename T, bool EnableNullPtr = false>
using RowLoad = cutlass::epilogue::fusion::Sm90RowBroadcast<
0 /*Stages*/, typename EpilogueDescriptor::TileShape, T, T,
Stride<Int<0>, Int<1>, Int<0>>, 128 / sizeof_bits_v<T>, EnableNullPtr>;
0 /*Stages*/, TileShape, T, T, Stride<Int<0>, Int<1>, Int<0>>,
128 / sizeof_bits_v<T>, EnableNullPtr>;
// This utility function constructs the arguments for the load descriptors
// from a tensor. It can handle both row and column, as well as row/column or
@ -116,11 +114,11 @@ struct ScaledEpilogueBase {
the A and B operands respectively. These scales may be either per-tensor or
per row or column.
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogue
: private ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor> {
: private ScaledEpilogueBase<ElementAcc, ElementD, TileShape> {
private:
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor>;
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, TileShape>;
using Accum = typename SUPER::Accum;
using ScaleA = typename SUPER::template ColOrScalarLoad<float>;
using ScaleB = typename SUPER::template RowOrScalarLoad<float>;
@ -160,11 +158,11 @@ struct ScaledEpilogue
* The bias tensor must be per-output channel.
* ScaleA and ScaleB can be per-tensor or per-token/per-channel.
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogueBias
: private ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor> {
: private ScaledEpilogueBase<ElementAcc, ElementD, TileShape> {
private:
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor>;
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, TileShape>;
using Accum = typename SUPER::Accum;
using ScaleA = typename SUPER::template ColOrScalarLoad<float>;
using ScaleB = typename SUPER::template RowOrScalarLoad<float>;
@ -203,11 +201,11 @@ struct ScaledEpilogueBias
* bias is a column vector instead of a row vector. Useful e.g. if we are
* computing a GEMM via C^T += B^T A^T. This happens in the 2:4 sparse kernels.
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogueColumnBias
: private ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor> {
: private ScaledEpilogueBase<ElementAcc, ElementD, TileShape> {
private:
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor>;
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, TileShape>;
using Accum = typename SUPER::Accum;
using ScaleA = typename SUPER::template ColOrScalarLoad<float>;
using ScaleB = typename SUPER::template RowOrScalarLoad<float>;
@ -249,11 +247,11 @@ struct ScaledEpilogueColumnBias
*
* This epilogue also supports bias, which remains per-channel.
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogueBiasAzp
: private ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor> {
: private ScaledEpilogueBase<ElementAcc, ElementD, TileShape> {
private:
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor>;
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, TileShape>;
using Accum = typename SUPER::Accum;
using ScaleA = typename SUPER::template ColOrScalarLoad<float>;
using ScaleB = typename SUPER::template RowOrScalarLoad<float>;
@ -314,11 +312,11 @@ struct ScaledEpilogueBiasAzp
*
* This epilogue also supports bias, which remains per-channel.
*/
template <typename ElementAcc, typename ElementD, typename EpilogueDescriptor>
template <typename ElementAcc, typename ElementD, typename TileShape>
struct ScaledEpilogueBiasAzpToken
: private ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor> {
: private ScaledEpilogueBase<ElementAcc, ElementD, TileShape> {
private:
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, EpilogueDescriptor>;
using SUPER = ScaledEpilogueBase<ElementAcc, ElementD, TileShape>;
using Accum = typename SUPER::Accum;
using ScaleA = typename SUPER::template ColOrScalarLoad<float>;
using ScaleB = typename SUPER::template RowOrScalarLoad<float>;

View File

@ -1,7 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
import enum
from typing import Dict, Union
from typing import Union
from cutlass_library import *
@ -21,7 +21,7 @@ class MixedInputKernelScheduleType(enum.Enum):
TmaWarpSpecializedCooperative = enum_auto()
VLLMDataTypeNames: Dict[Union[VLLMDataType, DataType], str] = {
VLLMDataTypeNames: dict[Union[VLLMDataType, DataType], str] = {
**DataTypeNames, # type: ignore
**{
VLLMDataType.u4b8: "u4b8",
@ -29,7 +29,7 @@ VLLMDataTypeNames: Dict[Union[VLLMDataType, DataType], str] = {
}
}
VLLMDataTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
VLLMDataTypeTag: dict[Union[VLLMDataType, DataType], str] = {
**DataTypeTag, # type: ignore
**{
VLLMDataType.u4b8: "cutlass::vllm_uint4b8_t",
@ -37,7 +37,7 @@ VLLMDataTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
}
}
VLLMDataTypeSize: Dict[Union[VLLMDataType, DataType], int] = {
VLLMDataTypeSize: dict[Union[VLLMDataType, DataType], int] = {
**DataTypeSize, # type: ignore
**{
VLLMDataType.u4b8: 4,
@ -45,7 +45,7 @@ VLLMDataTypeSize: Dict[Union[VLLMDataType, DataType], int] = {
}
}
VLLMDataTypeVLLMScalarTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
VLLMDataTypeVLLMScalarTypeTag: dict[Union[VLLMDataType, DataType], str] = {
VLLMDataType.u4b8: "vllm::kU4B8",
VLLMDataType.u8b128: "vllm::kU8B128",
DataType.u4: "vllm::kU4",
@ -56,7 +56,7 @@ VLLMDataTypeVLLMScalarTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
DataType.bf16: "vllm::kBfloat16",
}
VLLMDataTypeTorchDataTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
VLLMDataTypeTorchDataTypeTag: dict[Union[VLLMDataType, DataType], str] = {
DataType.u8: "at::ScalarType::Byte",
DataType.s8: "at::ScalarType::Char",
DataType.e4m3: "at::ScalarType::Float8_e4m3fn",
@ -66,7 +66,7 @@ VLLMDataTypeTorchDataTypeTag: Dict[Union[VLLMDataType, DataType], str] = {
DataType.f32: "at::ScalarType::Float",
}
VLLMKernelScheduleTag: Dict[Union[
VLLMKernelScheduleTag: dict[Union[
MixedInputKernelScheduleType, KernelScheduleType], str] = {
**KernelScheduleTag, # type: ignore
**{

View File

@ -16,6 +16,7 @@
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/util/packed_stride.hpp"
#include "core/math.hpp"
#include "cutlass_extensions/common.hpp"
@ -64,22 +65,28 @@ void cutlass_gemm_caller(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b,
EpilogueArgs&&... epilogue_params) {
using ElementAB = typename Gemm::ElementAB;
using ElementC = typename Gemm::ElementC;
using ElementD = typename Gemm::ElementD;
using GemmKernel = typename Gemm::GemmKernel;
int64_t lda = a.stride(0);
int64_t ldb = b.stride(1);
int64_t ldc = out.stride(0);
using StrideA = cute::Stride<int64_t, cute::Int<1>, int64_t>;
using StrideB = cute::Stride<int64_t, cute::Int<1>, int64_t>;
using StrideC = typename Gemm::StrideC;
StrideA a_stride{lda, cute::Int<1>{}, 0};
StrideB b_stride{ldb, cute::Int<1>{}, 0};
StrideC c_stride{ldc, cute::Int<1>{}, cute::Int<0>{}};
using StrideA = typename Gemm::GemmKernel::StrideA;
using StrideB = typename Gemm::GemmKernel::StrideB;
using StrideC = typename Gemm::GemmKernel::StrideC;
using StrideD = StrideC;
using StrideAux = StrideC;
typename GemmKernel::ProblemShape prob_shape = get_problem_shape(a, b);
auto [M, N, K, L] = prob_shape;
StrideA a_stride =
cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(M, K, L));
StrideB b_stride =
cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(N, K, L));
StrideC c_stride =
cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(M, N, L));
StrideD d_stride =
cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(M, N, L));
StrideAux aux_stride = d_stride;
auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
@ -87,10 +94,11 @@ void cutlass_gemm_caller(torch::Tensor& out, torch::Tensor const& a,
b_stride};
auto c_ptr = static_cast<ElementD*>(out.data_ptr());
// auto d_ptr = static_cast<ElementC*>(out.data_ptr());
typename GemmKernel::EpilogueArguments epilogue_args{
Gemm::Epilogue::prepare_args(
std::forward<EpilogueArgs>(epilogue_params)...),
c_ptr, c_stride, c_ptr, c_stride};
c_ptr, c_stride, c_ptr, d_stride};
cutlass_gemm_caller<GemmKernel>(a.device(), prob_shape, mainloop_args,
epilogue_args);

View File

@ -40,12 +40,7 @@ struct cutlass_3x_gemm {
typename std::conditional<std::is_same_v<ElementAB, int8_t>, int32_t,
float>::type;
using EpilogueDescriptor =
cutlass::epilogue::collective::detail::EpilogueDescriptor<
TileShape, cutlass::epilogue::collective::EpilogueTileAuto, ElementD,
ElementD, EpilogueSchedule>;
using Epilogue = Epilogue_<ElementAcc, ElementD, EpilogueDescriptor>;
using Epilogue = Epilogue_<ElementAcc, ElementD, TileShape>;
using StrideD = Stride<int64_t, Int<1>, Int<0>>;
using ElementC = void;
@ -88,4 +83,65 @@ struct cutlass_3x_gemm {
struct GemmKernel : public KernelType {};
};
template <typename ElementAB_, typename ElementD_,
template <typename, typename, typename> typename Epilogue_,
typename TileShape, typename ClusterShape, typename KernelSchedule,
typename EpilogueSchedule>
struct cutlass_3x_gemm_sm100 {
using ElementAB = ElementAB_;
using LayoutA = cutlass::layout::RowMajor;
static constexpr int AlignmentA =
128 / cutlass::sizeof_bits<ElementAB>::value;
using LayoutB = cutlass::layout::ColumnMajor;
static constexpr int AlignmentB =
128 / cutlass::sizeof_bits<ElementAB>::value;
using ElementC = void;
using LayoutC = cutlass::layout::RowMajor;
static constexpr int AlignmentC =
128 / cutlass::sizeof_bits<ElementD_>::value;
using ElementD = ElementD_;
using LayoutD = cutlass::layout::RowMajor;
static constexpr int AlignmentD = AlignmentC;
using ElementAcc =
typename std::conditional<std::is_same_v<ElementAB, int8_t>, int32_t,
float>::type;
using Epilogue = Epilogue_<ElementAcc, ElementD, TileShape>;
// MMA type
using ElementAccumulator = float;
// Epilogue types
using ElementBias = cutlass::half_t;
using ElementCompute = float;
using ElementAux = ElementD;
using LayoutAux = LayoutD;
using ElementAmax = float;
using EVTCompute = typename Epilogue::EVTCompute;
using CollectiveEpilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, TileShape,
ClusterShape, cutlass::epilogue::collective::EpilogueTileAuto,
ElementAccumulator, ElementCompute, ElementC, LayoutC, AlignmentC,
ElementD, LayoutD, AlignmentD, EpilogueSchedule,
EVTCompute>::CollectiveOp;
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp, ElementAB,
LayoutA, AlignmentA, ElementAB, LayoutB, AlignmentB,
ElementAccumulator, TileShape, ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(
sizeof(typename CollectiveEpilogue::SharedStorage))>,
KernelSchedule>::CollectiveOp;
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>;
};
} // namespace vllm

View File

@ -30,4 +30,10 @@ void cutlass_scaled_mm_blockwise_sm90_fp8(torch::Tensor& out,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales);
void cutlass_scaled_mm_sm100_fp8(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
std::optional<torch::Tensor> const& bias);
} // namespace vllm

View File

@ -0,0 +1,24 @@
#include "scaled_mm_kernels.hpp"
#include "scaled_mm_sm100_fp8_dispatch.cuh"
#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
namespace vllm {
void cutlass_scaled_mm_sm100_fp8(torch::Tensor& out, torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
std::optional<torch::Tensor> const& bias) {
TORCH_CHECK(a_scales.is_contiguous() && b_scales.is_contiguous());
if (bias) {
TORCH_CHECK(bias->dtype() == out.dtype(),
"currently bias dtype must match output dtype ", out.dtype());
return cutlass_scaled_mm_sm100_fp8_epilogue<c3x::ScaledEpilogueBias>(
out, a, b, a_scales, b_scales, *bias);
} else {
return cutlass_scaled_mm_sm100_fp8_epilogue<c3x::ScaledEpilogue>(
out, a, b, a_scales, b_scales);
}
}
} // namespace vllm

View File

@ -0,0 +1,67 @@
#pragma once
#include "scaled_mm.cuh"
#include "cutlass_gemm_caller.cuh"
/**
* This file defines Gemm kernel configurations for SM100 (fp8) based on the
* Gemm shape.
*/
namespace vllm {
using c3x::cutlass_gemm_caller;
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm100_fp8_config_default {
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule = cutlass::gemm::collective::KernelScheduleAuto;
using EpilogueSchedule = cutlass::epilogue::collective::EpilogueScheduleAuto;
using TileShape = Shape<_256, _128, _64>;
using ClusterShape = Shape<_2, _2, _1>;
using Cutlass3xGemm =
cutlass_3x_gemm_sm100<InType, OutType, Epilogue, TileShape, ClusterShape,
KernelSchedule, EpilogueSchedule>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue,
typename... EpilogueArgs>
inline void cutlass_gemm_sm100_fp8_dispatch(torch::Tensor& out,
torch::Tensor const& a,
torch::Tensor const& b,
EpilogueArgs&&... args) {
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b.dtype() == torch::kFloat8_e4m3fn);
using Cutlass3xGemmDefault =
typename sm100_fp8_config_default<InType, OutType,
Epilogue>::Cutlass3xGemm;
return cutlass_gemm_caller<Cutlass3xGemmDefault>(
out, a, b, std::forward<EpilogueArgs>(args)...);
}
template <template <typename, typename, typename> typename Epilogue,
typename... EpilogueArgs>
void cutlass_scaled_mm_sm100_fp8_epilogue(torch::Tensor& out,
torch::Tensor const& a,
torch::Tensor const& b,
EpilogueArgs&&... epilogue_args) {
TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b.dtype() == torch::kFloat8_e4m3fn);
if (out.dtype() == torch::kBFloat16) {
return cutlass_gemm_sm100_fp8_dispatch<cutlass::float_e4m3_t,
cutlass::bfloat16_t, Epilogue>(
out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
} else {
TORCH_CHECK(out.dtype() == torch::kFloat16);
return cutlass_gemm_sm100_fp8_dispatch<cutlass::float_e4m3_t,
cutlass::half_t, Epilogue>(
out, a, b, std::forward<EpilogueArgs>(epilogue_args)...);
}
}
} // namespace vllm

View File

@ -71,3 +71,28 @@ void cutlass_scaled_mm_azp_sm90(torch::Tensor& out, torch::Tensor const& a,
vllm::cutlass_scaled_mm_azp_sm90_int8(out, a, b, a_scales, b_scales, azp_adj,
azp, bias);
}
#if defined CUDA_VERSION && CUDA_VERSION >= 12080
void cutlass_scaled_mm_sm100(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
std::optional<torch::Tensor> const& bias) {
TORCH_CHECK(a_scales.dtype() == torch::kFloat32);
TORCH_CHECK(b_scales.dtype() == torch::kFloat32);
int M = a.size(0), N = b.size(1), K = a.size(1);
TORCH_CHECK(
(a_scales.numel() == 1 || a_scales.numel() == a.size(0)) &&
(b_scales.numel() == 1 || b_scales.numel() == b.size(1)),
"Currently, block scaled fp8 gemm is not implemented for Blackwell");
// Standard per-tensor/per-token/per-channel scaling
TORCH_CHECK(a_scales.is_contiguous() && b_scales.is_contiguous());
TORCH_CHECK(a.dtype() == torch::kFloat8_e4m3fn,
"Currently, only fp8 gemm is implemented for Blackwell");
vllm::cutlass_scaled_mm_sm100_fp8(c, a, b, a_scales, b_scales, bias);
}
#endif

View File

@ -29,6 +29,11 @@ void cutlass_scaled_mm_sm90(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
std::optional<torch::Tensor> const& bias);
void cutlass_scaled_mm_sm100(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& b,
torch::Tensor const& a_scales,
torch::Tensor const& b_scales,
std::optional<torch::Tensor> const& bias);
#endif
void cutlass_scaled_mm_azp_sm75(torch::Tensor& c, torch::Tensor const& a,
@ -86,7 +91,7 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
// and at least SM90 (Hopper)
#if defined CUDA_VERSION
if (cuda_device_capability >= 90) {
if (cuda_device_capability >= 90 && cuda_device_capability < 100) {
return CUDA_VERSION >= 12000;
}
#endif
@ -120,10 +125,22 @@ void cutlass_scaled_mm(torch::Tensor& c, torch::Tensor const& a,
// Guard against compilation issues for sm90 kernels
#if defined ENABLE_SCALED_MM_C3X && ENABLE_SCALED_MM_C3X
if (version_num >= 90) {
#if defined CUDA_VERSION && CUDA_VERSION < 12080
if (version_num >= 90 && version_num < 100) {
cutlass_scaled_mm_sm90(c, a, b, a_scales, b_scales, bias);
return;
}
#else
if (version_num >= 90 && version_num < 100) {
cutlass_scaled_mm_sm90(c, a, b, a_scales, b_scales, bias);
return;
} else if (version_num >= 100) {
cutlass_scaled_mm_sm100(c, a, b, a_scales, b_scales, bias);
return;
}
#endif
#endif
#if defined ENABLE_SCALED_MM_C2X && ENABLE_SCALED_MM_C2X

File diff suppressed because it is too large.

View File

@ -0,0 +1,163 @@
#include "allspark_utils.cuh"
#include <torch/all.h>
#include "core/registration.h"
namespace allspark {
// Rearrange B to facilitate Ampere Tensor Core data loading
// reorder B from (K, N) to (N_32align / 4, K * 4)
// K % 16 == 0, N % 16 == 0, N_32align % 32 == 0
template <typename FType>
__global__ void __launch_bounds__(128)
rearrange_kn_weight_as_n32k16_order_ldg16_kernel(
const uint8_t* B, const FType* B_scale, const FType* B_zero,
uint8_t* B_result, FType* B_scale_result, FType* B_zero_result,
const int K, const int N, const int N_32align) {
const int lane_id = threadIdx.x % 32;
const int warp_id = threadIdx.x / 32;
if (blockIdx.x != gridDim.x - 1) {
// Load B
// per block process 64(k) * 128(n) B elements
// per warp process 16(k) * 128 B elements
const int src_row_base_idx =
blockIdx.x * 64 + warp_id * 16 + ((lane_id % 8) / 2) * 2;
const int src_col_idx =
blockIdx.y * 128 + (lane_id / 8) * 32 + (lane_id % 2) * 16;
uint8_t B_frag[4][16];
#pragma unroll
for (int i = 0; i < 4; ++i) {
int src_row_idx = src_row_base_idx + (i / 2) * 8 + (i % 2);
int src_offset = src_row_idx * N + src_col_idx;
bool guard = src_row_idx < K && src_col_idx < N;
ldg128_cg_0(*reinterpret_cast<uint32_t*>(B_frag[i]),
*(reinterpret_cast<uint32_t*>(B_frag[i]) + 1),
*(reinterpret_cast<uint32_t*>(B_frag[i]) + 2),
*(reinterpret_cast<uint32_t*>(B_frag[i]) + 3), B + src_offset,
guard);
}
// reorder B
uint8_t B_reorder_frag[8][8];
#pragma unroll
for (int i = 0; i < 4; ++i) {
#pragma unroll
for (int j = 0; j < 16; ++j) {
int dst_i = j % 8;
int dst_j = i + (j / 8) * 4;
B_reorder_frag[dst_i][dst_j] = B_frag[i][j];
}
}
// Store B
const int dst_row_base_idx = blockIdx.y * (128 / 4) + (lane_id / 8) * 8;
const int dst_col_idx =
blockIdx.x * (64 * 4) + warp_id * 64 + (lane_id % 8) * 8;
for (int i = 0; i < 8; ++i) {
int dst_row_idx = dst_row_base_idx + i;
int dst_offset = dst_row_idx * K * 4 + dst_col_idx;
bool guard = (dst_row_base_idx < N_32align / 4) && (dst_col_idx < K * 4);
if (guard) {
*reinterpret_cast<int2*>(B_result + dst_offset) =
*reinterpret_cast<int2*>(B_reorder_frag[i]);
}
}
} else {
// Load B_scale and B_zero
FType b_scale_reg, b_zero_reg;
int src_offset = blockIdx.y * 128 + threadIdx.x;
ldg16_cg_0(b_scale_reg, B_scale + src_offset, src_offset < N);
if (B_zero != nullptr)
ldg16_cg_0(b_zero_reg, B_zero + src_offset, src_offset < N);
int dst_offset =
blockIdx.y * 128 + warp_id * 32 + (lane_id % 8) * 4 + lane_id / 8;
if (dst_offset < N_32align) {
B_scale_result[dst_offset] = b_scale_reg;
if (B_zero != nullptr) B_zero_result[dst_offset] = b_zero_reg;
}
}
}
template <typename FType>
void rearrange_kn_weight_as_n32k16_order_ldg16(
const uint8_t* B, const FType* B_scale, const FType* B_zero,
uint8_t* B_result, FType* B_scale_result, FType* B_zero_result,
const int64_t K, const int64_t N, const int64_t N_32align,
cudaStream_t stream) {
if (N % 16 != 0 || K % 16 != 0) {
std::cerr << "Now only support N and K is multiples of 16" << std::endl;
}
const int BLOCK = 128;
int grid_x = (K + 64 - 1) / 64 + 1;
int grid_y = (N + 128 - 1) / 128;
dim3 grid(grid_x, grid_y);
rearrange_kn_weight_as_n32k16_order_ldg16_kernel<FType>
<<<grid, BLOCK, 0, stream>>>(B, B_scale, B_zero, B_result, B_scale_result,
B_zero_result, K, N, N_32align);
}
} // namespace allspark
void rearrange_kn_weight_as_n32k16_order(
torch::Tensor const& b_qweight, torch::Tensor const& b_scales,
c10::optional<torch::Tensor> const& b_zeros, bool has_zp,
torch::Tensor& b_qweight_reorder, torch::Tensor& b_scales_reorder,
c10::optional<torch::Tensor> const& b_zeros_reorder, const int64_t K,
const int64_t N, const int64_t N_32align) {
// Verify device and strides
TORCH_CHECK(b_qweight.device().is_cuda(), "b_qweight is not on GPU");
TORCH_CHECK(b_qweight.is_contiguous(), "b_qweight is not contiguous");
TORCH_CHECK(b_scales.device().is_cuda(), "b_scales is not on GPU");
TORCH_CHECK(b_scales.is_contiguous(), "b_scales is not contiguous");
TORCH_CHECK(b_qweight_reorder.device().is_cuda(),
"b_qweight_reorder is not on GPU");
TORCH_CHECK(b_qweight_reorder.is_contiguous(),
"b_qweight_reorder is not contiguous");
TORCH_CHECK(b_scales_reorder.device().is_cuda(),
"b_scales_reorder is not on GPU");
TORCH_CHECK(b_scales_reorder.is_contiguous(),
"b_scales_reorder is not contiguous");
if (has_zp) {
TORCH_CHECK(b_zeros.value().device().is_cuda(), "b_zeros is not on GPU");
TORCH_CHECK(b_zeros.value().is_contiguous(), "b_zeros is not contiguous");
TORCH_CHECK(b_zeros_reorder.value().device().is_cuda(),
"b_zeros_reorder is not on GPU");
TORCH_CHECK(b_zeros_reorder.value().is_contiguous(),
"b_zeros_reorder is not contiguous");
}
const uint8_t* matB = reinterpret_cast<const uint8_t*>(b_qweight.data_ptr());
const void* b_scale = b_scales.data_ptr();
const void* b_zero = has_zp ? b_zeros.value().data_ptr() : nullptr;
uint8_t* matB_reorder =
reinterpret_cast<uint8_t*>(b_qweight_reorder.data_ptr());
void* b_scale_reorder = b_scales_reorder.data_ptr();
void* b_zero_reorder = has_zp ? b_zeros_reorder.value().data_ptr() : nullptr;
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
if (b_scales.dtype() == at::ScalarType::Half) {
allspark::rearrange_kn_weight_as_n32k16_order_ldg16<__half>(
matB, reinterpret_cast<const __half*>(b_scale),
reinterpret_cast<const __half*>(b_zero), matB_reorder,
reinterpret_cast<__half*>(b_scale_reorder),
reinterpret_cast<__half*>(b_zero_reorder), K, N, N_32align, stream);
} else if (b_scales.dtype() == at::ScalarType::BFloat16) {
allspark::rearrange_kn_weight_as_n32k16_order_ldg16<__nv_bfloat16>(
matB, reinterpret_cast<const __nv_bfloat16*>(b_scale),
reinterpret_cast<const __nv_bfloat16*>(b_zero), matB_reorder,
reinterpret_cast<__nv_bfloat16*>(b_scale_reorder),
reinterpret_cast<__nv_bfloat16*>(b_zero_reorder), K, N, N_32align,
stream);
}
}
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("rearrange_kn_weight_as_n32k16_order",
&rearrange_kn_weight_as_n32k16_order);
}

View File

@ -0,0 +1,408 @@
#pragma once
#include <torch/all.h>
#include <c10/cuda/CUDAGuard.h>
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <iostream>
namespace allspark {
#define CHECK_CUDA(cmd) \
do { \
cudaError_t cuda_status = cmd; \
if (cuda_status != cudaSuccess) { \
std::string err_str = cudaGetErrorString(cuda_status); \
std::cerr << "Failed: " << __FILE__ << ":" << __LINE__ << " " \
<< err_str; \
exit(-1); \
} \
} while (0)
#define CHECK_CUBLAS(cmd) \
do { \
cublasStatus_t cublas_status = cmd; \
if (cublas_status != CUBLAS_STATUS_SUCCESS) { \
std::cerr << "Failed: " << __FILE__ << ":" << __LINE__ << " " \
<< cublas_status << std::endl; \
exit(-1); \
} \
} while (0)
template <typename FType, typename QType>
struct SM8x_GEMM_W8A16_Splitk_Params {
const FType* A_ptr;
const QType* B_ptr;
const FType* B_scale_ptr;
const FType* B_zero_ptr;
FType* C_ptr;
int M;
int N;
int K;
int SplitK;
int GroupCnt;
int GroupSize;
FType* C_split_ptr; // for non-fused splitk reduce
float* C_tmp_ptr; // for fused splitk reduce
uint32_t* red_count_ptr; // for fused splitk reduce
};
struct alignas(16) BlockTileSplitkParams {
int Mtile;
int Ntile;
int SplitK;
bool EnableFuse;
};
template <typename FType, int BLOCK, int N_MATRIX>
__global__ void f16_gemm_splitk_reduce_kernel(const FType* C_split, FType* C,
uint32_t n, uint32_t n_matrix,
uint32_t matrix_size) {
int idx = blockIdx.x * BLOCK + threadIdx.x;
if (idx >= matrix_size) {
return;
}
FType sum(0);
int n_mat = N_MATRIX > 0 ? N_MATRIX : (int)n_matrix;
for (int i = 0; i < n_mat; ++i) {
sum += C_split[idx + i * matrix_size];
}
C[idx] = sum;
}
template <typename FType>
void f16_gemm_splitk_reduce(const FType* C_split, FType* C, const uint32_t m,
const uint32_t n, const uint32_t n_matrix,
cudaStream_t stream) {
const int BLOCK = 128;
uint32_t matrix_size = m * n;
int grid = (matrix_size + BLOCK - 1) / BLOCK;
void (*kernel)(const FType*, FType*, uint32_t, uint32_t, uint32_t) = nullptr;
switch (n_matrix) {
case 4:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 4>;
break;
case 5:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 5>;
break;
case 6:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 6>;
break;
case 7:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 7>;
break;
case 8:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 8>;
break;
case 9:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 9>;
break;
case 10:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 10>;
break;
case 11:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 11>;
break;
case 12:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, 12>;
break;
default:
kernel = f16_gemm_splitk_reduce_kernel<FType, BLOCK, -1>;
break;
}
kernel<<<grid, BLOCK, 0, stream>>>(C_split, C, n, n_matrix, matrix_size);
}
template <typename T>
struct HalfType;
template <>
struct HalfType<half> {
using T1 = __half;
using T2 = __half2;
};
template <>
struct HalfType<__nv_bfloat16> {
using T1 = __nv_bfloat16;
using T2 = __nv_bfloat162;
};
// convert 64-bit pointer to 32-bit smem addr
__device__ __forceinline__ uint32_t smem_u32addr(const void* smem_ptr) {
uint32_t addr;
asm("{.reg .u64 u64addr;\n"
" cvta.to.shared.u64 u64addr, %1;\n"
" cvt.u32.u64 %0, u64addr;}\n"
: "=r"(addr)
: "l"(smem_ptr));
return addr;
}
template <typename T>
__device__ __forceinline__ void ldg16_cg_0(T& r0, const void* ptr, bool guard) {
static_assert(sizeof(T) == 2, "ldg16_cg_0: invalid T");
asm volatile(
"{.reg .pred p;\n"
" setp.ne.b32 p, %2, 0;\n"
" @!p mov.b16 %0, 0;\n"
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDACC_VER_MINOR__ >= 4 && \
__CUDA_ARCH__ >= 750
" @p ld.global.cg.L2::128B.b16 {%0}, [%1];}\n"
#else
" @p ld.global.ca.b16 {%0}, [%1];}\n"
#endif
: "=h"(reinterpret_cast<uint16_t&>(r0))
: "l"(ptr), "r"((int)guard));
}
template <typename T>
__device__ __forceinline__ void ldg64_ca(T& r0, T& r1, const void* ptr,
bool guard) {
static_assert(sizeof(T) == 4, "ldg64_ca: invalid T");
asm volatile(
"{.reg .pred p;\n"
" setp.ne.b32 p, %3, 0;\n"
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDACC_VER_MINOR__ >= 4 && \
__CUDA_ARCH__ >= 750
" @p ld.global.ca.L2::128B.v2.b32 {%0, %1}, [%2];}\n"
#else
" @p ld.global.ca.v2.b32 {%0, %1}, [%2];}\n"
#endif
: "=r"(reinterpret_cast<uint32_t&>(r0)),
"=r"(reinterpret_cast<uint32_t&>(r1))
: "l"(ptr), "r"((int)guard));
}
template <typename T>
__device__ __forceinline__ void ldg128_cg_0(T& r0, T& r1, T& r2, T& r3,
const void* ptr, bool guard) {
static_assert(sizeof(T) == 4, "ldg128_cg_0: invalid T");
asm volatile(
"{.reg .pred p;\n"
" setp.ne.b32 p, %5, 0;\n"
" @!p mov.b32 %0, 0;\n"
" @!p mov.b32 %1, 0;\n"
" @!p mov.b32 %2, 0;\n"
" @!p mov.b32 %3, 0;\n"
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDACC_VER_MINOR__ >= 4 && \
__CUDA_ARCH__ >= 750
" @p ld.global.cg.L2::128B.v4.b32 {%0, %1, %2, %3}, [%4];}\n"
#else
" @p ld.global.cg.v4.b32 {%0, %1, %2, %3}, [%4];}\n"
#endif
: "=r"(reinterpret_cast<uint32_t&>(r0)),
"=r"(reinterpret_cast<uint32_t&>(r1)),
"=r"(reinterpret_cast<uint32_t&>(r2)),
"=r"(reinterpret_cast<uint32_t&>(r3))
: "l"(ptr), "r"((int)guard));
}
template <typename T>
__device__ __forceinline__ void lds128(T& reg0, T& reg1, T& reg2, T& reg3,
const uint32_t addr) {
static_assert(sizeof(T) == 4, "lds128: invalid T");
asm volatile("ld.shared.v4.b32 {%0, %1, %2, %3}, [%4];\n"
: "=r"(reinterpret_cast<uint32_t&>(reg0)),
"=r"(reinterpret_cast<uint32_t&>(reg1)),
"=r"(reinterpret_cast<uint32_t&>(reg2)),
"=r"(reinterpret_cast<uint32_t&>(reg3))
: "r"(addr));
}
template <typename T>
__device__ __forceinline__ void stg128(const T& r0, const T& r1, const T& r2,
const T& r3, const void* ptr,
bool guard) {
static_assert(sizeof(T) == 4, "stg128: invalid T");
asm volatile(
"{.reg .pred p;\n"
" setp.ne.b32 p, %1, 0;\n"
" @p st.global.v4.b32 [%0], {%2, %3, %4, %5};}\n"
:
: "l"(ptr), "r"((int)guard), "r"(reinterpret_cast<const uint32_t&>(r0)),
"r"(reinterpret_cast<const uint32_t&>(r1)),
"r"(reinterpret_cast<const uint32_t&>(r2)),
"r"(reinterpret_cast<const uint32_t&>(r3)));
}
template <typename T>
__device__ __forceinline__ void ldsm_4(T& r0, T& r1, T& r2, T& r3,
const uint32_t& addr) {
static_assert(sizeof(T) == 4, "ldsm_4: invalid T");
#if (__CUDA_ARCH__ >= 750) && (__CUDACC_VER_MAJOR__ >= 11)
asm volatile(
"ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
: "=r"(reinterpret_cast<uint32_t&>(r0)),
"=r"(reinterpret_cast<uint32_t&>(r1)),
"=r"(reinterpret_cast<uint32_t&>(r2)),
"=r"(reinterpret_cast<uint32_t&>(r3))
: "r"(addr));
#endif
}
template <typename FType>
__device__ __forceinline__ void hmma16816_f32(float (&d)[4],
const uint32_t (&a)[4],
const uint32_t (&b)[2]);
template <>
__device__ __forceinline__ void hmma16816_f32<__half>(float (&d)[4],
const uint32_t (&a)[4],
const uint32_t (&b)[2]) {
#if (__CUDA_ARCH__ >= 800) && (__CUDACC_VER_MAJOR__ >= 11)
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 {%0, %1, %2, %3}, "
"{%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};\n"
: "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
#endif
}
template <>
__device__ __forceinline__ void hmma16816_f32<__nv_bfloat16>(
float (&d)[4], const uint32_t (&a)[4], const uint32_t (&b)[2]) {
#if (__CUDA_ARCH__ >= 800) && (__CUDACC_VER_MAJOR__ >= 11)
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 {%0, %1, %2, %3}, "
"{%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3};\n"
: "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
#endif
}
template <int SIZE_IN_BYTES>
__device__ __forceinline__ void cp_async(const uint32_t smem_addr,
const void* gmem_ptr,
const int src_in_bytes, bool guard) {
static_assert(
(SIZE_IN_BYTES == 4 || SIZE_IN_BYTES == 8 || SIZE_IN_BYTES == 16),
"Size is not supported");
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDA_ARCH__ >= 800
asm volatile(
"{.reg.pred p;\n"
" setp.ne.b32 p, %4, 0;\n"
#if __CUDACC_VER_MINOR__ >= 4
" @p cp.async.cg.shared.global.L2::256B [%0], [%1], %2, %3;}\n"
#else
" @p cp.async.cg.shared.global [%0], [%1], %2, %3;}\n"
#endif
::"r"(smem_addr),
"l"(gmem_ptr), "n"(SIZE_IN_BYTES), "r"(src_in_bytes), "r"((int)guard));
#endif
}
template <int SIZE_IN_BYTES>
__device__ __forceinline__ void cp_async_ca(const uint32_t smem_addr,
const void* gmem_ptr,
const int src_in_bytes,
bool guard) {
static_assert(
(SIZE_IN_BYTES == 4 || SIZE_IN_BYTES == 8 || SIZE_IN_BYTES == 16),
"Size is not supported");
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDA_ARCH__ >= 800
asm volatile(
"{.reg.pred p;\n"
" setp.ne.b32 p, %4, 0;\n"
#if __CUDACC_VER_MINOR__ >= 4
" @p cp.async.ca.shared.global.L2::256B [%0], [%1], %2, %3;}\n"
#else
" @p cp.async.ca.shared.global [%0], [%1], %2, %3;}\n"
#endif
::"r"(smem_addr),
"l"(gmem_ptr), "n"(SIZE_IN_BYTES), "r"(src_in_bytes), "r"((int)guard));
#endif
}
__device__ __forceinline__ void cp_async_commit_group() {
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDA_ARCH__ >= 800
asm volatile("cp.async.commit_group;\n");
#endif
}
template <int N>
__device__ __forceinline__ void cp_asyc_wait_group() {
#if __CUDACC_VER_MAJOR__ >= 11 && __CUDA_ARCH__ >= 800
asm volatile("cp.async.wait_group %0;\n" : : "n"(N));
#endif
}
template <typename T>
__device__ __forceinline__ void cvt_8bx4_to_16bx4_bias128(const uint32_t& idata,
T* fdata);
template <>
// fast conversion: 4xuint8 to 4xhalf, subtracting bias = 128
__device__ __forceinline__ void cvt_8bx4_to_16bx4_bias128<__half2>(
const uint32_t& idata, __half2* fdata) {
uint32_t i10, i32;
asm volatile(
"prmt.b32 %0, %2, 0x64, 0x4140;"
"prmt.b32 %1, %2, 0x64, 0x4342;"
: "=r"(i10), "=r"(i32)
: "r"(idata));
static constexpr uint32_t MAGIC_NUM = 0x64806480;
fdata[0] = __hsub2(reinterpret_cast<const __half2&>(i10),
reinterpret_cast<const __half2&>(MAGIC_NUM));
fdata[1] = __hsub2(reinterpret_cast<const __half2&>(i32),
reinterpret_cast<const __half2&>(MAGIC_NUM));
}
template <>
// fast conversion: 4xuint8 to 4xbfloat16, subtracting bias = 128
// reference from marlin fast implementation
__device__ __forceinline__ void cvt_8bx4_to_16bx4_bias128<__nv_bfloat162>(
const uint32_t& idata, __nv_bfloat162* fdata) {
float fp32_imd[4];
uint32_t* fp32_imd_casted = reinterpret_cast<uint32_t*>(fp32_imd);
asm volatile(
"prmt.b32 %0, %4, 0x4B000000, 0x7650;"
"prmt.b32 %1, %4, 0x4B000000, 0x7651;"
"prmt.b32 %2, %4, 0x4B000000, 0x7652;"
"prmt.b32 %3, %4, 0x4B000000, 0x7653;"
: "=r"(fp32_imd_casted[0]), "=r"(fp32_imd_casted[1]),
"=r"(fp32_imd_casted[2]), "=r"(fp32_imd_casted[3])
: "r"(idata));
fp32_imd[0] -= 8388736.f;
fp32_imd[1] -= 8388736.f;
fp32_imd[2] -= 8388736.f;
fp32_imd[3] -= 8388736.f;
uint32_t* bf16_res = reinterpret_cast<uint32_t*>(fdata);
asm volatile(
"prmt.b32 %0, %2, %3, 0x7632;"
"prmt.b32 %1, %4, %5, 0x7632;"
: "=r"(bf16_res[0]), "=r"(bf16_res[1])
: "r"(fp32_imd_casted[0]), "r"(fp32_imd_casted[1]),
"r"(fp32_imd_casted[2]), "r"(fp32_imd_casted[3]));
}
static __device__ nv_bfloat162 inline num2num2(const nv_bfloat16 x) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
assert(false);
#else
return __bfloat162bfloat162(x);
#endif
__builtin_unreachable(); // Suppress missing return statement warning
}
static __device__ half2 inline num2num2(const half x) {
return __half2half2(x);
}
} // namespace allspark

View File

@ -8,7 +8,7 @@ from collections.abc import Iterable
from copy import deepcopy
from dataclasses import dataclass, fields
from functools import reduce
from typing import Dict, List, Optional, Tuple, Union
from typing import Optional, Union
import jinja2
# yapf conflicts with isort for this block
@ -247,8 +247,8 @@ TmaCoop = EpilogueScheduleType.TmaWarpSpecializedCooperative
@dataclass(frozen=True)
class ScheduleConfig:
tile_shape_mn: Tuple[int, int]
cluster_shape_mnk: Tuple[int, int, int]
tile_shape_mn: tuple[int, int]
cluster_shape_mnk: tuple[int, int, int]
kernel_schedule: MixedInputKernelScheduleType
epilogue_schedule: EpilogueScheduleType
tile_scheduler: TileSchedulerType
@ -277,8 +277,8 @@ class PrepackTypeConfig:
@dataclass
class ImplConfig:
types: TypeConfig
schedules: List[ScheduleConfig]
heuristic: List[Tuple[Optional[str], ScheduleConfig]]
schedules: list[ScheduleConfig]
heuristic: list[tuple[Optional[str], ScheduleConfig]]
def generate_sch_sig(schedule_config: ScheduleConfig) -> str:
@ -333,7 +333,7 @@ def is_power_of_two(n):
return (n != 0) and (n & (n - 1) == 0)
def to_cute_constant(value: List[int]):
def to_cute_constant(value: list[int]):
def _to_cute_constant(value: int):
if is_power_of_two(value):
@ -347,7 +347,7 @@ def to_cute_constant(value: List[int]):
return _to_cute_constant(value)
def unique_schedules(impl_configs: List[ImplConfig]):
def unique_schedules(impl_configs: list[ImplConfig]):
return list(
set(sch for impl_config in impl_configs
for sch in impl_config.schedules))
@ -391,7 +391,7 @@ mm_impl_template = create_template(IMPL_TEMPLATE)
prepack_dispatch_template = create_template(PREPACK_TEMPLATE)
def create_sources(impl_configs: List[ImplConfig], num_impl_files=8):
def create_sources(impl_configs: list[ImplConfig], num_impl_files=8):
sources = []
sources.append((
@ -435,7 +435,7 @@ def create_sources(impl_configs: List[ImplConfig], num_impl_files=8):
num_impls = reduce(lambda x, y: x + len(y.schedules), impl_configs, 0)
num_impls_per_file = math.ceil(num_impls / num_impl_files)
files_impls: List[List[ImplConfig]] = [[]]
files_impls: list[list[ImplConfig]] = [[]]
curr_num_impls_assigned = 0
curr_impl_in_file = 0
@ -515,7 +515,7 @@ def generate():
for cond, tile_config in default_tile_heuristic_config.items()
]
def get_unique_schedules(heuristic: Dict[str, ScheduleConfig]):
def get_unique_schedules(heuristic: dict[str, ScheduleConfig]):
# Do not use schedules = list(set(...)) because we need to make sure
# the output list is deterministic; otherwise the generated kernel file
# will be non-deterministic and causes ccache miss.

View File

@ -126,15 +126,10 @@ struct MacheteKernelTemplate {
std::is_same_v<ElementSChannel, ElementSToken>),
"Currently token and channel scales (if present) must be the same type");
using EpilogueDescriptor =
cutlass::epilogue::collective::detail::EpilogueDescriptor<
TileShape, cutlass::epilogue::collective::EpilogueTileAuto, ElementD,
ElementD, EpilogueSchedule>;
// Currently only supports float scales
using ChTokScalesEpilogue =
typename vllm::c3x::ScaledEpilogue<ElementAccumulator, ElementD,
EpilogueDescriptor>;
TileShape>;
static_assert((with_channel_scales || with_token_scales) ||
(std::is_same_v<ElementSChannel, float> &&
std::is_same_v<ElementSToken, float>),

File diff suppressed because it is too large Load Diff

View File

@ -65,12 +65,7 @@ struct cutlass_sparse_3x_gemm {
typename std::conditional<std::is_same_v<ElementAB, int8_t>, int32_t,
float>::type;
using EpilogueDescriptor =
cutlass::epilogue::collective::detail::EpilogueDescriptor<
TileShape, cutlass::epilogue::collective::EpilogueTileAuto, ElementD,
ElementD, EpilogueSchedule>;
using Epilogue = Epilogue_<ElementAcc, ElementD, EpilogueDescriptor>;
using Epilogue = Epilogue_<ElementAcc, ElementD, TileShape>;
using ElementC = void;
using LayoutC = cutlass::layout::RowMajor;

View File

@ -447,6 +447,25 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
"Tensor!? azp) -> ()");
ops.impl("dynamic_scaled_int8_quant", torch::kCUDA,
&dynamic_scaled_int8_quant);
#ifndef USE_ROCM
// reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel
ops.def(
"rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, "
"Tensor? b_zeros, "
"bool has_zp, Tensor! b_qweight_reorder, Tensor! b_scales_reorder, "
"Tensor!? b_zeros_reorder, "
"int K, int N, int N_32align) -> ()");
// conditionally compiled so impl in source file
// AllSpark quantization ops
ops.def(
"allspark_w8a16_gemm(Tensor a, Tensor b_qweight, Tensor b_scales, "
"Tensor? b_qzeros, "
"SymInt n, SymInt group_size, SymInt sm_count, SymInt sm_version, SymInt "
"CUBLAS_M_THRESHOLD, bool has_zp, bool n32k16_reorder) -> Tensor");
// conditionally compiled so impl in source file
#endif
}
TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {

Binary file not shown.


Binary file not shown.


Binary file not shown.


Binary file not shown.


View File

@ -17,7 +17,6 @@ import inspect
import logging
import os
import sys
from typing import List
import requests
from sphinx.ext import autodoc
@ -58,7 +57,7 @@ templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns: List[str] = ["**/*.template.md", "**/*.inc.md"]
exclude_patterns: list[str] = ["**/*.template.md", "**/*.inc.md"]
# Exclude the prompt "$" when copying code
copybutton_prompt_text = r"\$ "

View File

@ -859,7 +859,7 @@ prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
)
```
To accommodate this, instead of a string you can return an instance of `PromptUpdateDetails`
To accommodate this, instead of a string you can return an instance of {class}`~vllm.multimodal.processing.PromptUpdateDetails`
with different `full` and `feature` attributes:
```python
@ -948,3 +948,35 @@ to register them to the multi-modal registry:
+ dummy_inputs=YourDummyInputsBuilder)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
## Notes
### Inserting feature tokens without replacement
Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use {class}`~vllm.multimodal.processing.PromptInsertion` instead of {class}`~vllm.multimodal.processing.PromptReplacement` inside {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
Examples:
- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
### Handling prompt updates unrelated to multi-modal data
{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only` so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](#mm-processing).
Examples:
- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
### Custom HF processor
Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor`.
Examples:
- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>

View File

@ -7,4 +7,5 @@ kserve
kubeai
llamastack
llmaz
production-stack
:::

View File

@ -0,0 +1,154 @@
(deployment-production-stack)=
# Production stack
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
* **Upstream vLLM compatibility**: It wraps around upstream vLLM without modifying its code.
* **Ease of use**: Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance**: Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.
If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!
## Pre-requisite
Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
## Deployment using vLLM production stack
The standard vLLM production stack install uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.
To install the vLLM production stack, run the following commands on your desktop:
```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (Facebook opt-125M model).
### Validate Installation
Monitor the deployment status using:
```bash
sudo kubectl get pods
```
You should see the pods for the `vllm` deployment transition to the `Running` state.
```text
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
```
**NOTE**: It may take some time for the containers to download the Docker images and LLM weights.
### Send a Query to the Stack
Forward the `vllm-router-service` port to the host machine:
```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
Then you can send a query to the OpenAI-compatible API to check the available models:
```bash
curl -o- http://localhost:30080/models
```
Expected output:
```json
{
"object": "list",
"data": [
{
"id": "facebook/opt-125m",
"object": "model",
"created": 1737428424,
"owned_by": "vllm",
"root": null
}
]
}
```
To send an actual chat request, you can issue a curl request to the OpenAI `/completions` endpoint:
```bash
curl -X POST http://localhost:30080/completions \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "Once upon a time,",
"max_tokens": 10
}'
```
Expected output:
```json
{
"id": "completion-id",
"object": "text_completion",
"created": 1737428424,
"model": "facebook/opt-125m",
"choices": [
{
"text": " there was a brave knight who...",
"index": 0,
"finish_reason": "length"
}
]
}
```
### Uninstall
To remove the deployment, run:
```bash
sudo helm uninstall vllm
```
------
### (Advanced) Configuring vLLM production stack
The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
```yaml
servingEngineSpec:
runtimeClassName: ""
modelSpec:
- name: "opt125m"
repository: "vllm/vllm-openai"
tag: "latest"
modelURL: "facebook/opt-125m"
replicaCount: 1
requestCPU: 6
requestMemory: "16Gi"
requestGPU: 1
pvcStorage: "10Gi"
```
In this YAML configuration:
* **`modelSpec`** includes:
* `name`: A nickname used to identify the model.
* `repository`: Docker repository of vLLM.
* `tag`: Docker image tag.
* `modelURL`: The LLM model that you want to use.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specify the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).
**NOTE:** vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!

View File

@ -2,17 +2,21 @@
# Using Kubernetes
Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
## Prerequisites
--------
Before you begin, ensure that you have the following:
Alternatively, you can also deploy vLLM using the [Helm chart](https://docs.vllm.ai/en/latest/deployment/frameworks/helm.html). There are also open-source projects available to make your deployment even smoother.
- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
- Available GPU resources in your cluster
* [vLLM production-stack](https://github.com/vllm-project/production-stack): Born out of a Berkeley-UChicago collaboration, vLLM production stack is a project that contains the latest research and community effort while still delivering production-level stability and performance. Check out the [documentation page](https://docs.vllm.ai/en/latest/deployment/integrations/production-stack.html) for more details and examples.
## Deployment Steps
--------
## Pre-requisite
Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
## Deployment using native K8s
1. Create a PVC, Secret and Deployment for vLLM

View File

@ -95,14 +95,14 @@ Notes:
- If you have your HuggingFace models cached somewhere else, update `hf_cache_dir` below.
- If you don't have an existing HuggingFace cache you will want to start `vllm0` and wait for the model to complete downloading and the server to be ready. This will ensure that `vllm1` can leverage the model you just downloaded and it won't have to be downloaded again.
- The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus all`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- The below example assumes a GPU backend is used. If you are using a CPU backend, remove `--gpus device=ID` and add the `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
- Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
:::{note}

View File

@ -0,0 +1,712 @@
# Metrics
Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
## Objectives
- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments.
- Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
## Background
Metrics in vLLM can be categorized as follows:
1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLOs that an SRE monitoring vLLM will be tracking.
The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
### v0 Metrics
In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
- `vllm:num_requests_waiting` (Gauge)
- `vllm:gpu_cache_usage_perc` (Gauge)
- `vllm:cpu_cache_usage_perc` (Gauge)
- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
- `vllm:prompt_tokens_total` (Counter)
- `vllm:generation_tokens_total` (Counter)
- `vllm:request_success_total` (Counter)
- `vllm:request_prompt_tokens` (Histogram)
- `vllm:request_generation_tokens` (Histogram)
- `vllm:time_to_first_token_seconds` (Histogram)
- `vllm:time_per_output_token_seconds` (Histogram)
- `vllm:e2e_request_latency_seconds` (Histogram)
- `vllm:request_queue_time_seconds` (Histogram)
- `vllm:request_inference_time_seconds` (Histogram)
- `vllm:request_prefill_time_seconds` (Histogram)
- `vllm:request_decode_time_seconds` (Histogram)
- `vllm:request_max_num_generation_tokens` (Histogram)
- `vllm:num_preemptions_total` (Counter)
- `vllm:cache_config_info` (Gauge)
- `vllm:lora_requests_info` (Gauge)
- `vllm:tokens_total` (Counter)
- `vllm:iteration_tokens_total` (Histogram)
- `vllm:time_in_queue_requests` (Histogram)
- `vllm:model_forward_time_milliseconds` (Histogram)
- `vllm:model_execute_time_milliseconds` (Histogram)
- `vllm:request_params_n` (Histogram)
- `vllm:request_params_max_tokens` (Histogram)
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
### Grafana Dashboard
vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds
- `vllm:prompt_tokens_total` - Prompt Tokens/Sec
- `vllm:generation_tokens_total` - Generation Tokens/Sec
- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state
- `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
- `vllm:request_prompt_tokens` - Request prompt length
- `vllm:request_generation_tokens` - request generation length
- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached
- `vllm:request_queue_time_seconds` - Queue Time
- `vllm:request_prefill_time_seconds` - Requests Prefill Time
- `vllm:request_decode_time_seconds` - Requests Decode Time
- `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
### Prometheus Client Library
Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.
### Multi-process Mode
In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See <gh-pr:7279>.
### Built in Python/Process Metrics
The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used:
- `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total`
- `python_gc_collections_total`
- `python_info`
- `process_virtual_memory_bytes`
- `process_resident_memory_bytes`
- `process_start_time_seconds`
- `process_cpu_seconds_total`
- `process_open_fds`
- `process_max_fds`
This is relevant because if we move away from multiprocess mode in v1,
we get these back. However, it's questionable how relevant they are
if they don't aggregate stats across all the processes that make up a
vLLM instance.
### v0 PRs and Issues
For background, these are some of the relevant PRs which added the v0 metrics:
- <gh-pr:1890>
- <gh-pr:2316>
- <gh-pr:2730>
- <gh-pr:4464>
- <gh-pr:7279>
Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
## v1 Design
### v1 PRs
For background, here are the relevant v1 PRs relating to the v1
metrics issue <gh-issue:10582>:
- <gh-pr:11962>
- <gh-pr:11973>
- <gh-pr:10907>
- <gh-pr:12416>
- <gh-pr:12478>
- <gh-pr:12516>
- <gh-pr:12530>
- <gh-pr:12561>
- <gh-pr:12579>
- <gh-pr:12592>
- <gh-pr:12644>
### Metrics Collection
In v1, we wish to move computation and overhead out of the engine core
process to minimize the time between each forward pass.
The overall idea of V1 EngineCore design is:
- EngineCore is the inner loop. Performance is most critical here.
- AsyncLLM is the outer loop. This is (ideally) overlapped with GPU
  execution, so this is where any "overheads" should live if possible,
  making AsyncLLM.output_handler_loop the ideal place for the metrics
  bookkeeping.
We will achieve this by collecting metrics in the frontend API server,
and base these metrics on information we can glean from the
`EngineCoreOutputs` returned by the engine core process to the
frontend.
### Interval Calculations
Many of our metrics are the time interval between various events in
the processing of a request. It is best practice to use timestamps
based on "monotonic time" (`time.monotonic()`) rather than "wall-clock
time" (`time.time()`) to calculate intervals as the former is
unaffected by system clock changes (e.g. from NTP).
It's also important to note that monotonic clocks differ between
processes - each process has its own reference point. So it is
meaningless to compare monotonic timestamps from different processes.
Therefore, in order to calculate an interval, we must compare two
monotonic timestamps from the same process.
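As a minimal sketch (variable names are illustrative, not actual vLLM code), an interval is always the difference of two `time.monotonic()` readings taken in the same process:

```python
import time

# Both timestamps come from the monotonic clock, in the same process,
# so their difference is a valid interval even if the wall clock jumps (NTP).
queued_ts = time.monotonic()       # e.g. recorded when the request is queued

# ... the request is scheduled some time later, in the same process ...
scheduled_ts = time.monotonic()

queue_interval = scheduled_ts - queued_ts
print(f"queue interval: {queue_interval:.6f}s")
```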
### Scheduler Stats
The engine core process will collect some key statistics from the
scheduler - e.g. the number of requests that were scheduled or waiting
after the last scheduler pass - and include those statistics in
`EngineCoreOutputs`.
### Engine Core Events
The engine core will also record the timestamp of certain per-request
events so that the frontend can calculate the interval between these
events.
The events are:
- `QUEUED` - when the request was received by the engine core and
added to the scheduler queue.
- `SCHEDULED` - when the request was first scheduled for execution.
- `PREEMPTED` - the request has been put back in the waiting queue
  in order to make room for other requests to complete. It will be
  re-scheduled in the future and will restart its prefill phase.
- `NEW_TOKENS` - when the output included in `EngineCoreOutput` was
generated. Since this is common to all requests in a given
iteration, we use a single timestamp on `EngineCoreOutputs` to
record this event.
And the calculated intervals are:
- Queue interval - between `QUEUED` and most recent `SCHEDULED`.
- Prefill interval - between most recent `SCHEDULED` and the subsequent
first `NEW_TOKENS`.
- Decode interval - between first (after the most recent `SCHEDULED`) and
last `NEW_TOKENS`.
- Inference interval - between most recent `SCHEDULED` and last `NEW_TOKENS`.
- Inter-token interval - between successive `NEW_TOKENS`.
Put another way:
:::{image} /assets/design/v1/metrics/intervals-1.png
:alt: Interval calculations - common case
:::
We explored the possibility of having the frontend calculate these
intervals using the timing of events visible by the frontend. However,
the frontend does not have visibility into the timing of the `QUEUED`
and `SCHEDULED` events and, since we need to calculate intervals based
on monotonic timestamps from the same process ... we need the engine
core to record timestamps for all of these events.
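To make the interval definitions concrete, here is a rough sketch of how they could be derived from per-request event timestamps; the class and field names are illustrative and do not reflect the actual vLLM data structures:

```python
from dataclasses import dataclass, field

@dataclass
class RequestTimes:
    """Illustrative per-request timestamps, all taken from the engine core's
    monotonic clock. `new_token_ts` holds one entry per NEW_TOKENS event."""
    queued_ts: float
    scheduled_ts: float                      # most recent SCHEDULED event
    new_token_ts: list[float] = field(default_factory=list)

    @property
    def queue_interval(self) -> float:
        return self.scheduled_ts - self.queued_ts

    @property
    def prefill_interval(self) -> float:
        return self.new_token_ts[0] - self.scheduled_ts

    @property
    def decode_interval(self) -> float:
        return self.new_token_ts[-1] - self.new_token_ts[0]

    @property
    def inference_interval(self) -> float:
        return self.new_token_ts[-1] - self.scheduled_ts
```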
#### Interval Calculations vs Preemptions
When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.
:::{image} /assets/design/v1/metrics/intervals-2.png
:alt: Interval calculations - preempted decode
:::
When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.
:::{image} /assets/design/v1/metrics/intervals-3.png
:alt: Interval calculations - preempted prefill
:::
### Frontend Stats Collection
As the frontend processes a single `EngineCoreOutputs` - i.e. the
output from a single engine core iteration - it collects various
statistics relating to that iteration:
- The total number of new tokens generated in this iteration.
- The total number of prompt tokens processed by the prefills that
completed in this iteration.
- The queue intervals for any requests that were scheduled in this
iteration.
- The prefill intervals for any requests that completed prefill in
this iteration.
- The inter-token intervals (Time Per Output Token, TPOT), for all
requests included in this iteration.
- The Time-To-First-Token (TTFT) for any requests that completed
prefill in this iteration. However, we calculate this interval
relative to when the request was first received by the frontend
(`arrival_time`) in order to account for input processing time.
For any requests that were completed in a given iteration, we also
record:
- The inference and decode intervals - relative to the scheduled and
first token events, as described above.
- End-to-end latency - the interval between frontend `arrival_time`
and the frontend receiving the final token.
### Metrics Publishing - Logging
The `LoggingStatLogger` metrics publisher outputs a log `INFO` message
every 5 seconds with some key metrics (a rough sketch of this periodic logging follows the list):
- The current number of running/waiting requests
- The current GPU cache usage
- The number of prompt tokens processed per second over the past 5
seconds
- The number of new tokens generated per second over the past 5
seconds
- The prefix cache hit rate over the most recent 1k kv-cache block queries
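The sketch below illustrates this style of periodic logging; the class and method names are placeholders and not the actual `LoggingStatLogger` implementation:

```python
import logging
import time

logger = logging.getLogger("vllm.metrics")

class PeriodicStatLogger:
    """Illustrative only: accumulate token counts and log throughput every 5s."""

    def __init__(self, interval: float = 5.0):
        self.interval = interval
        self.last_log = time.monotonic()
        self.prompt_tokens = 0
        self.generation_tokens = 0

    def record(self, prompt_tokens: int, generation_tokens: int) -> None:
        self.prompt_tokens += prompt_tokens
        self.generation_tokens += generation_tokens

    def maybe_log(self, num_running: int, num_waiting: int,
                  gpu_cache_usage: float) -> None:
        now = time.monotonic()
        elapsed = now - self.last_log
        if elapsed < self.interval:
            return
        logger.info(
            "Running: %d reqs, Waiting: %d reqs, GPU KV cache usage: %.1f%%, "
            "Prompt throughput: %.1f tok/s, Generation throughput: %.1f tok/s",
            num_running, num_waiting, gpu_cache_usage * 100,
            self.prompt_tokens / elapsed, self.generation_tokens / elapsed)
        self.prompt_tokens = 0
        self.generation_tokens = 0
        self.last_log = now
```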
### Metrics Publishing - Prometheus
The `PrometheusStatLogger` metrics publisher makes the metrics
available via a `/metrics` HTTP endpoint in a Prometheus-compatible
format. A Prometheus instance can then be configured to poll this
endpoint (e.g. every second) and record the values in its time-series
database. Prometheus is often used via Grafana, allowing these metrics
to be graphed over time.
Prometheus supports the following metric types:
- Counter: a value that will increase over time, never reducing, and
generally reset to zero when the vLLM instance restarts. For
example, the number of tokens generated over the lifetime of the
instance.
- Gauge: a value that goes up and down, for example the number of
requests currently scheduled for execution.
- Histogram: a count of metric samples, recorded in buckets. For
example, the number of requests whose TTFT was <1ms, <5ms, <10ms,
<20ms, and so on.
Prometheus metrics can also be labelled, allowing metrics to be
combined according to matching labels. In vLLM, we add a `model_name`
label to every metric which includes the name of the model served by
that instance.
Example output:
```bash
$ curl http://0.0.0.0:8000/metrics
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
...
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 27453.0
...
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 131.0
vllm:request_success_total{finished_reason="abort",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
...
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 13.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 97.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 123.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 138.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
```
Note - the choice of histogram buckets to be most useful to users
across a broad set of use cases is not straightforward and will
require refinement over time.
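As a concrete illustration of these metric types and the `model_name` label, this is roughly how such metrics might be declared with `prometheus_client`; the bucket values and usage below are a sketch rather than vLLM's actual implementation:

```python
from prometheus_client import Counter, Gauge, Histogram

labelnames = ["model_name"]

num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests in model execution batches.", labelnames)
generation_tokens = Counter(
    "vllm:generation_tokens_total",
    "Number of generation tokens processed.", labelnames)
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds",
    "Histogram of time to first token in seconds.", labelnames,
    buckets=[0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1])

# Every observation carries the model_name label.
labels = {"model_name": "meta-llama/Llama-3.1-8B-Instruct"}
num_requests_running.labels(**labels).set(8)
generation_tokens.labels(**labels).inc(16)
time_to_first_token.labels(**labels).observe(0.037)
```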
### Cache Config Info
`prometheus_client` has support for [Info
metrics](https://prometheus.github.io/client_python/instrumenting/info/)
which are equivalent to a `Gauge` whose value is permanently set to 1,
but exposes interesting key/value pair information via labels. This is
used for information about an instance that does not change - so it
only needs to be observed at startup - and allows comparing across
instances in Prometheus.
We use this concept for the `vllm:cache_config_info` metric:
```
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
```
However, `prometheus_client` has [never supported Info metrics in
multiprocessing
mode](https://github.com/prometheus/client_python/pull/300) - for
[unclear
reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead.
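A minimal sketch of that workaround (label names abbreviated here; requires a `prometheus_client` version that supports `multiprocess_mode="mostrecent"`):

```python
from prometheus_client import Gauge

# Static configuration exposed as label values on a Gauge that is always 1.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "enable_prefix_caching"],
    multiprocess_mode="mostrecent")

cache_config_info.labels(
    block_size="16",
    cache_dtype="auto",
    enable_prefix_caching="False").set(1)
```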
### LoRA Metrics
The `vllm:lora_requests_info` `Gauge` is somewhat similar, except the
value is the current wall-clock time, and is updated every iteration.
The label names used are:
- `running_lora_adapters`: a per-adapter count of the number of requests
  running using that adapter, formatted as a comma-separated string.
- `waiting_lora_adapters`: similar, except counting requests that are
waiting to be scheduled.
- `max_lora` - the static "max number of LoRAs in a single batch"
  configuration.
Encoding running/waiting counts for multiple adapters in a
comma-separated string seems quite misguided - we could use labels to
distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes.
This was added in
<gh-pr:9477> and there is
[at least one known
user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If
we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric.
### Prefix Cache metrics
The discussion in <gh-issue:10582> about adding prefix cache metrics yielded
some interesting points which may be relevant to how we approach
future metrics.
Every time the prefix cache is queried, we record the number of blocks
queried and the number of queried blocks present in the cache
(i.e. hits).
However, the metric of interest is the hit rate - i.e. the number of
hits per query.
In the case of logging, we expect the user is best served by
calculating the hit rate over a fixed number of the most recent
queries (the interval is fixed to 1k most recent queries for now).
In the case of Prometheus though, we should take advantage of the
time-series nature of Prometheus and allow the user to calculate the
hit rate over an interval of their choosing. For example, a PromQL
query to calculate the hit rate over the past 5 minutes:
```text
rate(cache_query_hit[5m]) / rate(cache_query_total[5m])
```
To achieve this, we should record the queries and hits as counters in
Prometheus, rather than recording the hit rate as a gauge.
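A sketch of that counter-based approach (metric and function names are illustrative, not necessarily the ones vLLM uses); the PromQL query above then computes the hit rate on the Prometheus side:

```python
from prometheus_client import Counter

prefix_cache_queries = Counter(
    "vllm:prefix_cache_queries",
    "KV-cache blocks queried against the prefix cache.", ["model_name"])
prefix_cache_hits = Counter(
    "vllm:prefix_cache_hits",
    "Queried KV-cache blocks found in the prefix cache.", ["model_name"])

def record_prefix_cache_lookup(model_name: str, queried: int, hits: int) -> None:
    # Export raw counters; the hit rate is derived in PromQL, not here.
    prefix_cache_queries.labels(model_name=model_name).inc(queried)
    prefix_cache_hits.labels(model_name=model_name).inc(hits)
```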
## Deprecated Metrics
### How To Deprecate
Deprecating metrics shouldn't be taken lightly. Users may not notice a
metric has been deprecated, and may be quite inconvenienced when it is
suddenly (from their perspective) removed, even if there is
an equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](gh-pr:2764) (with a
comment in the code),
[removed](gh-pr:12383), and then
[noticed by a
user](gh-issue:13218).
In general:
1) We should be cautious about deprecating metrics, especially since
it can be hard to predict the user impact.
2) We should include a prominent deprecation notice in the help string
that is included in the `/metrics` output.
3) We should list deprecated metrics in user-facing documentation and
release notes.
4) We should consider hiding deprecated metrics behind a CLI argument
in order to give administrators [an escape
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
### Unimplemented - `vllm:tokens_total`
Added by <gh-pr:4464>, but apparently never implemented. This can just be
removed.
### Duplicated - Queue Time
The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is:
```
self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time
```
Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
us with:
```
if seq_group.is_finished():
if (seq_group.metrics.first_scheduled_time is not None and
seq_group.metrics.first_token_time is not None):
time_queue_requests.append(
seq_group.metrics.first_scheduled_time -
seq_group.metrics.arrival_time)
...
if seq_group.metrics.time_in_queue is not None:
time_in_queue_requests.append(
seq_group.metrics.time_in_queue)
```
This seems duplicative, and one of them should be removed. The latter
is used by the Grafana dashboard, so we should deprecate or remove the
former from v0.
### Prefix Cache Hit Rate
See above - we now expose 'queries' and 'hits' counters rather than a
'hit rate' gauge.
### KV Cache Offloading
Two v0 metrics relate to a "swapped" preemption mode that is no
longer relevant in v1:
- `vllm:num_requests_swapped`
- `vllm:cpu_cache_usage_perc`
In this mode, when a request is preempted (e.g. to make room in KV
cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.
In v0, [vLLM has long supported beam
search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
swapping was intended for these beam search like cases.
Later, the concept of prefix caching was introduced, which allowed KV
cache blocks to be shared implicitly. This proved to be a better
option than CPU swapping since blocks can be evicted slowly on demand
and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`). [Beam search was moved out of
the core (in
V0)](gh-issue:8306). There was a
lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero overhead) and therefore
on by default, the preemption and recompute strategy should work
better.
## Future Work
### Parallel Sampling
Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.
As part of adding parallel sampling support in <gh-pr:10980> we should
also add these metrics.
- `vllm:request_params_n` (Histogram)
Observes the value of the 'n' parameter of every finished request.
- `vllm:request_max_num_generation_tokens` (Histogram)
Observes the maximum output length of all sequences in every finished
sequence group. In the absence of parallel sampling, this is
equivalent to `vllm:request_generation_tokens`.
### Speculative Decoding
Some v0 metrics are specific to "speculative decoding". This is where
we generate candidate tokens using a faster, approximate method or
model and then validate those tokens with the larger model.
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context.
Note - we should probably expose acceptance rate as separate accepted
and draft counters, like we do for prefix caching hit rate. Efficiency
likely also needs similar treatment.
### Autoscaling and Load-balancing
A common use case for our metrics is to support automated scaling of
vLLM instances.
For related discussion from the [Kubernetes Serving Working
Group](https://github.com/kubernetes/community/tree/master/wg-serving),
see:
- [Standardizing Large Model Server Metrics in
Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and
Autoscaling in
Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference
Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>.
This is a non-trivial topic. Consider this comment from Rob:
> I think this metric should focus on trying to estimate what the max
> concurrency that will cause the average request length > queries per
> second ... since this is really what will "saturate" the server.
A clear goal is that we should expose the metrics required to detect
this saturation point, so administrators can implement auto-scaling
rules based on those. However, in order to do so, we need to have a
clear view on how an administrator (and automated monitoring system)
should judge an instance as approaching saturation:
> To identify, what is the saturation point for model server compute
> (the inflection point where we cannot get more throughput with a
> higher request rate, but start to incur additional latency) so we
> can autoscale effectively?
### Metric Naming
Our approach to naming metrics probably deserves to be revisited:
1. The use of colons in metric names seems contrary to ["colons are
reserved for user defined recording
rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels)
2. Most of our metrics follow the convention of ending with units, but
not all do.
3. Some of our metric names end with `_total`:
```
If there is a suffix of `_total` on the metric name, it will be removed. When
exposing the time series for counter, a `_total` suffix will be added. This is
for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
requires the `_total` suffix.
```
### Adding More Metrics
There is no shortage of ideas for new metrics:
- Examples from other projects like
[TGI](https://github.com/IBM/text-generation-inference?tab=readme-ov-file#metrics)
- Proposals arising from specific use cases, like the Kubernetes
auto-scaling topic above
- Proposals that might arise out of standardisation efforts like
[OpenTelemetry Semantic Conventions for Gen
AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
We should be cautious in our approach to adding new metrics. While
metrics are often relatively straightforward to add:
1. They can be difficult to remove - see the section on deprecation
above.
2. They can have a meaningful performance impact when enabled. And
metrics are usually of very limited use unless they can be enabled
by default and in production.
3. They have an impact on development and maintenance of the
project. Every metric added to v0 has made this v1 effort more
time-consuming, and perhaps not all metrics justify this ongoing
investment in their maintenance.
## Tracing - OpenTelemetry
Metrics provide an aggregated view over time of the system's
performance and health. Tracing, on the other hand, tracks individual
requests as they move through different services and components. Both
fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing:
- Added by <gh-pr:4687>
- Configured with `--otlp-traces-endpoint` and
`--collect-detailed-traces`
- [OpenTelemetry blog
post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing
docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html)
- [Blog
post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product
docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
OpenTelemetry has a [Gen AI Working
Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately.
### OpenTelemetry Model Forward vs Execute Time
In v0, we have the following two metrics:
- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
in the model forward pass when this request was in the batch.
- `vllm:model_execute_time_milliseconds` (Histogram) - The time spent
in the model execute function. This will include model forward,
block/sync across workers, cpu-gpu sync time and sampling time.
These metrics are only enabled when OpenTelemetry tracing is enabled
and if `--collect-detailed-traces=all/model/worker` is used. The
documentation for this option states:
> collect detailed traces for the specified modules. This involves
> use of possibly costly and or blocking operations and hence might
> have a performance impact.
The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace
as:
```
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
```
We already have `inference_time` and `decode_time` metrics, so the
question is whether there are sufficiently common use cases for the
higher-resolution timings to justify the overhead.
Since we are going to treat the question of OpenTelemetry support
separately, we will include these particular metrics under that topic.

View File

@ -3,7 +3,7 @@
# AutoAWQ
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
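For reference, a minimal quantization sketch with AutoAWQ is shown below; the model and output paths are placeholders, and the exact API should be checked against the AutoAWQ documentation:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder model
quant_path = "mistral-7b-instruct-awq"              # output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16/BF16 model, quantize its weights to INT4, and save the result.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```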

View File

@ -0,0 +1,83 @@
(gptqmodel)=
# GPTQModel
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
Quantization reduces the model's precision from BF16/FP16 (16 bits) to INT4 (4 bits) or INT8 (8 bits), which significantly reduces the
total model memory footprint while at the same time increasing inference performance.
Compatible GPTQModel quantized models can leverage the `Marlin` and `Machete` vLLM custom kernels to maximize batching
transactions per second (`tps`) and token-latency performance for both Ampere (A100+) and Hopper (H100+) Nvidia GPUs.
These two kernels are highly optimized by vLLM and Neural Magic (now part of Red Hat) to allow world-class inference performance of quantized GPTQ
models.
GPTQModel is one of the few quantization toolkits in the world that allows `Dynamic` per-module quantization, where different layers and/or modules within an LLM can be further optimized with custom quantization parameters. `Dynamic` quantization
is fully integrated into vLLM and backed up by support from the ModelCloud.AI team. Please refer to [GPTQModel readme](https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override)
for more details on this and other advanced features.
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?sort=trending&search=gptq).
```console
pip install -U gptqmodel --no-build-isolation -v
```
After installing GPTQModel, you are ready to quantize a model. Please refer to the [GPTQModel readme](https://github.com/ModelCloud/GPTQModel/?tab=readme-ov-file#quantization) for further details.
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
```
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console
python examples/offline_inference/llm_engine_example.py --model DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```
GPTQModel quantized models are also supported directly through the LLM entrypoint:
```python
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
# Create an LLM.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

View File

@ -12,6 +12,7 @@ supported_hardware
auto_awq
bnb
gguf
gptqmodel
int4
int8
fp8

View File

@ -76,7 +76,13 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
}
```
Please note that it is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests.
Please note that it is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests. You can check out the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
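For illustration, a minimal streaming sketch with `requests` might look like the following; the server URL, model name, and the exact shape of the streamed chunks are assumptions based on the response format above:

```python
import json
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # placeholder model
        "messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
        "stream": True,
    },
    stream=True,
)

for line in response.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    # Reasoning tokens arrive in `reasoning_content`; the final answer in `content`.
    print(delta.get("reasoning_content") or delta.get("content") or "", end="")
```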
## Limitations
- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
- It is not compatible with [`tool_calling`](#tool_calling).
- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.
## How to support a new reasoning model
@ -117,7 +123,7 @@ class ExampleParser(ReasoningParser):
def extract_reasoning_content(
self, model_output: str, request: ChatCompletionRequest
) -> Tuple[Optional[str], Optional[str]]:
) -> tuple[Optional[str], Optional[str]]:
"""
Extract reasoning content from a complete model-generated string.
@ -132,20 +138,41 @@ class ExampleParser(ReasoningParser):
The request object that was used to generate the model_output.
Returns:
Tuple[Optional[str], Optional[str]]
tuple[Optional[str], Optional[str]]
A tuple containing the reasoning content and the content.
"""
```
After defining the reasoning parser, you can use it by specifying the `--reasoning-parser` flag when making a request to the chat completion endpoint.
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in `vllm/model_executor/guided_decoding/reasoner/deepseek_reasoner.py`.
```python
@dataclass
class DeepSeekReasoner(Reasoner):
"""
Reasoner for DeepSeek R series models.
"""
start_token_id: int
end_token_id: int
start_token: str = "<think>"
end_token: str = "</think>"
@classmethod
def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
return cls(start_token_id=tokenizer.encode(
"<think>", add_special_tokens=False)[0],
end_token_id=tokenizer.encode("</think>",
add_special_tokens=False)[0])
def is_reasoning_end(self, input_ids: list[int]) -> bool:
return self.end_token_id in input_ids
```
A structured output engine like xgrammar will use `end_token_id` to check whether the reasoning content is still present in the model output and skip structured output if that is the case.
Finally, you can enable reasoning for the model by using the `--enable-reasoning` and `--reasoning-parser` flags.
```bash
vllm serve <model_tag> \
--enable-reasoning --reasoning-parser example
```
## Limitations
- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
- It is not compatible with the [`structured_outputs`](#structured_outputs) and [`tool_calling`](#tool_calling) features.
- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.

View File

@ -193,7 +193,7 @@ class Step(BaseModel):
class MathResponse(BaseModel):
steps: List[Step]
steps: list[Step]
final_answer: str

View File

@ -74,7 +74,7 @@ class Example:
path (Path): The path to the main directory or file.
category (str): The category of the document.
main_file (Path): The main file in the directory.
other_files (list[Path]): List of other files in the directory.
other_files (list[Path]): list of other files in the directory.
title (str): The title of the document.
Methods:

View File

@ -6,7 +6,14 @@ sudo apt-get install -y gcc-12 g++-12 libnuma-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
Second, install Python packages for vLLM CPU backend building:
Second, clone vLLM project:
```console
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```
Third, install Python packages for vLLM CPU backend building:
```console
pip install --upgrade pip

View File

@ -53,9 +53,9 @@ Currently, there are no pre-built ROCm wheels.
If you see an HTTP issue related to downloading packages while building triton, please try again, as the HTTP error is intermittent.
:::
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
Install ROCm's flash attention (v2.7.2) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention#amd-rocm-support)
Alternatively, wheels intended for vLLM use can be accessed under the releases.
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
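A build flow consistent with those instructions might look like the following sketch (the `GPU_ARCHS` value and the exact branch or commit to check out are assumptions; use the ones matching your ROCm version from the linked repository):
```console
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
git submodule update --init
GPU_ARCHS=gfx90a python3 setup.py install
cd ..
```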

View File

@ -254,6 +254,10 @@ ValueError: Model architectures ['<arch>'] are not supported for now. Supported
But you are sure that the model is in the [list of supported models](#supported-models), there may be some issue with vLLM's model resolution. In that case, please follow [these steps](#model-resolution) to explicitly specify the vLLM implementation for the model.
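For instance, reusing the `hf_overrides` pattern that appears in the offline examples of this changeset, you can pin the architecture vLLM should resolve to; the architecture name below is a placeholder:
```python
from vllm import LLM

# "MyArchForCausalLM" is a placeholder; use the architecture listed for your model.
llm = LLM(
    model="path/or/name-of-model",
    hf_overrides={"architectures": ["MyArchForCausalLM"]},
)
```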
## Failed to infer device type
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
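For example, a debug run might look like this (the model tag is a placeholder):
```console
VLLM_LOGGING_LEVEL=DEBUG vllm serve <model_tag>
```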
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).

View File

@ -158,6 +158,7 @@ design/multiprocessing
:maxdepth: 2
design/v1/prefix_caching
design/v1/metrics
:::
% How to contribute to the vLLM project

View File

@ -14,8 +14,11 @@ Alongside each architecture, we include some popular models that use it.
By default, vLLM loads models from [HuggingFace (HF) Hub](https://huggingface.co/models).
To determine whether a given model is supported, you can check the `config.json` file inside the HF repository.
If the `"architectures"` field contains a model architecture listed below, then it should be supported in theory.
To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
The <project:#transformers-fallback> enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
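For example, you can inspect that field with Transformers; a quick sketch, using a model that appears elsewhere in these examples:
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.architectures)  # e.g. ['OPTForCausalLM']
```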
:::{tip}
The easiest way to check if your model is really supported at runtime is to run the program below:
@ -40,46 +43,41 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
(transformers-fallback)=
### Transformers fallback
`vllm` can fallback to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
vLLM can fall back to model implementations that are available in Transformers. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this:
To check if the backend is Transformers, you can simply do this:
```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
llm.apply_model(lambda model: print(type(model)))
```
If it is `TransformersModel` then it means it's based on `transformers`!
If it is `TransformersModel` then it means it's based on Transformers!
:::{note}
vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
:::
#### Supported features
##### Quantization
The Transformers fallback explicitly supports the following features:
The Transformers fallback supports most of the quantization methods available in vLLM (except GGUF). See the [Quantization page](#quantization-index) for more information about supported quantization in vLLM.
- <project:#quantization-index> (except GGUF)
- <project:#lora-adapter>
- <project:#distributed-serving> (pipeline parallel coming soon <gh-pr:12832>!)
##### LoRA
#### Remote code
LoRA isn't supported on the Transformers fallback yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
Earlier we mentioned that the Transformers fallback enables you to run remote code models directly in vLLM.
If you are interested in this feature, this section is for you!
Usually, `transformers` models load adapter weights via the `load_adapter` API, which depends on PEFT. We need to do some work to either use this API (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
A hint of what this could look like:
```python
class TransformersModel(nn.Module, SupportsLoRA):
def __init__(*):
...
self.model.load_adapter(vllm_config.load_config.model_loader_extra_config["qlora_adapter_name_or_path"])
```
The blocker is that you need to specify the supported LoRA layers, when ideally we would want to load whatever is inside the checkpoint!
##### Remote code
This fallback also means that any model on the hub that can be used in `transformers` with `trust_remote_code=True` that correctly implements attention can be used in production!
Simply set `trust_remote_code=True` and vLLM will run any model on the Model Hub that is compatible with Transformers.
Provided that the model writer implements their model in a compatible way, this means that you can run new models before they are officially supported in Transformers or vLLM!
```python
from vllm import LLM
@ -87,16 +85,17 @@ llm = LLM(model=..., task="generate", trust_remote_code=True) # Name or path of
llm.apply_model(lambda model: print(model.__class__))
```
A model just needs the following two things:
To make your model compatible with the Transformers fallback, it needs:
```{code-block} python
:caption: modeling_my_model.py
```python
from transformers import PreTrainedModel
from torch import nn
class MyAttention(nn.Module):
def forward(self, hidden_states, **kwargs): # <- kwargs are required
...
attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
attn_output, attn_weights = attention_interface(
@ -115,8 +114,26 @@ class MyModel(PreTrainedModel):
Here is what happens in the background:
1. The config is loaded
2. `MyModel` python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `/model_executors/models/transformers`, which leverage `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTION`.
2. `MyModel` Python class is loaded from the `auto_map`, and we check that the model `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See <gh-file:vllm/model_executor/models/transformers.py>, which leverages `self.config._attn_implementation = "vllm"`, thus the need to use `ALL_ATTENTION_FUNCTIONS`.
To make your model compatible with tensor parallel, it needs:
```{code-block} python
:caption: configuration_my_model.py
from transformers import PretrainedConfig
class MyConfig(PretrainedConfig):
base_model_tp_plan = {
"layers.*.self_attn.q_proj": "colwise",
...
}
```
:::{tip}
`base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
:::
That's it!
@ -281,6 +298,11 @@ See [this page](#generative-models) for more information on how to use generativ
* `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc.
* ✅︎
* ✅︎
- * `GraniteMoeSharedForCausalLM`
* Granite MoE Shared
* `ibm-research/moe-7b-1b-active-shared-experts` (test model)
* ✅︎
* ✅︎
- * `GritLM`
* GritLM
* `parasail-ai/GritLM-7B-vllm`.
@ -388,7 +410,7 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
- * `Phi3ForCausalLM`
* Phi-4, Phi-3
* `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.
* `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.
* ✅︎
* ✅︎
- * `Phi3SmallForCausalLM`
@ -834,6 +856,13 @@ See [this page](#generative-models) for more information on how to use generativ
*
* ✅︎
* ✅︎
- * `Phi4MMForCausalLM`
* Phi-4-multimodal
* T + I<sup>+</sup> / T + A<sup>+</sup> / I<sup>+</sup> + A<sup>+</sup>
* `microsoft/Phi-4-multimodal-instruct`, etc.
* ✅︎
*
*
- * `PixtralForConditionalGeneration`
* Pixtral
* T + I<sup>+</sup>
@ -906,7 +935,7 @@ Currently the PaliGemma model series is implemented without PrefixLM attention m
:::
:::{note}
To use Qwen2.5-VL series models, you have to install Huggingface `transformers` library from source via `pip install git+https://github.com/huggingface/transformers`.
To use Qwen2.5-VL series models, you have to install Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
:::
### Pooling Models

View File

@ -18,6 +18,7 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
- Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
- Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
- Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
- Increase `pipeline_parallel_size`. This approach distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, which indirectly leaves more memory available for KV cache.
You can also monitor the number of preemption requests through Prometheus metrics exposed by vLLM. Additionally, you can log the cumulative number of preemption requests by setting `disable_log_stats=False`.
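For example, these knobs map directly onto engine arguments; the values below are illustrative only:
```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    gpu_memory_utilization=0.95,     # give the KV cache more room
    max_num_seqs=128,                # cap concurrent sequences per batch
    tensor_parallel_size=1,          # shard weights across GPUs if > 1
)
```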

View File

@ -24,50 +24,7 @@ question_per_audio_count = {
# Unless specified, these settings have been tested to work on a single L4.
# Ultravox 0.5-1B
def run_ultravox(question: str, audio_count: int):
model_name = "fixie-ai/ultravox-v0_5-llama-3_2-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{
'role': 'user',
'content': "<|audio|>\n" * audio_count + question
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
llm = LLM(model=model_name,
max_model_len=4096,
max_num_seqs=5,
trust_remote_code=True,
limit_mm_per_prompt={"audio": audio_count})
stop_token_ids = None
return llm, prompt, stop_token_ids
# Qwen2-Audio
def run_qwen2_audio(question: str, audio_count: int):
model_name = "Qwen/Qwen2-Audio-7B-Instruct"
llm = LLM(model=model_name,
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count})
audio_in_prompt = "".join([
f"Audio {idx+1}: "
f"<|audio_bos|><|AUDIO|><|audio_eos|>\n" for idx in range(audio_count)
])
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n")
stop_token_ids = None
return llm, prompt, stop_token_ids
# MiniCPM-O
def run_minicpmo(question: str, audio_count: int):
model_name = "openbmb/MiniCPM-o-2_6"
tokenizer = AutoTokenizer.from_pretrained(model_name,
@ -94,10 +51,71 @@ def run_minicpmo(question: str, audio_count: int):
return llm, prompt, stop_token_ids
# Qwen2-Audio
def run_qwen2_audio(question: str, audio_count: int):
model_name = "Qwen/Qwen2-Audio-7B-Instruct"
llm = LLM(model=model_name,
max_model_len=4096,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count})
audio_in_prompt = "".join([
f"Audio {idx+1}: "
f"<|audio_bos|><|AUDIO|><|audio_eos|>\n" for idx in range(audio_count)
])
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
"<|im_start|>user\n"
f"{audio_in_prompt}{question}<|im_end|>\n"
"<|im_start|>assistant\n")
stop_token_ids = None
return llm, prompt, stop_token_ids
# Ultravox 0.5-1B
def run_ultravox(question: str, audio_count: int):
model_name = "fixie-ai/ultravox-v0_5-llama-3_2-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [{
'role': 'user',
'content': "<|audio|>\n" * audio_count + question
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
llm = LLM(model=model_name,
max_model_len=4096,
max_num_seqs=5,
trust_remote_code=True,
limit_mm_per_prompt={"audio": audio_count})
stop_token_ids = None
return llm, prompt, stop_token_ids
# Whisper
def run_whisper(question: str, audio_count: int):
assert audio_count == 1, (
"Whisper only support single audio input per prompt")
model_name = "openai/whisper-large-v3-turbo"
prompt = "<|startoftranscript|>"
llm = LLM(model=model_name,
max_model_len=448,
max_num_seqs=5,
limit_mm_per_prompt={"audio": audio_count})
stop_token_ids = None
return llm, prompt, stop_token_ids
model_example_map = {
"ultravox": run_ultravox,
"minicpmo": run_minicpmo,
"qwen2_audio": run_qwen2_audio,
"minicpmo": run_minicpmo
"ultravox": run_ultravox,
"whisper": run_whisper,
}

View File

@ -10,10 +10,14 @@ prompts = [
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params = SamplingParams() #temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# llm = LLM(model="facebook/opt-125m")
llm = LLM(model="Qwen/Qwen2-1.5B-Instruct",
max_num_seqs=16,
max_model_len=128,
enforce_eager=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
@ -21,4 +25,4 @@ outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

View File

@ -1,5 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
# usage: VLLM_USE_V1=1 python examples/offline_inference/data_parallel.py
# usage:
# VLLM_TEST_ENABLE_EP=1 VLLM_USE_V1=1 \
# python examples/offline_inference/data_parallel.py
# we need to have a launcher to create multiple data parallel
# ranks. And each rank will create a vLLM instance to process its own prompts.
import os
@ -7,6 +9,9 @@ import os
from vllm import LLM, SamplingParams
from vllm.utils import get_open_port
GPUs_per_dp_rank = 2
DP_size = 2
def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
os.environ["VLLM_DP_RANK"] = str(dp_rank)
@ -48,8 +53,8 @@ def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
max_tokens=16 * (dp_rank + 1))
# Create an LLM.
llm = LLM(model="facebook/opt-125m",
tensor_parallel_size=2,
llm = LLM(model="ibm-research/PowerMoE-3b",
tensor_parallel_size=GPUs_per_dp_rank,
enforce_eager=True)
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
@ -62,14 +67,12 @@ def main(dp_size, dp_rank, dp_master_ip, dp_master_port, GPUs_per_dp_rank):
if __name__ == "__main__":
from multiprocessing import Process
dp_size = 2
GPUs_per_dp_rank = 2
dp_master_ip = "127.0.0.1"
dp_master_port = get_open_port()
procs = []
for i in range(dp_size):
for i in range(DP_size):
proc = Process(target=main,
args=(dp_size, i, dp_master_ip, dp_master_port,
args=(DP_size, i, dp_master_ip, dp_master_port,
GPUs_per_dp_rank))
proc.start()
procs.append(proc)

View File

@ -6,7 +6,7 @@ distributively on a multi-nodes cluster.
Learn more about Ray Data in https://docs.ray.io/en/latest/data/data.html
"""
from typing import Any, Dict, List
from typing import Any
import numpy as np
import ray
@ -36,13 +36,13 @@ class LLMPredictor:
self.llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
tensor_parallel_size=tensor_parallel_size)
def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, list]:
# Generate texts from the prompts.
# The output is a list of RequestOutput objects that contain the prompt,
# generated text, and other information.
outputs = self.llm.generate(batch["text"], sampling_params)
prompt: List[str] = []
generated_text: List[str] = []
prompt: list[str] = []
generated_text: list[str] = []
for output in outputs:
prompt.append(output.prompt)
generated_text.append(' '.join([o.text for o in output.outputs]))
@ -72,7 +72,7 @@ def scheduling_strategy_fn():
pg, placement_group_capture_child_tasks=True))
resources_kwarg: Dict[str, Any] = {}
resources_kwarg: dict[str, Any] = {}
if tensor_parallel_size == 1:
# For tensor_parallel_size == 1, we simply set num_gpus=1.
resources_kwarg["num_gpus"] = 1

View File

@ -0,0 +1,158 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference with
the explicit/implicit prompt format on enc-dec LMMs for text generation.
"""
import time
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.assets.image import ImageAsset
from vllm.utils import FlexibleArgumentParser
def run_florence2():
# Create a Florence-2 encoder/decoder model instance
llm = LLM(
model="microsoft/Florence-2-large",
tokenizer="facebook/bart-large",
max_num_seqs=8,
trust_remote_code=True,
limit_mm_per_prompt={"image": 1},
dtype="half",
)
prompts = [
{ # implicit prompt with task token
"prompt": "<DETAILED_CAPTION>",
"multi_modal_data": {
"image": ImageAsset("stop_sign").pil_image
},
},
{ # explicit encoder/decoder prompt
"encoder_prompt": {
"prompt": "Describe in detail what is shown in the image.",
"multi_modal_data": {
"image": ImageAsset("cherry_blossom").pil_image
},
},
"decoder_prompt": "",
},
]
return llm, prompts
def run_mllama():
# Create a Mllama encoder/decoder model instance
llm = LLM(
model="meta-llama/Llama-3.2-11B-Vision-Instruct",
max_model_len=4096,
max_num_seqs=2,
limit_mm_per_prompt={"image": 1},
dtype="half",
)
prompts = [
{ # Implicit prompt
"prompt": "<|image|><|begin_of_text|>What is the content of this image?", # noqa: E501
"multi_modal_data": {
"image": ImageAsset("stop_sign").pil_image,
},
},
{ # Explicit prompt
"encoder_prompt": {
"prompt": "<|image|>",
"multi_modal_data": {
"image": ImageAsset("stop_sign").pil_image,
},
},
"decoder_prompt": "<|image|><|begin_of_text|>Please describe the image.", # noqa: E501
},
]
return llm, prompts
def run_whisper():
# Create a Whisper encoder/decoder model instance
llm = LLM(
model="openai/whisper-large-v3-turbo",
max_model_len=448,
max_num_seqs=16,
limit_mm_per_prompt={"audio": 1},
dtype="half",
)
prompts = [
{ # Test implicit prompt
"prompt": "<|startoftranscript|>",
"multi_modal_data": {
"audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
},
},
{ # Test explicit encoder/decoder prompt
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": AudioAsset("winning_call").audio_and_sample_rate,
},
},
"decoder_prompt": "<|startoftranscript|>",
}
]
return llm, prompts
model_example_map = {
"florence2": run_florence2,
"mllama": run_mllama,
"whisper": run_whisper,
}
def main(args):
model = args.model_type
if model not in model_example_map:
raise ValueError(f"Model type {model} is not supported.")
llm, prompts = model_example_map[model]()
# Create a sampling params object.
sampling_params = SamplingParams(
temperature=0,
top_p=1.0,
max_tokens=64,
)
start = time.time()
# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated
# text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Decoder prompt: {prompt!r}, "
f"Generated text: {generated_text!r}")
duration = time.time() - start
print("Duration:", duration)
print("RPS:", len(prompts) / duration)
if __name__ == "__main__":
parser = FlexibleArgumentParser(
description='Demo on using vLLM for offline inference with '
'vision language models for text generation')
parser.add_argument('--model-type',
'-m',
type=str,
default="mllama",
choices=model_example_map.keys(),
help='Huggingface "model_type".')
args = parser.parse_args()
main(args)

View File

@ -1,53 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
"""
Demonstrate prompting of text-to-text
encoder/decoder models, specifically Florence-2
"""
# TODO(Isotr0py):
# Move to offline_inference/vision_language.py
# after porting vision backbone
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
# Create a Florence-2 encoder/decoder model instance
llm = LLM(
model="microsoft/Florence-2-large",
tokenizer="facebook/bart-large",
max_num_seqs=8,
trust_remote_code=True,
)
prompts = [
{ # implicit prompt with task token
"prompt": "<DETAILED_CAPTION>",
"multi_modal_data": {
"image": ImageAsset("stop_sign").pil_image
},
},
{ # explicit encoder/decoder prompt
"encoder_prompt": {
"prompt": "Describe in detail what is shown in the image.",
"multi_modal_data": {
"image": ImageAsset("cherry_blossom").pil_image
},
},
"decoder_prompt": "",
},
]
# Create a sampling params object.
sampling_params = SamplingParams(
temperature=0,
top_p=1.0,
min_tokens=0,
max_tokens=128,
)
# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated
# text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
generated_text = output.outputs[0].text
print(f"Generated text: {generated_text!r}")

View File

@ -1,13 +1,12 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
from typing import List, Tuple
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.utils import FlexibleArgumentParser
def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
def create_test_prompts() -> list[tuple[str, SamplingParams]]:
"""Create a list of test prompts with their sampling parameters."""
return [
("A robot may not injure a human being",
@ -24,7 +23,7 @@ def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
def process_requests(engine: LLMEngine,
test_prompts: List[Tuple[str, SamplingParams]]):
test_prompts: list[tuple[str, SamplingParams]]):
"""Continuously process a list of prompts and handle the outputs."""
request_id = 0
@ -34,7 +33,7 @@ def process_requests(engine: LLMEngine,
engine.add_request(str(request_id), prompt, sampling_params)
request_id += 1
request_outputs: List[RequestOutput] = engine.step()
request_outputs: list[RequestOutput] = engine.step()
for request_output in request_outputs:
if request_output.finished:

View File

@ -7,7 +7,7 @@ Requires HuggingFace credentials for access.
"""
import gc
from typing import List, Optional, Tuple
from typing import Optional
import torch
from huggingface_hub import snapshot_download
@ -18,7 +18,7 @@ from vllm.lora.request import LoRARequest
def create_test_prompts(
lora_path: str
) -> List[Tuple[str, SamplingParams, Optional[LoRARequest]]]:
) -> list[tuple[str, SamplingParams, Optional[LoRARequest]]]:
return [
# this is an example of using quantization without LoRA
("My name is",
@ -49,7 +49,7 @@ def create_test_prompts(
def process_requests(engine: LLMEngine,
test_prompts: List[Tuple[str, SamplingParams,
test_prompts: list[tuple[str, SamplingParams,
Optional[LoRARequest]]]):
"""Continuously process a list of prompts and handle the outputs."""
request_id = 0
@ -63,7 +63,7 @@ def process_requests(engine: LLMEngine,
lora_request=lora_request)
request_id += 1
request_outputs: List[RequestOutput] = engine.step()
request_outputs: list[RequestOutput] = engine.step()
for request_output in request_outputs:
if request_output.finished:
print("----------------------------------------------------")

View File

@ -2,12 +2,11 @@
import gc
import time
from typing import List
from vllm import LLM, SamplingParams
def time_generation(llm: LLM, prompts: List[str],
def time_generation(llm: LLM, prompts: list[str],
sampling_params: SamplingParams):
# Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.

View File

@ -6,7 +6,7 @@ for offline inference.
Requires HuggingFace credentials for access to Llama2.
"""
from typing import List, Optional, Tuple
from typing import Optional
from huggingface_hub import snapshot_download
@ -16,7 +16,7 @@ from vllm.lora.request import LoRARequest
def create_test_prompts(
lora_path: str
) -> List[Tuple[str, SamplingParams, Optional[LoRARequest]]]:
) -> list[tuple[str, SamplingParams, Optional[LoRARequest]]]:
"""Create a list of test prompts with their sampling parameters.
2 requests for base model, 4 requests for the LoRA. We define 2
@ -56,7 +56,7 @@ def create_test_prompts(
def process_requests(engine: LLMEngine,
test_prompts: List[Tuple[str, SamplingParams,
test_prompts: list[tuple[str, SamplingParams,
Optional[LoRARequest]]]):
"""Continuously process a list of prompts and handle the outputs."""
request_id = 0
@ -70,7 +70,7 @@ def process_requests(engine: LLMEngine,
lora_request=lora_request)
request_id += 1
request_outputs: List[RequestOutput] = engine.step()
request_outputs: list[RequestOutput] = engine.step()
for request_output in request_outputs:
if request_output.finished:

View File

@ -21,7 +21,7 @@ import argparse
import datetime
import os
import re
from typing import List, Union
from typing import Union
import albumentations
import numpy as np
@ -260,9 +260,9 @@ def _convert_np_uint8(float_image: torch.Tensor):
def load_example(
file_paths: List[str],
mean: List[float] = None,
std: List[float] = None,
file_paths: list[str],
mean: list[float] = None,
std: list[float] = None,
indices: Union[list[int], None] = None,
):
"""Build an input example by loading images in *file_paths*.

View File

@ -5,8 +5,9 @@ import json
import os
import sys
from argparse import RawTextHelpFormatter
from collections.abc import Generator
from dataclasses import asdict, dataclass
from typing import Any, Dict, Generator, List, Optional, TypeAlias
from typing import Any, Optional, TypeAlias
import torch
import tqdm
@ -42,8 +43,8 @@ def get_dtype(dtype: str):
return dtype
OutputLen_NumReqs_Map: TypeAlias = Dict[int, int]
def compute_request_output_lengths(batch_size: int, step_requests: List[int]) \
OutputLen_NumReqs_Map: TypeAlias = dict[int, int]
def compute_request_output_lengths(batch_size: int, step_requests: list[int]) \
-> OutputLen_NumReqs_Map:
"""
Given the number of requests, batch_size, and the number of requests
@ -63,7 +64,7 @@ def compute_request_output_lengths(batch_size: int, step_requests: List[int]) \
Args:
batch_size (int): Number of requests submitted for profile. This is
args.batch_size.
step_requests (List[int]): step_requests[i] is the number of requests
step_requests (list[int]): step_requests[i] is the number of requests
that the ith engine step should process.
Returns:
@ -114,7 +115,7 @@ def compute_request_output_lengths(batch_size: int, step_requests: List[int]) \
return ol_nr
def determine_requests_per_step(context: ProfileContext) -> List[int]:
def determine_requests_per_step(context: ProfileContext) -> list[int]:
"""
Determine number of requests each engine step should process.
If context.num_steps is set, then all engine steps process the
@ -130,7 +131,7 @@ def determine_requests_per_step(context: ProfileContext) -> List[int]:
context: ProfileContext object.
Returns:
List[int]: Number of requests to process for all engine-steps.
list[int]: Number of requests to process for all engine-steps.
output[i], contains the number of requests that the ith step
should process.
"""
@ -170,7 +171,7 @@ def run_profile(context: ProfileContext, csv_output: Optional[str],
for key, value in asdict(context).items():
print(f" {key} = {value}")
requests_per_step: List[int] = determine_requests_per_step(context)
requests_per_step: list[int] = determine_requests_per_step(context)
ol_nr: OutputLen_NumReqs_Map = compute_request_output_lengths(
context.batch_size, requests_per_step)

View File

@ -4,7 +4,6 @@ import argparse
import dataclasses
import os
import time
from typing import List
import numpy as np
import torch_xla.debug.profiler as xp
@ -35,7 +34,7 @@ def main(args: argparse.Namespace):
dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size,
args.input_len))
dummy_prompts: List[PromptType] = [{
dummy_prompts: list[PromptType] = [{
"prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()]

View File

@ -21,7 +21,7 @@ from vllm.utils import FlexibleArgumentParser
# Aria
def run_aria(question: str, modality: str):
def run_aria(questions: list[str], modality: str):
assert modality == "image"
model_name = "rhymes-ai/Aria"
@ -32,41 +32,42 @@ def run_aria(question: str, modality: str):
dtype="bfloat16",
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
prompt = (f"<|im_start|>user\n<fim_prefix><|img|><fim_suffix>{question}"
"<|im_end|>\n<|im_start|>assistant\n")
prompts = [(f"<|im_start|>user\n<fim_prefix><|img|><fim_suffix>{question}"
"<|im_end|>\n<|im_start|>assistant\n")
for question in questions]
stop_token_ids = [93532, 93653, 944, 93421, 1019, 93653, 93519]
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# BLIP-2
def run_blip2(question: str, modality: str):
def run_blip2(questions: list[str], modality: str):
assert modality == "image"
# BLIP-2 prompt format is inaccurate on HuggingFace model repository.
# See https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/15#64ff02f3f8cf9e4f5b038262 #noqa
prompt = f"Question: {question} Answer:"
prompts = [f"Question: {question} Answer:" for question in questions]
llm = LLM(model="Salesforce/blip2-opt-2.7b",
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Chameleon
def run_chameleon(question: str, modality: str):
def run_chameleon(questions: list[str], modality: str):
assert modality == "image"
prompt = f"{question}<image>"
prompts = [f"{question}<image>" for question in questions]
llm = LLM(model="facebook/chameleon-7b",
max_model_len=4096,
max_num_seqs=2,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Deepseek-VL2
def run_deepseek_vl2(question: str, modality: str):
def run_deepseek_vl2(questions: list[str], modality: str):
assert modality == "image"
model_name = "deepseek-ai/deepseek-vl2-tiny"
@ -77,9 +78,12 @@ def run_deepseek_vl2(question: str, modality: str):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]})
prompt = f"<|User|>: <image>\n{question}\n\n<|Assistant|>:"
prompts = [
f"<|User|>: <image>\n{question}\n\n<|Assistant|>:"
for question in questions
]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Florence2
@ -99,20 +103,20 @@ def run_florence2(question: str, modality: str):
# Fuyu
def run_fuyu(question: str, modality: str):
def run_fuyu(questions: list[str], modality: str):
assert modality == "image"
prompt = f"{question}\n"
prompts = [f"{question}\n" for question in questions]
llm = LLM(model="adept/fuyu-8b",
max_model_len=2048,
max_num_seqs=2,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# GLM-4v
def run_glm4v(question: str, modality: str):
def run_glm4v(questions: list[str], modality: str):
assert modality == "image"
model_name = "THUDM/glm-4v-9b"
@ -124,15 +128,17 @@ def run_glm4v(question: str, modality: str):
hf_overrides={"architectures": ["GLM4VForCausalLM"]},
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
prompt = f"<|user|>\n<|begin_of_image|><|endoftext|><|end_of_image|>\
{question}<|assistant|>"
prompts = [
f"<|user|>\n<|begin_of_image|><|endoftext|><|end_of_image|>\
{question}<|assistant|>" for question in questions
]
stop_token_ids = [151329, 151336, 151338]
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# H2OVL-Mississippi
def run_h2ovl(question: str, modality: str):
def run_h2ovl(questions: list[str], modality: str):
assert modality == "image"
model_name = "h2oai/h2ovl-mississippi-800m"
@ -146,19 +152,24 @@ def run_h2ovl(question: str, modality: str):
tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)
messages = [{'role': 'user', 'content': f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
prompts = [
tokenizer.apply_chat_template([{
'role': 'user',
'content': f"<image>\n{question}"
}],
tokenize=False,
add_generation_prompt=True)
for question in questions
]
# Stop tokens for H2OVL-Mississippi
# https://huggingface.co/h2oai/h2ovl-mississippi-800m
stop_token_ids = [tokenizer.eos_token_id]
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Idefics3-8B-Llama3
def run_idefics3(question: str, modality: str):
def run_idefics3(questions: list[str], modality: str):
assert modality == "image"
model_name = "HuggingFaceM4/Idefics3-8B-Llama3"
@ -176,15 +187,15 @@ def run_idefics3(question: str, modality: str):
},
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
prompt = (
prompts = [(
f"<|begin_of_text|>User:<image>{question}<end_of_utterance>\nAssistant:"
)
) for question in questions]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# InternVL
def run_internvl(question: str, modality: str):
def run_internvl(questions: list[str], modality: str):
assert modality == "image"
model_name = "OpenGVLab/InternVL2-2B"
@ -198,10 +209,15 @@ def run_internvl(question: str, modality: str):
tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)
messages = [{'role': 'user', 'content': f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
prompts = [
tokenizer.apply_chat_template([{
'role': 'user',
'content': f"<image>\n{question}"
}],
tokenize=False,
add_generation_prompt=True)
for question in questions
]
# Stop tokens for InternVL
# models variants may have different stop tokens
@ -209,71 +225,82 @@ def run_internvl(question: str, modality: str):
# https://huggingface.co/OpenGVLab/InternVL2-2B/blob/main/conversation.py
stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# LLaVA-1.5
def run_llava(question: str, modality: str):
def run_llava(questions: list[str], modality: str):
assert modality == "image"
prompt = f"USER: <image>\n{question}\nASSISTANT:"
prompts = [
f"USER: <image>\n{question}\nASSISTANT:" for question in questions
]
llm = LLM(model="llava-hf/llava-1.5-7b-hf",
max_model_len=4096,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# LLaVA-1.6/LLaVA-NeXT
def run_llava_next(question: str, modality: str):
def run_llava_next(questions: list[str], modality: str):
assert modality == "image"
prompt = f"[INST] <image>\n{question} [/INST]"
prompts = [f"[INST] <image>\n{question} [/INST]" for question in questions]
llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf",
max_model_len=8192,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# LlaVA-NeXT-Video
# Currently only support for video input
def run_llava_next_video(question: str, modality: str):
def run_llava_next_video(questions: list[str], modality: str):
assert modality == "video"
prompt = f"USER: <video>\n{question} ASSISTANT:"
prompts = [
f"USER: <video>\n{question} ASSISTANT:" for question in questions
]
llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf",
max_model_len=8192,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# LLaVA-OneVision
def run_llava_onevision(question: str, modality: str):
def run_llava_onevision(questions: list[str], modality: str):
if modality == "video":
prompt = f"<|im_start|>user <video>\n{question}<|im_end|> \
<|im_start|>assistant\n"
prompts = [
f"<|im_start|>user <video>\n{question}<|im_end|> \
<|im_start|>assistant\n" for question in questions
]
elif modality == "image":
prompt = f"<|im_start|>user <image>\n{question}<|im_end|> \
<|im_start|>assistant\n"
prompts = [
f"<|im_start|>user <image>\n{question}<|im_end|> \
<|im_start|>assistant\n" for question in questions
]
llm = LLM(model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
max_model_len=16384,
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Mantis
def run_mantis(question: str, modality: str):
def run_mantis(questions: list[str], modality: str):
assert modality == "image"
llama3_template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' # noqa: E501
prompt = llama3_template.format(f"{question}\n<image>")
prompts = [
llama3_template.format(f"{question}\n<image>")
for question in questions
]
llm = LLM(
model="TIGER-Lab/Mantis-8B-siglip-llama3",
@ -282,11 +309,11 @@ def run_mantis(question: str, modality: str):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
stop_token_ids = [128009]
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# MiniCPM-V
def run_minicpmv_base(question: str, modality: str, model_name):
def run_minicpmv_base(questions: list[str], modality: str, model_name):
assert modality in ["image", "video"]
# If you want to use `MiniCPM-o-2_6` with audio inputs, check `audio_language.py` # noqa
@ -333,26 +360,28 @@ def run_minicpmv_base(question: str, modality: str, model_name):
"video": "(<video>./</video>)",
}
messages = [{
'role': 'user',
'content': f'{modality_placeholder[modality]}\n{question}'
}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
return llm, prompt, stop_token_ids
prompts = [
tokenizer.apply_chat_template(
[{
'role': 'user',
'content': f"{modality_placeholder[modality]}\n{question}"
}],
tokenize=False,
add_generation_prompt=True) for question in questions
]
return llm, prompts, stop_token_ids
def run_minicpmo(question: str, modality: str):
return run_minicpmv_base(question, modality, "openbmb/MiniCPM-o-2_6")
def run_minicpmo(questions: list[str], modality: str):
return run_minicpmv_base(questions, modality, "openbmb/MiniCPM-o-2_6")
def run_minicpmv(question: str, modality: str):
return run_minicpmv_base(question, modality, "openbmb/MiniCPM-V-2_6")
def run_minicpmv(questions: list[str], modality: str):
return run_minicpmv_base(questions, modality, "openbmb/MiniCPM-V-2_6")
# LLama 3.2
def run_mllama(question: str, modality: str):
def run_mllama(questions: list[str], modality: str):
assert modality == "image"
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
@ -379,16 +408,16 @@ def run_mllama(question: str, modality: str):
"type": "text",
"text": f"{question}"
}]
}]
prompt = tokenizer.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=False)
} for question in questions]
prompts = tokenizer.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=False)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Molmo
def run_molmo(question, modality):
def run_molmo(questions: list[str], modality: str):
assert modality == "image"
model_name = "allenai/Molmo-7B-D-0924"
@ -400,13 +429,16 @@ def run_molmo(question, modality):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
prompt = question
prompts = [
f"<|im_start|>user <image>\n{question}<|im_end|> \
<|im_start|>assistant\n" for question in questions
]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# NVLM-D
def run_nvlm_d(question: str, modality: str):
def run_nvlm_d(questions: list[str], modality: str):
assert modality == "image"
model_name = "nvidia/NVLM-D-72B"
@ -422,12 +454,15 @@ def run_nvlm_d(question: str, modality: str):
tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)
messages = [{'role': 'user', 'content': f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
messages = [{
'role': 'user',
'content': f"<image>\n{question}"
} for question in questions]
prompts = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# PaliGemma
@ -435,7 +470,7 @@ def run_paligemma(question: str, modality: str):
assert modality == "image"
# PaliGemma has special prompt format for VQA
prompt = "caption en"
prompt = ["caption en"]
llm = LLM(model="google/paligemma-3b-mix-224",
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
@ -447,7 +482,7 @@ def run_paligemma2(question: str, modality: str):
assert modality == "image"
# PaliGemma 2 has special prompt format for VQA
prompt = "caption en"
prompt = ["caption en"]
llm = LLM(model="google/paligemma2-3b-ft-docci-448",
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
stop_token_ids = None
@ -455,10 +490,13 @@ def run_paligemma2(question: str, modality: str):
# Phi-3-Vision
def run_phi3v(question: str, modality: str):
def run_phi3v(questions: list[str], modality: str):
assert modality == "image"
prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
prompts = [
f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
for question in questions
]
# num_crops is an override kwarg to the multimodal image processor;
# For some models, e.g., Phi-3.5-vision-instruct, it is recommended
@ -482,11 +520,11 @@ def run_phi3v(question: str, modality: str):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Pixtral HF-format
def run_pixtral_hf(question: str, modality: str):
def run_pixtral_hf(questions: list[str], modality: str):
assert modality == "image"
model_name = "mistral-community/pixtral-12b"
@ -499,13 +537,13 @@ def run_pixtral_hf(question: str, modality: str):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
prompt = f"<s>[INST]{question}\n[IMG][/INST]"
prompts = [f"<s>[INST]{question}\n[IMG][/INST]" for question in questions]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Qwen
def run_qwen_vl(question: str, modality: str):
def run_qwen_vl(questions: list[str], modality: str):
assert modality == "image"
llm = LLM(
@ -517,13 +555,13 @@ def run_qwen_vl(question: str, modality: str):
disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
)
prompt = f"{question}Picture 1: <img></img>\n"
prompts = [f"{question}Picture 1: <img></img>\n" for question in questions]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Qwen2-VL
def run_qwen2_vl(question: str, modality: str):
def run_qwen2_vl(questions: list[str], modality: str):
model_name = "Qwen/Qwen2-VL-7B-Instruct"
@ -544,16 +582,18 @@ def run_qwen2_vl(question: str, modality: str):
elif modality == "video":
placeholder = "<|video_pad|>"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
f"{question}<|im_end|>\n"
"<|im_start|>assistant\n")
prompts = [
("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
f"{question}<|im_end|>\n"
"<|im_start|>assistant\n") for question in questions
]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
# Qwen2.5-VL
def run_qwen2_5_vl(question: str, modality: str):
def run_qwen2_5_vl(questions: list[str], modality: str):
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
@ -574,12 +614,14 @@ def run_qwen2_5_vl(question: str, modality: str):
elif modality == "video":
placeholder = "<|video_pad|>"
prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
f"{question}<|im_end|>\n"
"<|im_start|>assistant\n")
prompts = [
("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
f"{question}<|im_end|>\n"
"<|im_start|>assistant\n") for question in questions
]
stop_token_ids = None
return llm, prompt, stop_token_ids
return llm, prompts, stop_token_ids
model_example_map = {
@ -624,29 +666,35 @@ def get_multi_modal_input(args):
# Input image and question
image = ImageAsset("cherry_blossom") \
.pil_image.convert("RGB")
img_question = "What is the content of this image?"
img_questions = [
"What is the content of this image?",
"Describe the content of this image in detail.",
"What's in the image?",
"Where is this image taken?",
]
return {
"data": image,
"question": img_question,
"questions": img_questions,
}
if args.modality == "video":
# Input video and question
video = VideoAsset(name="sample_demo_1.mp4",
num_frames=args.num_frames).np_ndarrays
vid_question = "Why is this video funny?"
vid_questions = ["Why is this video funny?"]
return {
"data": video,
"question": vid_question,
"questions": vid_questions,
}
msg = f"Modality {args.modality} is not supported."
raise ValueError(msg)
def apply_image_repeat(image_repeat_prob, num_prompts, data, prompt, modality):
def apply_image_repeat(image_repeat_prob, num_prompts, data,
prompts: list[str], modality):
"""Repeats images with provided probability of "image_repeat_prob".
Used to simulate hit/miss for the MM preprocessor cache.
"""
@ -666,7 +714,7 @@ def apply_image_repeat(image_repeat_prob, num_prompts, data, prompt, modality):
cur_image.putpixel((0, 0), new_val)
inputs.append({
"prompt": prompt,
"prompt": prompts[i % len(prompts)],
"multi_modal_data": {
modality: cur_image
}
@ -683,9 +731,14 @@ def main(args):
modality = args.modality
mm_input = get_multi_modal_input(args)
data = mm_input["data"]
question = mm_input["question"]
questions = mm_input["questions"]
llm, prompt, stop_token_ids = model_example_map[model](question, modality)
llm, prompts, stop_token_ids = model_example_map[model](questions,
modality)
# Don't want to check the flag multiple times, so just hijack `prompts`.
prompts = prompts if args.use_different_prompt_per_request else [
prompts[0]
]
# We set temperature to 0.2 so that outputs can be different
# even when all prompts are identical when running batch inference.
@ -697,27 +750,26 @@ def main(args):
if args.num_prompts == 1:
# Single inference
inputs = {
"prompt": prompt,
"prompt": prompts[0],
"multi_modal_data": {
modality: data
},
}
else:
# Batch inference
if args.image_repeat_prob is not None:
# Repeat images with specified probability of "image_repeat_prob"
inputs = apply_image_repeat(args.image_repeat_prob,
args.num_prompts, data, prompt,
args.num_prompts, data, prompts,
modality)
else:
# Use the same image for all prompts
inputs = [{
"prompt": prompt,
"prompt": prompts[i % len(prompts)],
"multi_modal_data": {
modality: data
},
} for _ in range(args.num_prompts)]
} for i in range(args.num_prompts)]
if args.time_generate:
import time
@ -775,5 +827,11 @@ if __name__ == "__main__":
action='store_true',
help='If True, then print the total generate() call time')
parser.add_argument(
'--use-different-prompt-per-request',
action='store_true',
help='If True, then use different prompt (with the same multi-modal '
'data) for each request.')
args = parser.parse_args()
main(args)

View File

@ -5,7 +5,7 @@ multi-image input on vision language models for text generation,
using the chat template defined by the model.
"""
from argparse import Namespace
from typing import List, NamedTuple, Optional
from typing import NamedTuple, Optional
from PIL.Image import Image
from transformers import AutoProcessor, AutoTokenizer
@ -24,8 +24,8 @@ IMAGE_URLS = [
class ModelRequestData(NamedTuple):
llm: LLM
prompt: str
stop_token_ids: Optional[List[int]]
image_data: List[Image]
stop_token_ids: Optional[list[int]]
image_data: list[Image]
chat_template: Optional[str]
@ -34,7 +34,7 @@ class ModelRequestData(NamedTuple):
# Unless specified, these settings have been tested to work on a single L4.
def load_aria(question, image_urls: List[str]) -> ModelRequestData:
def load_aria(question, image_urls: list[str]) -> ModelRequestData:
model_name = "rhymes-ai/Aria"
llm = LLM(model=model_name,
tokenizer_mode="slow",
@ -55,7 +55,7 @@ def load_aria(question, image_urls: List[str]) -> ModelRequestData:
)
def load_deepseek_vl2(question: str, image_urls: List[str]):
def load_deepseek_vl2(question: str, image_urls: list[str]):
model_name = "deepseek-ai/deepseek-vl2-tiny"
llm = LLM(model=model_name,
@ -77,7 +77,7 @@ def load_deepseek_vl2(question: str, image_urls: List[str]):
)
def load_h2ovl(question: str, image_urls: List[str]) -> ModelRequestData:
def load_h2ovl(question: str, image_urls: list[str]) -> ModelRequestData:
model_name = "h2oai/h2ovl-mississippi-800m"
llm = LLM(
@ -111,7 +111,7 @@ def load_h2ovl(question: str, image_urls: List[str]) -> ModelRequestData:
)
def load_idefics3(question, image_urls: List[str]) -> ModelRequestData:
def load_idefics3(question, image_urls: list[str]) -> ModelRequestData:
model_name = "HuggingFaceM4/Idefics3-8B-Llama3"
# The configuration below has been confirmed to launch on a single L40 GPU.
@ -142,7 +142,7 @@ def load_idefics3(question, image_urls: List[str]) -> ModelRequestData:
)
def load_internvl(question: str, image_urls: List[str]) -> ModelRequestData:
def load_internvl(question: str, image_urls: list[str]) -> ModelRequestData:
model_name = "OpenGVLab/InternVL2-2B"
llm = LLM(
@ -179,7 +179,7 @@ def load_internvl(question: str, image_urls: List[str]) -> ModelRequestData:
)
def load_mllama(question, image_urls: List[str]) -> ModelRequestData:
def load_mllama(question, image_urls: list[str]) -> ModelRequestData:
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# The configuration below has been confirmed to launch on a single L40 GPU.
@ -201,7 +201,7 @@ def load_mllama(question, image_urls: List[str]) -> ModelRequestData:
)
def load_nvlm_d(question: str, image_urls: List[str]):
def load_nvlm_d(question: str, image_urls: list[str]):
model_name = "nvidia/NVLM-D-72B"
# Adjust this as necessary to fit in GPU
@ -234,7 +234,7 @@ def load_nvlm_d(question: str, image_urls: List[str]):
)
def load_pixtral_hf(question: str, image_urls: List[str]) -> ModelRequestData:
def load_pixtral_hf(question: str, image_urls: list[str]) -> ModelRequestData:
model_name = "mistral-community/pixtral-12b"
# Adjust this as necessary to fit in GPU
@ -259,7 +259,7 @@ def load_pixtral_hf(question: str, image_urls: List[str]) -> ModelRequestData:
)
def load_phi3v(question: str, image_urls: List[str]) -> ModelRequestData:
def load_phi3v(question: str, image_urls: list[str]) -> ModelRequestData:
# num_crops is an override kwarg to the multimodal image processor;
# For some models, e.g., Phi-3.5-vision-instruct, it is recommended
# to use 16 for single frame scenarios, and 4 for multi-frame.
@ -295,7 +295,7 @@ def load_phi3v(question: str, image_urls: List[str]) -> ModelRequestData:
def load_qwen_vl_chat(question: str,
image_urls: List[str]) -> ModelRequestData:
image_urls: list[str]) -> ModelRequestData:
model_name = "Qwen/Qwen-VL-Chat"
llm = LLM(
model=model_name,
@ -336,7 +336,7 @@ def load_qwen_vl_chat(question: str,
)
def load_qwen2_vl(question, image_urls: List[str]) -> ModelRequestData:
def load_qwen2_vl(question, image_urls: list[str]) -> ModelRequestData:
try:
from qwen_vl_utils import process_vision_info
except ModuleNotFoundError:
@ -393,7 +393,7 @@ def load_qwen2_vl(question, image_urls: List[str]) -> ModelRequestData:
)
def load_qwen2_5_vl(question, image_urls: List[str]) -> ModelRequestData:
def load_qwen2_5_vl(question, image_urls: list[str]) -> ModelRequestData:
try:
from qwen_vl_utils import process_vision_info
except ModuleNotFoundError:
@ -466,7 +466,7 @@ model_example_map = {
}
def run_generate(model, question: str, image_urls: List[str]):
def run_generate(model, question: str, image_urls: list[str]):
req_data = model_example_map[model](question, image_urls)
sampling_params = SamplingParams(temperature=0.0,
@ -487,7 +487,7 @@ def run_generate(model, question: str, image_urls: List[str]):
print(generated_text)
def run_chat(model: str, question: str, image_urls: List[str]):
def run_chat(model: str, question: str, image_urls: list[str]):
req_data = model_example_map[model](question, image_urls)
sampling_params = SamplingParams(temperature=0.0,

View File

@ -1,61 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
import time
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
# Create a Whisper encoder/decoder model instance
llm = LLM(
model="openai/whisper-large-v3",
max_model_len=448,
max_num_seqs=400,
limit_mm_per_prompt={"audio": 1},
kv_cache_dtype="fp8",
)
prompts = [
{
"prompt": "<|startoftranscript|>",
"multi_modal_data": {
"audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
},
},
{ # Test explicit encoder/decoder prompt
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": AudioAsset("winning_call").audio_and_sample_rate,
},
},
"decoder_prompt": "<|startoftranscript|>",
}
] * 1024
# Create a sampling params object.
sampling_params = SamplingParams(
temperature=0,
top_p=1.0,
max_tokens=200,
)
start = time.time()
# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated
# text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
encoder_prompt = output.encoder_prompt
generated_text = output.outputs[0].text
print(f"Encoder prompt: {encoder_prompt!r}, "
f"Decoder prompt: {prompt!r}, "
f"Generated text: {generated_text!r}")
duration = time.time() - start
print("Duration:", duration)
print("RPS:", len(prompts) / duration)

View File

@ -7,7 +7,7 @@ For production use, we recommend `vllm serve` and the OpenAI client API.
import argparse
import json
from typing import Iterable, List
from collections.abc import Iterable
import requests
@ -39,7 +39,7 @@ def post_http_request(prompt: str,
return response
def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
def get_streaming_response(response: requests.Response) -> Iterable[list[str]]:
for chunk in response.iter_lines(chunk_size=8192,
decode_unicode=False,
delimiter=b"\0"):
@ -49,7 +49,7 @@ def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
yield output
def get_response(response: requests.Response) -> List[str]:
def get_response(response: requests.Response) -> list[str]:
data = json.loads(response.content)
output = data["text"]
return output

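The same sweep moves abstract collection types such as `Iterable` from `typing` to `collections.abc`, their canonical home since Python 3.9. A short, hypothetical sketch of a streaming-style helper written in the new style (names not taken from the diff):

```python
# Hypothetical sketch (not from the diff): Iterable imported from
# collections.abc and combined with built-in generics.
from collections.abc import Iterable


def batch_words(text: str, size: int = 3) -> Iterable[list[str]]:
    """Yield the words of `text` in lists of at most `size` items."""
    words = text.split()
    for start in range(0, len(words), size):
        yield words[start:start + size]


for batch in batch_words("streamed responses arrive as null-delimited chunks"):
    print(batch)
```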

@@ -0,0 +1,64 @@
# SPDX-License-Identifier: Apache-2.0
"""
An example showing how to generate structured outputs from reasoning models
such as DeepSeek-R1. The thinking process is not guided by the JSON schema
provided by the user; only the final output is structured.
To run this example, you need to start the vLLM server with the reasoning
parser:
```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--enable-reasoning --reasoning-parser deepseek_r1
```
This example demonstrates how to generate chat completions from reasoning models
using the OpenAI Python client library.
"""
from enum import Enum
from openai import OpenAI
from pydantic import BaseModel
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
# Guided decoding by JSON using Pydantic schema
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
prompt = ("Generate a JSON with the brand, model and car_type of"
"the most iconic car from the 90's, think in 100 tokens")
completion = client.chat.completions.create(
model=model,
messages=[{
"role": "user",
"content": prompt,
}],
extra_body={"guided_json": json_schema},
)
print("content", completion.choices[0].message.content)
print("reasoning_content: ", completion.choices[0].message.reasoning_content)

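Since guided decoding constrains only the final answer, the returned `content` should parse against the same schema that produced `json_schema`. A hypothetical follow-up check, assuming Pydantic v2's `model_validate_json` and the `completion` object created above:

```python
# Hypothetical follow-up (not part of the example file): the reasoning text is
# free-form, but the final content should satisfy the guided JSON schema.
car = CarDescription.model_validate_json(completion.choices[0].message.content)
print(f"Parsed structured output: {car.brand} {car.model} ({car.car_type.value})")
```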

@@ -24,4 +24,4 @@ responses = client.embeddings.create(
)
for data in responses.data:
print(data.embedding) # list of float of len 4096
print(data.embedding) # List of float of len 4096


@@ -65,6 +65,32 @@ exclude = [
[tool.ruff.lint.per-file-ignores]
"vllm/version.py" = ["F401"]
"vllm/_version.py" = ["ALL"]
# Python 3.8 typing. TODO: Remove these excludes after v1.0.0
"vllm/adapter_commons/**/*.py" = ["UP006", "UP035"]
"vllm/attention/**/*.py" = ["UP006", "UP035"]
"vllm/compilation/**/*.py" = ["UP006", "UP035"]
"vllm/core/**/*.py" = ["UP006", "UP035"]
"vllm/device_allocator/**/*.py" = ["UP006", "UP035"]
"vllm/distributed/**/*.py" = ["UP006", "UP035"]
"vllm/engine/**/*.py" = ["UP006", "UP035"]
"vllm/executor/**/*.py" = ["UP006", "UP035"]
"vllm/inputs/**/*.py" = ["UP006", "UP035"]
"vllm/logging_utils/**/*.py" = ["UP006", "UP035"]
"vllm/lora/**/*.py" = ["UP006", "UP035"]
"vllm/model_executor/**/*.py" = ["UP006", "UP035"]
"vllm/multimodal/**/*.py" = ["UP006", "UP035"]
"vllm/platforms/**/*.py" = ["UP006", "UP035"]
"vllm/plugins/**/*.py" = ["UP006", "UP035"]
"vllm/profiler/**/*.py" = ["UP006", "UP035"]
"vllm/prompt_adapter/**/*.py" = ["UP006", "UP035"]
"vllm/spec_decode/**/*.py" = ["UP006", "UP035"]
"vllm/third_party/**/*.py" = ["UP006", "UP035"]
"vllm/transformers_utils/**/*.py" = ["UP006", "UP035"]
"vllm/triton_utils/**/*.py" = ["UP006", "UP035"]
"vllm/usage/**/*.py" = ["UP006", "UP035"]
"vllm/vllm_flash_attn/**/*.py" = ["UP006", "UP035"]
"vllm/assets/**/*.py" = ["UP006", "UP035"]
"vllm/worker/**/*.py" = ["UP006", "UP035"]
[tool.ruff.lint]
select = [
@@ -91,8 +117,6 @@ ignore = [
"B007",
# f-string format
"UP032",
# Python 3.8 typing
"UP006", "UP035",
# Can remove once 3.10+ is the minimum Python version
"UP007",
]


@@ -1,7 +1,6 @@
psutil
sentencepiece # Required for LLaMA tokenizer.
numpy < 2.0.0
numba == 0.60.0 # v0.61 doesn't support Python 3.9. Required for N-gram speculative decoding.
requests >= 2.26.0
tqdm
blake3
@@ -38,3 +37,4 @@ depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
watchfiles # required for http server to monitor the updates of TLS files
python-json-logger # Used by logging as per examples/other/logging_configuration.md
scipy # Required for phi-4-multimodal-instruct


@@ -1,8 +1,10 @@
# Common dependencies
-r requirements-common.txt
numba == 0.60.0 # v0.61 doesn't support Python 3.9. Required for N-gram speculative decoding
# Dependencies for NVIDIA GPUs
ray[adag] == 2.40.0 # Required for pipeline parallelism in V1.
ray[cgraph] >= 2.43.0 # Ray Compiled Graph, required for pipeline parallelism in V1.
torch == 2.5.1
torchaudio==2.5.1
# These must be updated alongside torch


@@ -1,6 +1,8 @@
# Common dependencies
-r requirements-common.txt
numba == 0.60.0 # v0.61 doesn't support Python 3.9. Required for N-gram speculative decoding
# Dependencies for AMD GPUs
awscli
boto3


@@ -16,7 +16,7 @@ vector_quantize_pytorch # required for minicpmo_26 test
vocos # required for minicpmo_26 test
peft
pqdm
ray[adag]==2.40.0
ray[cgraph]>=2.43.0 # Ray Compiled Graph, required by pipeline parallelism tests
sentence-transformers # required for embedding tests
soundfile # required for audio tests
jiwer # required for audio tests


@@ -472,7 +472,7 @@ pyyaml==6.0.2
# vocos
rapidfuzz==3.12.1
# via jiwer
ray==2.40.0
ray==2.43.0
# via -r requirements-test.in
redis==5.2.0
# via tensorizer


@@ -17,9 +17,9 @@ ray[default]
--find-links https://storage.googleapis.com/libtpu-releases/index.html
--find-links https://storage.googleapis.com/jax-releases/jax_nightly_releases.html
--find-links https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
torch @ https://download.pytorch.org/whl/nightly/cpu/torch-2.6.0.dev20241216%2Bcpu-cp39-cp39-linux_x86_64.whl ; python_version == "3.9"
torch @ https://download.pytorch.org/whl/nightly/cpu/torch-2.6.0.dev20241216%2Bcpu-cp310-cp310-linux_x86_64.whl ; python_version == "3.10"
torch @ https://download.pytorch.org/whl/nightly/cpu/torch-2.6.0.dev20241216%2Bcpu-cp311-cp311-linux_x86_64.whl ; python_version == "3.11"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250124-cp39-cp39-linux_x86_64.whl ; python_version == "3.9"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250124-cp310-cp310-linux_x86_64.whl ; python_version == "3.10"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250124-cp311-cp311-linux_x86_64.whl ; python_version == "3.11"
torch @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-2.7.0.dev20250227%2Bcxx11-cp39-cp39-linux_x86_64.whl ; python_version == "3.9"
torch @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-2.7.0.dev20250227%2Bcxx11-cp310-cp310-linux_x86_64.whl ; python_version == "3.10"
torch @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-2.7.0.dev20250227%2Bcxx11-cp311-cp311-linux_x86_64.whl ; python_version == "3.11"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250227%2Bcxx11-cp39-cp39-linux_x86_64.whl ; python_version == "3.9"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250227%2Bcxx11-cp310-cp310-linux_x86_64.whl ; python_version == "3.10"
torch_xla[tpu, pallas] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.7.0.dev20250227%2Bcxx11-cp311-cp311-linux_x86_64.whl ; python_version == "3.11"


@@ -2,6 +2,7 @@
import ctypes
import importlib.util
import json
import logging
import os
import re
@@ -9,7 +10,6 @@ import subprocess
import sys
from pathlib import Path
from shutil import which
from typing import Dict, List
import torch
from packaging.version import Version, parse
@@ -78,7 +78,7 @@ class CMakeExtension(Extension):
class cmake_build_ext(build_ext):
# A dict of extension directories that have been configured.
did_config: Dict[str, bool] = {}
did_config: dict[str, bool] = {}
#
# Determine number of compilation jobs and optionally nvcc compile threads.
@@ -270,9 +270,32 @@ class repackage_wheel(build_ext):
"""Extracts libraries and other files from an existing wheel."""
def get_base_commit_in_main_branch(self) -> str:
import subprocess
# Force the use of the nightly wheel. This is mainly used for CI testing.
if envs.VLLM_TEST_USE_PRECOMPILED_NIGHTLY_WHEEL:
return "nightly"
try:
# Get the latest commit hash of the upstream main branch.
resp_json = subprocess.check_output([
"curl", "-s",
"https://api.github.com/repos/vllm-project/vllm/commits/main"
]).decode("utf-8")
upstream_main_commit = json.loads(resp_json)["sha"]
# Check if the local main branch is up-to-date. This is to ensure
# the base commit we found is the most recent commit on the main
# branch.
local_main_commit = subprocess.check_output(
["git", "rev-parse", "main"]).decode("utf-8").strip()
if local_main_commit != upstream_main_commit:
raise ValueError(
f"Local main branch ({local_main_commit}) is not "
"up-to-date with upstream main branch "
f"({upstream_main_commit}). Please pull the latest "
"changes from upstream main branch first.")
# Then get the commit hash of the current branch that is the same as
# the upstream main commit.
current_branch = subprocess.check_output(
["git", "branch", "--show-current"]).decode("utf-8").strip()
@@ -280,6 +303,8 @@ class repackage_wheel(build_ext):
["git", "merge-base", "main",
current_branch]).decode("utf-8").strip()
return base_commit
except ValueError as err:
raise ValueError(err) from None
except Exception as err:
logger.warning(
"Failed to get the base commit in the main branch. "
@@ -548,10 +573,10 @@ def get_vllm_version() -> str:
return version
def get_requirements() -> List[str]:
def get_requirements() -> list[str]:
"""Get Python package dependencies from requirements.txt."""
def _read_requirements(filename: str) -> List[str]:
def _read_requirements(filename: str) -> list[str]:
with open(get_path(filename)) as f:
requirements = f.read().strip().split("\n")
resolved_requirements = []

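The new base-commit lookup shells out to `curl` for the GitHub API request. Purely for illustration, a hypothetical standard-library equivalent of that single call (not what setup.py actually does):

```python
# Hypothetical sketch: fetch the upstream main commit SHA with urllib instead
# of a curl subprocess; requires network access.
import json
from urllib.request import urlopen

with urlopen("https://api.github.com/repos/vllm-project/vllm/commits/main") as resp:
    upstream_main_commit = json.loads(resp.read().decode("utf-8"))["sha"]
print(upstream_main_commit)
```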

@@ -1,6 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
"""vllm.entrypoints.api_server with some extra logging for testing."""
from typing import Any, Dict, Iterable
from collections.abc import Iterable
from typing import Any
import uvicorn
from fastapi.responses import JSONResponse, Response
@@ -24,7 +25,7 @@ class AsyncLLMEngineWithStats(AsyncLLMEngine):
self._num_aborts += len(ids)
await super()._engine_abort(ids)
def testing_stats(self) -> Dict[str, Any]:
def testing_stats(self) -> dict[str, Any]:
return {"num_aborted_requests": self._num_aborts}


@@ -6,7 +6,7 @@ import uuid
from asyncio import CancelledError
from copy import copy
from dataclasses import dataclass
from typing import List, Optional
from typing import Optional
import pytest
import pytest_asyncio
@@ -254,7 +254,7 @@ async def test_output_kinds(async_engine, stop):
params.output_kind = RequestOutputKind.DELTA
prompt_tokens = None
output_tokens: List[int] = []
output_tokens: list[int] = []
output_text = ""
output_count = 0
final_output = None


@@ -13,21 +13,26 @@ class TestBackend:
This class provides a simple Inductor backend that can be used for testing.
It takes a list of custom passes and runs them after Inductor's passes.
It also saves the graph before and after the custom passes for inspection.
Inductor config can be modified directly by editing the inductor_config
property. This can be helpful for adding passes like the
'pre_grad_custom_pass' and the 'post_grad_custom_pre_pass'.
"""
def __init__(self, *passes: Union[InductorPass, Callable[[fx.Graph],
None]]):
self.custom_passes = list(passes)
from torch._inductor import config
self.current_config = config.shallow_copy_dict()
self.current_config['force_disable_caches'] = True
self.current_config['post_grad_custom_post_pass'] = self.post_pass
self.inductor_config = config.shallow_copy_dict()
self.inductor_config['force_disable_caches'] = True
self.inductor_config['post_grad_custom_post_pass'] = self.post_pass
def __call__(self, graph: fx.GraphModule, example_inputs):
self.graph_pre_compile = deepcopy(graph)
from torch._inductor.compile_fx import compile_fx
return compile_fx(graph,
example_inputs,
config_patches=self.current_config)
config_patches=self.inductor_config)
def post_pass(self, graph: fx.Graph):
self.graph_pre_pass = deepcopy(graph)

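The rename from `current_config` to `inductor_config` matches the updated docstring: tests can tweak Inductor settings on the backend before compiling. A hypothetical usage sketch, where the import path and `my_pre_grad_pass` are assumptions rather than code from the diff:

```python
# Hypothetical usage sketch; the import path and the custom pass are assumptions.
import torch

from tests.compile.backend import TestBackend  # path assumed, run from the repo root


def my_pre_grad_pass(graph: torch.fx.Graph) -> None:
    # Inspect or rewrite the FX graph before Inductor's pre-grad passes run.
    pass


backend = TestBackend()  # custom post-grad passes could also be passed here
backend.inductor_config['pre_grad_custom_pass'] = my_pre_grad_pass

compiled = torch.compile(lambda x: x * 2 + 1, backend=backend)
print(compiled(torch.ones(4)))
```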

@@ -8,7 +8,7 @@ if the config `tractable_init` is set to True. Otherwise, the weights are
initialized randomly with a fixed seed.
"""
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple
from typing import Any, Optional
import torch
from torch import nn
@@ -56,7 +56,7 @@ class LlamaConfig:
random_seed: int = 0
def compute_hash(self) -> str:
factors: List[Any] = []
factors: list[Any] = []
for k, v in self.__dict__.items():
if k == "random_seed":
continue
@@ -174,7 +174,7 @@ class LlamaDecoderLayer(nn.Module):
positions: torch.Tensor,
hidden_states: torch.Tensor,
residual: Optional[torch.Tensor],
) -> Tuple[torch.Tensor, torch.Tensor]:
) -> tuple[torch.Tensor, torch.Tensor]:
"""
For tractable computation:
- if residual is None, the outputs are:

Some files were not shown because too many files have changed in this diff.