Compare commits


1 commit

SHA1: b6381ced9c
Message: updated
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Date: 2025-07-15 13:50:42 +00:00
684 changed files with 31338 additions and 37953 deletions

View File

@ -74,7 +74,7 @@ Here is an example of one test inside `latency-tests.json`:
In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command-line arguments for the latency benchmark (`vllm bench latency` on one side of this diff, `benchmark_latency.py` on the other). Note that an underscore `_` should be used in place of a dash `-` when specifying the command-line arguments; `run-performance-benchmarks.sh` converts the underscores back to dashes when feeding the arguments to the benchmark. For example, the corresponding command-line arguments will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`. A sketch of this conversion is shown below.
Note that the performance numbers are highly sensitive to the values of these parameters, so please make sure they are set correctly.
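For illustration only, here is a minimal sketch of that underscore-to-dash conversion, roughly what the script's `json2args` helper does with the example parameters above (this is not the actual implementation):

```bash
#!/usr/bin/env bash
# Sketch: turn a JSON "parameters" object into CLI flags, converting _ to -.
params='{"model": "meta-llama/Meta-Llama-3-8B", "tensor_parallel_size": 1, "load_format": "dummy", "num_iters_warmup": 5, "num_iters": 15}'

args=$(echo "$params" | jq -r 'to_entries
  | map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring))
  | join(" ")')

# Print the command the benchmark script would run for this test entry.
echo vllm bench latency $args
# -> vllm bench latency --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15
```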
@ -82,13 +82,13 @@ WARNING: The benchmarking script will save json results by itself, so please do
### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed to the throughput benchmark (`vllm bench throughput` / `benchmark_throughput.py`).
The output number of this test is also stable, but a slight change in the value of this parameter might vary the performance numbers by a lot.
### Serving test
We test the throughput by using the serving benchmark (`vllm bench serve` / `benchmark_serving.py`) with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
```json
[
@ -118,8 +118,8 @@ Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` attribute includes the command-line arguments for the vLLM server.
- The `client-parameters` attribute includes the command-line arguments for the benchmarking client (`vllm bench serve` / `benchmark_serving.py`).
- The `qps_list` attribute controls the list of QPS values to test. It is used to configure the `--request-rate` parameter of the client; a sketch is shown below.
The numbers from this test are less stable compared to the delay and latency benchmarks (due to randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still varies the output greatly.
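To make the `qps_list` → `--request-rate` mapping concrete, here is a small, hypothetical sketch of how the client could be invoked once per QPS value (flag names are the ones referenced in this document; the model and values are placeholders):

```bash
#!/usr/bin/env bash
# Illustrative only: run the serving client once per entry in qps_list.
qps_list=(1 4 16 "inf")   # "inf" = unthrottled, i.e. pure serving-overhead test

for qps in "${qps_list[@]}"; do
  vllm bench serve \
    --model meta-llama/Meta-Llama-3-8B \
    --request-rate "$qps" \
    --num-prompts 200 \
    --save-result \
    --result-filename "serving_qps_${qps}.json"
done
```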

View File

@ -100,7 +100,7 @@ if __name__ == "__main__":
raw_result = json.loads(f.read())
if "serving" in str(test_file):
- # this result is generated via `vllm bench serve` command
+ # this result is generated via `benchmark_serving.py`
# attach the benchmarking command to raw_result
try:
@ -120,7 +120,7 @@ if __name__ == "__main__":
continue
elif "latency" in f.name:
- # this result is generated via `vllm bench latency` command
+ # this result is generated via `benchmark_latency.py`
# attach the benchmarking command to raw_result
try:
@ -148,7 +148,7 @@ if __name__ == "__main__":
continue
elif "throughput" in f.name:
- # this result is generated via `vllm bench throughput` command
+ # this result is generated via `benchmark_throughput.py`
# attach the benchmarking command to raw_result
try:

View File

@ -73,7 +73,7 @@ get_current_llm_serving_engine() {
echo "Container: vllm" echo "Container: vllm"
# move to a completely irrelevant directory, to avoid import vllm from current folder # move to a completely irrelevant directory, to avoid import vllm from current folder
export CURRENT_LLM_SERVING_ENGINE=vllm export CURRENT_LLM_SERVING_ENGINE=vllm
return return
fi fi
} }
@ -95,14 +95,12 @@ json2args() {
} }
kill_gpu_processes() { kill_gpu_processes() {
pkill -f '[p]ython' pkill -f python
pkill -f '[p]ython3' pkill -f python3
pkill -f '[t]ritonserver' pkill -f tritonserver
pkill -f '[p]t_main_thread' pkill -f pt_main_thread
pkill -f '[t]ext-generation' pkill -f text-generation
pkill -f '[l]mdeploy' pkill -f lmdeploy
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pkill -f '[V]LLM'
while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1 sleep 1
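A brief aside on the bracketed patterns above: wrapping the first character in brackets (e.g. `'[p]ython'`) is the usual self-exclusion idiom for `pgrep`/`pkill -f`. A small sketch of the difference (assumes procps `pgrep`):

```bash
# The regex [p]ython still matches "python" in a target's command line,
# but a command line that literally contains "[p]ython" (this pgrep call,
# or a wrapper script embedding the pattern) does not match, so it is not
# swept up by the kill.
pgrep -af 'python'     # may also list wrappers whose arguments contain "python"
pgrep -af '[p]ython'   # lists only processes actually running python
```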
@ -127,7 +125,7 @@ ensure_installed() {
}
run_serving_tests() {
- # run serving tests using `vllm bench serve` command
+ # run serving tests using `benchmark_serving.py`
# $1: a json file specifying serving test cases
local serving_test_file
@ -227,7 +225,7 @@ run_serving_tests() {
if [[ "$dataset_name" = "sharegpt" ]]; then
- client_command="vllm bench serve \
+ client_command="python3 benchmark_serving.py \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@ -248,7 +246,7 @@ run_serving_tests() {
sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')
- client_command="vllm bench serve \
+ client_command="python3 benchmark_serving.py \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@ -267,13 +265,13 @@ run_serving_tests() {
$client_args"
else
echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
exit 1
fi
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
@ -304,7 +302,7 @@ run_serving_tests() {
}
run_genai_perf_tests() {
# run genai-perf tests
# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
@ -313,14 +311,14 @@ run_genai_perf_tests() {
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
@ -371,10 +369,10 @@ run_genai_perf_tests() {
qps=$num_prompts
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
@ -415,7 +413,7 @@ prepare_dataset() {
do
cat sonnet.txt >> sonnet_4x.txt
done
}
main() {

View File

@ -126,8 +126,7 @@ kill_gpu_processes() {
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pgrep python3 | xargs -r kill -9
- # vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
- pgrep VLLM | xargs -r kill -9
# wait until GPU memory usage smaller than 1GB
if command -v nvidia-smi; then
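The body of that wait loop is cut off by the hunk; a minimal sketch of the pattern, modeled on the loop shown in the nightly script earlier in this diff (assumes `nvidia-smi` is available):

```bash
# Block until the first GPU reports less than ~1 GB of memory in use.
while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
  sleep 1
done
```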
@ -165,7 +164,7 @@ upload_to_buildkite() {
} }
run_latency_tests() {
- # run latency tests using `vllm bench latency` command
+ # run latency tests using `benchmark_latency.py`
# $1: a json file specifying latency test cases
local latency_test_file
@ -206,7 +205,7 @@ run_latency_tests() {
fi
fi
- latency_command=" $latency_envs vllm bench latency \
+ latency_command=" $latency_envs python3 benchmark_latency.py \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"
@ -232,7 +231,7 @@ run_latency_tests() {
}
run_throughput_tests() {
- # run throughput tests using `vllm bench throughput`
+ # run throughput tests using `benchmark_throughput.py`
# $1: a json file specifying throughput test cases
local throughput_test_file
@ -273,7 +272,7 @@ run_throughput_tests() {
fi
fi
- throughput_command=" $throughput_envs vllm bench throughput \
+ throughput_command=" $throughput_envs python3 benchmark_throughput.py \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"
@ -298,7 +297,7 @@ run_throughput_tests() {
}
run_serving_tests() {
- # run serving tests using `vllm bench serve` command
+ # run serving tests using `benchmark_serving.py`
# $1: a json file specifying serving test cases
local serving_test_file
@ -394,7 +393,7 @@ run_serving_tests() {
# pass the tensor parallel size to the client so that it can be displayed
# on the benchmark dashboard
- client_command="vllm bench serve \
+ client_command="python3 benchmark_serving.py \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
@ -448,7 +447,7 @@ main() {
(which jq) || (apt-get update && apt-get -y install jq)
(which lsof) || (apt-get update && apt-get install -y lsof)
- # get the current IP address, required by `vllm bench serve` command
+ # get the current IP address, required by benchmark_serving.py
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# turn off the reporting of the status of each request, to clean up the terminal output
export VLLM_LOGGING_LEVEL="WARNING"

View File

@ -108,6 +108,7 @@ fi
if [[ $commands == *" kernels/attention"* ]]; then if [[ $commands == *" kernels/attention"* ]]; then
commands="${commands} \ commands="${commands} \
--ignore=kernels/attention/test_attention_selector.py \ --ignore=kernels/attention/test_attention_selector.py \
--ignore=kernels/attention/test_blocksparse_attention.py \
--ignore=kernels/attention/test_encoder_decoder_attn.py \ --ignore=kernels/attention/test_encoder_decoder_attn.py \
--ignore=kernels/attention/test_flash_attn.py \ --ignore=kernels/attention/test_flash_attn.py \
--ignore=kernels/attention/test_flashinfer.py \ --ignore=kernels/attention/test_flashinfer.py \

View File

@ -6,16 +6,15 @@ set -ex
# allow to bind to different cores
CORE_RANGE=${CORE_RANGE:-48-95}
- # used for TP/PP E2E test
OMP_CORE_RANGE=${OMP_CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}
export CMAKE_BUILD_PARALLEL_LEVEL=32
# Setup cleanup
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"-avx2 || true;
}
trap remove_docker_container EXIT
remove_docker_container
@ -25,8 +24,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu .
# Run the image, setting --shm-size=4g for tensor parallel.
- docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
+ docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
- docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
+ docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
function cpu_tests() {
set -e
@ -69,7 +68,7 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"
# Note: disable it until supports V1
# Run AWQ test
@ -79,16 +78,17 @@ function cpu_tests() {
# tests/quantization/test_ipex_quant.py"
# online serving
- docker exec cpu-test-"$NUMA_NODE" bash -c '
- set -e
- VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
- timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
- vllm bench serve \
- --backend vllm \
- --dataset-name random \
- --model meta-llama/Llama-3.2-3B-Instruct \
- --num-prompts 20 \
- --endpoint /v1/completions'
+ docker exec cpu-test-"$NUMA_NODE" bash -c "
+ set -e
+ python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half &
+ timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
+ VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \
+ --backend vllm \
+ --dataset-name random \
+ --model facebook/opt-125m \
+ --num-prompts 20 \
+ --endpoint /v1/completions \
+ --tokenizer facebook/opt-125m"
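Both variants above follow the same start/wait/benchmark pattern; a condensed, hypothetical sketch of that flow (the function name and model are illustrative, the commands are the ones used in this diff):

```bash
smoke_test_serving() {
  local model=$1
  # start the OpenAI-compatible server in the background
  vllm serve "$model" &
  local server_pid=$!
  # wait up to 10 minutes until the server answers /v1/models
  timeout 600 bash -c 'until curl -s localhost:8000/v1/models; do sleep 1; done' || return 1
  # run a small random-prompt benchmark against it, then shut the server down
  vllm bench serve --backend vllm --dataset-name random \
    --model "$model" --num-prompts 20 --endpoint /v1/completions
  kill "$server_pid"
}

smoke_test_serving meta-llama/Llama-3.2-3B-Instruct
```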
# Run multi-lora tests # Run multi-lora tests
docker exec cpu-test-"$NUMA_NODE" bash -c " docker exec cpu-test-"$NUMA_NODE" bash -c "

View File

@ -6,17 +6,19 @@ set -exuo pipefail
# Try building the docker image
cat <<EOF | docker build -t hpu-plugin-v1-test-env -f - .
- FROM gaudi-base-image:latest
+ FROM 1.22-413-pt2.7.1:latest
COPY ./ /workspace/vllm
WORKDIR /workspace/vllm
+ RUN pip install -v -r requirements/hpu.txt
+ RUN pip install git+https://github.com/vllm-project/vllm-gaudi.git
ENV no_proxy=localhost,127.0.0.1
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
- RUN VLLM_TARGET_DEVICE=empty pip install .
+ RUN VLLM_TARGET_DEVICE=hpu python3 setup.py install
RUN pip install git+https://github.com/vllm-project/vllm-gaudi.git
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils

View File

@ -1,166 +0,0 @@
#!/bin/bash
set -xu
remove_docker_container() {
docker rm -f tpu-test || true;
docker rm -f vllm-tpu || true;
}
trap remove_docker_container EXIT
# Remove the container that might not be cleaned up in the previous run.
remove_docker_container
# Build the docker image.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
# Set up cleanup.
cleanup_docker() {
# Get Docker's root directory
docker_root=$(docker info -f '{{.DockerRootDir}}')
if [ -z "$docker_root" ]; then
echo "Failed to determine Docker root directory."
exit 1
fi
echo "Docker root directory: $docker_root"
# Check disk usage of the filesystem where Docker's root directory is located
disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
# Define the threshold
threshold=70
if [ "$disk_usage" -gt "$threshold" ]; then
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune --force --filter "until=72h" --all
echo "Docker images and volumes cleanup completed."
else
echo "Disk usage is below $threshold%. No cleanup needed."
fi
}
cleanup_docker
# For HF_TOKEN.
source /etc/environment
docker run --privileged --net host --shm-size=16G -it \
-e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
vllm-tpu /bin/bash -c '
set -e # Exit immediately if a command exits with a non-zero status.
set -u # Treat unset variables as an error.
echo "--- Starting script inside Docker container ---"
# Create results directory
RESULTS_DIR=$(mktemp -d)
# If mktemp fails, set -e will cause the script to exit.
echo "Results will be stored in: $RESULTS_DIR"
# Install dependencies
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \
&& python3 -m pip install --progress-bar off hf-transfer
echo "--- Python dependencies installed ---"
export VLLM_USE_V1=1
export VLLM_XLA_CHECK_RECOMPILATION=1
export VLLM_XLA_CACHE_PATH=
echo "Using VLLM V1"
echo "--- Hardware Information ---"
# tpu-info
echo "--- Starting Tests ---"
set +e
overall_script_exit_code=0
# --- Test Definitions ---
# If a test fails, this function will print logs and will not cause the main script to exit.
run_test() {
local test_num=$1
local test_name=$2
local test_command=$3
local log_file="$RESULTS_DIR/test_${test_num}.log"
local actual_exit_code
echo "--- TEST_$test_num: Running $test_name ---"
# Execute the test command.
eval "$test_command" > >(tee -a "$log_file") 2> >(tee -a "$log_file" >&2)
actual_exit_code=$?
echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" # This goes to main log
echo "TEST_${test_num}_COMMAND_EXIT_CODE: $actual_exit_code" >> "$log_file" # Also to per-test log
if [ "$actual_exit_code" -ne 0 ]; then
echo "TEST_$test_num ($test_name) FAILED with exit code $actual_exit_code." >&2
echo "--- Log for failed TEST_$test_num ($test_name) ---" >&2
if [ -f "$log_file" ]; then
cat "$log_file" >&2
else
echo "Log file $log_file not found for TEST_$test_num ($test_name)." >&2
fi
echo "--- End of log for TEST_$test_num ($test_name) ---" >&2
return "$actual_exit_code" # Return the failure code
else
echo "TEST_$test_num ($test_name) PASSED."
return 0 # Return success
fi
}
# Helper function to call run_test and update the overall script exit code
run_and_track_test() {
local test_num_arg="$1"
local test_name_arg="$2"
local test_command_arg="$3"
# Run the test
run_test "$test_num_arg" "$test_name_arg" "$test_command_arg"
local test_specific_exit_code=$?
# If the test failed, set the overall script exit code to 1
if [ "$test_specific_exit_code" -ne 0 ]; then
# No need for extra echo here, run_test already logged the failure.
overall_script_exit_code=1
fi
}
# --- Actual Test Execution ---
run_and_track_test 1 "test_struct_output_generate.py" \
"HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
run_and_track_test 2 "test_moe_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
run_and_track_test 3 "test_lora.py" \
"VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
run_and_track_test 4 "test_tpu_qkv_linear.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
run_and_track_test 5 "test_spmd_model_weight_loading.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
run_and_track_test 6 "test_kv_cache_update_kernel.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
# After all tests have been attempted, exit with the overall status.
if [ "$overall_script_exit_code" -ne 0 ]; then
echo "--- One or more tests FAILED. Overall script exiting with failure code 1. ---"
else
echo "--- All tests have completed and PASSED. Overall script exiting with success code 0. ---"
fi
exit "$overall_script_exit_code"
' # IMPORTANT: This is the closing single quote for the bash -c "..." command. Ensure it is present and correct.
# Capture the exit code of the docker run command
DOCKER_RUN_EXIT_CODE=$?
# The trap will run for cleanup.
# Exit the main script with the Docker run command's exit code.
if [ "$DOCKER_RUN_EXIT_CODE" -ne 0 ]; then
echo "Docker run command failed with exit code $DOCKER_RUN_EXIT_CODE."
exit "$DOCKER_RUN_EXIT_CODE"
else
echo "Docker run command completed successfully."
exit 0
fi
# TODO: This test fails because it uses RANDOM_SEED sampling
# pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \

View File

@ -62,8 +62,7 @@ echo "Results will be stored in: $RESULTS_DIR"
echo "--- Installing Python dependencies ---" echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \ python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \ && python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off lm_eval[api]==0.4.4 \ && python3 -m pip install --progress-bar off lm_eval[api]==0.4.4
&& python3 -m pip install --progress-bar off hf-transfer
echo "--- Python dependencies installed ---" echo "--- Python dependencies installed ---"
export VLLM_USE_V1=1 export VLLM_USE_V1=1
export VLLM_XLA_CHECK_RECOMPILATION=1 export VLLM_XLA_CHECK_RECOMPILATION=1
@ -71,7 +70,7 @@ export VLLM_XLA_CACHE_PATH=
echo "Using VLLM V1" echo "Using VLLM V1"
echo "--- Hardware Information ---" echo "--- Hardware Information ---"
# tpu-info tpu-info
echo "--- Starting Tests ---" echo "--- Starting Tests ---"
set +e set +e
overall_script_exit_code=0 overall_script_exit_code=0
@ -135,7 +134,7 @@ run_and_track_test 1 "test_compilation.py" \
run_and_track_test 2 "test_basic.py" \ run_and_track_test 2 "test_basic.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py" "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_basic.py"
run_and_track_test 3 "test_accuracy.py::test_lm_eval_accuracy_v1_engine" \ run_and_track_test 3 "test_accuracy.py::test_lm_eval_accuracy_v1_engine" \
"HF_HUB_DISABLE_XET=1 python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine" "python3 -m pytest -s -v /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine"
run_and_track_test 4 "test_quantization_accuracy.py" \ run_and_track_test 4 "test_quantization_accuracy.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py" "python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py"
run_and_track_test 5 "examples/offline_inference/tpu.py" \ run_and_track_test 5 "examples/offline_inference/tpu.py" \
@ -150,6 +149,18 @@ run_and_track_test 9 "test_multimodal.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py" "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py"
run_and_track_test 10 "test_pallas.py" \ run_and_track_test 10 "test_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py"
run_and_track_test 11 "test_struct_output_generate.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k \"not test_structured_output_with_reasoning_matrices\""
run_and_track_test 12 "test_moe_pallas.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/tpu/test_moe_pallas.py"
run_and_track_test 13 "test_lora.py" \
"VLLM_XLA_CHECK_RECOMPILATION=0 python3 -m pytest -s -v /workspace/vllm/tests/tpu/lora/test_lora.py"
run_and_track_test 14 "test_tpu_qkv_linear.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
run_and_track_test 15 "test_spmd_model_weight_loading.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
run_and_track_test 16 "test_kv_cache_update_kernel.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"
# After all tests have been attempted, exit with the overall status.
if [ "$overall_script_exit_code" -ne 0 ]; then

View File

@ -31,13 +31,4 @@ docker run \
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
cd tests
pytest -v -s v1/core
- pytest -v -s v1/engine
- pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
- pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
- pytest -v -s v1/structured_output
- pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_eagle.py
- pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py
- pytest -v -s v1/test_serial_utils.py
- pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_metrics_reader.py
'

View File

@ -11,10 +11,10 @@ cd "$(dirname "${BASH_SOURCE[0]}")/../.."
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
# run python-based benchmarks and upload the result to buildkite
- vllm bench latency --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
+ python3 benchmarks/benchmark_latency.py --output-json latency_results.json 2>&1 | tee benchmark_latency.txt
bench_latency_exit_code=$?
- vllm bench throughput --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
+ python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --output-json throughput_results.json 2>&1 | tee benchmark_throughput.txt
bench_throughput_exit_code=$?
# run server-based benchmarks and upload the result to buildkite
@ -24,7 +24,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r
# wait for server to start, timeout after 600 seconds
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
- vllm bench serve \
+ python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \

View File

@ -77,7 +77,7 @@ done
echo "run benchmark test..." echo "run benchmark test..."
echo "logging to $BM_LOG" echo "logging to $BM_LOG"
echo echo
vllm bench serve \ python benchmarks/benchmark_serving.py \
--backend vllm \ --backend vllm \
--model $MODEL \ --model $MODEL \
--dataset-name sonnet \ --dataset-name sonnet \

View File

@ -159,14 +159,13 @@ steps:
- tests/distributed/test_utils - tests/distributed/test_utils
- tests/distributed/test_pynccl - tests/distributed/test_pynccl
- tests/distributed/test_events - tests/distributed/test_events
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile/test_basic_correctness - tests/compile/test_basic_correctness
- examples/offline_inference/rlhf.py - examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py - examples/offline_inference/rlhf_colocate.py
- tests/examples/offline_inference/data_parallel.py - tests/examples/offline_inference/data_parallel.py
- tests/v1/test_async_llm_dp.py - tests/v1/test_async_llm_dp.py
- tests/v1/test_external_lb_dp.py - tests/v1/test_external_lb_dp.py
- tests/v1/test_internal_lb_dp.py
- tests/v1/test_hybrid_lb_dp.py
- tests/v1/engine/test_engine_core_client.py - tests/v1/engine/test_engine_core_client.py
commands: commands:
# test with tp=2 and external_dp=2 # test with tp=2 and external_dp=2
@ -178,13 +177,12 @@ steps:
- python3 ../examples/offline_inference/data_parallel.py --enforce-eager - python3 ../examples/offline_inference/data_parallel.py --enforce-eager
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_external_lb_dp.py - TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_external_lb_dp.py
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/test_internal_lb_dp.py
- TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/test_hybrid_lb_dp.py
- pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp - pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp
- pytest -v -s distributed/test_utils.py - pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py - pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py - pytest -v -s distributed/test_pynccl.py
- pytest -v -s distributed/test_events.py - pytest -v -s distributed/test_events.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests # TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests # when we have multiple distributed example tests
- pushd ../examples/offline_inference - pushd ../examples/offline_inference
@ -229,7 +227,7 @@ steps:
##### 1 GPU test ##### ##### 1 GPU test #####
- label: Regression Test # 5min - label: Regression Test # 5min
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/test_regression - tests/test_regression
@ -268,7 +266,6 @@ steps:
- pytest -v -s v1/structured_output - pytest -v -s v1/structured_output
- pytest -v -s v1/spec_decode - pytest -v -s v1/spec_decode
- pytest -v -s v1/kv_connector/unit - pytest -v -s v1/kv_connector/unit
- pytest -v -s v1/metrics
- pytest -v -s v1/test_serial_utils.py - pytest -v -s v1/test_serial_utils.py
- pytest -v -s v1/test_utils.py - pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_oracle.py - pytest -v -s v1/test_oracle.py
@ -277,11 +274,11 @@ steps:
# VLLM_USE_FLASHINFER_SAMPLER or not on H100. # VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- pytest -v -s v1/e2e - pytest -v -s v1/e2e
# Integration test for streaming correctness (requires special branch). # Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api - pip install -U git+https://github.com/robertgshaw2-neuralmagic/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
- label: Examples Test # 25min - label: Examples Test # 25min
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/examples" working_dir: "/vllm-workspace/examples"
source_file_dependencies: source_file_dependencies:
- vllm/entrypoints - vllm/entrypoints
@ -315,7 +312,7 @@ steps:
- label: Platform Tests (CUDA) - label: Platform Tests (CUDA)
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/cuda - tests/cuda
@ -333,9 +330,20 @@ steps:
- pytest -v -s samplers - pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
- label: LoRA Test %N # 15min each - label: Speculative decoding tests # 40min
mirror_hardwares: [amdexperimental] mirror_hardwares: [amdexperimental]
source_file_dependencies: source_file_dependencies:
- vllm/spec_decode
- tests/spec_decode
- vllm/model_executor/models/eagle.py
commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_mtp_correctness.py
- pytest -v -s spec_decode/e2e/test_eagle_correctness.py
- label: LoRA Test %N # 15min each
mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies:
- vllm/lora - vllm/lora
- tests/lora - tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
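The `%N` steps above shard one pytest suite across parallel Buildkite jobs. A minimal sketch of how the two environment variables drive the split (assumes the pytest-shard plugin, which provides `--shard-id`/`--num-shards`):

```bash
# Buildkite exports these for a parallelized step; defaults keep local runs whole.
export BUILDKITE_PARALLEL_JOB=${BUILDKITE_PARALLEL_JOB:-0}              # this job's index (0-based)
export BUILDKITE_PARALLEL_JOB_COUNT=${BUILDKITE_PARALLEL_JOB_COUNT:-1}  # total parallel jobs

# Each job runs only its 1/N slice of the LoRA suite.
pytest -v -s lora \
  --shard-id="$BUILDKITE_PARALLEL_JOB" \
  --num-shards="$BUILDKITE_PARALLEL_JOB_COUNT"
```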
@ -386,7 +394,7 @@ steps:
- pytest -v -s kernels/core - pytest -v -s kernels/core
- label: Kernels Attention Test %N - label: Kernels Attention Test %N
mirror_hardwares: [amdexperimental] mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies: source_file_dependencies:
- csrc/attention/ - csrc/attention/
- vllm/attention - vllm/attention
@ -397,7 +405,7 @@ steps:
parallelism: 2 parallelism: 2
- label: Kernels Quantization Test %N - label: Kernels Quantization Test %N
mirror_hardwares: [amdexperimental] mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies: source_file_dependencies:
- csrc/quantization/ - csrc/quantization/
- vllm/model_executor/layers/quantization - vllm/model_executor/layers/quantization
@ -416,7 +424,7 @@ steps:
- pytest -v -s kernels/moe - pytest -v -s kernels/moe
- label: Kernels Mamba Test - label: Kernels Mamba Test
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
source_file_dependencies: source_file_dependencies:
- csrc/mamba/ - csrc/mamba/
- tests/kernels/mamba - tests/kernels/mamba
@ -424,7 +432,7 @@ steps:
- pytest -v -s kernels/mamba - pytest -v -s kernels/mamba
- label: Tensorizer Test # 11min - label: Tensorizer Test # 11min
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
soft_fail: true soft_fail: true
source_file_dependencies: source_file_dependencies:
- vllm/model_executor/model_loader - vllm/model_executor/model_loader
@ -438,6 +446,7 @@ steps:
- label: Model Executor Test - label: Model Executor Test
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental, amdproduction]
soft_fail: true
source_file_dependencies: source_file_dependencies:
- vllm/model_executor - vllm/model_executor
- tests/model_executor - tests/model_executor
@ -494,7 +503,7 @@ steps:
- pytest -s entrypoints/openai/correctness/ - pytest -s entrypoints/openai/correctness/
- label: Encoder Decoder tests # 5min - label: Encoder Decoder tests # 5min
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/encoder_decoder - tests/encoder_decoder
@ -502,7 +511,7 @@ steps:
- pytest -v -s encoder_decoder - pytest -v -s encoder_decoder
- label: OpenAI-Compatible Tool Use # 20 min - label: OpenAI-Compatible Tool Use # 20 min
mirror_hardwares: [amdexperimental, amdproduction] mirror_hardwares: [amdexperimental]
fast_check: false fast_check: false
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
@ -614,7 +623,7 @@ steps:
- pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model' - pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'
- label: Quantized Models Test - label: Quantized Models Test
mirror_hardwares: [amdexperimental] mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies: source_file_dependencies:
- vllm/model_executor/layers/quantization - vllm/model_executor/layers/quantization
- tests/models/quantization - tests/models/quantization
@ -636,7 +645,7 @@ steps:
optional: true optional: true
commands: commands:
- pip install --upgrade git+https://github.com/huggingface/transformers - pip install --upgrade git+https://github.com/huggingface/transformers
- pytest -v -s tests/models/test_initialization.py - pytest -v -s models/test_initialization.py
- pytest -v -s tests/models/multimodal/processing/ - pytest -v -s tests/models/multimodal/processing/
- pytest -v -s tests/models/multimodal/test_mapping.py - pytest -v -s tests/models/multimodal/test_mapping.py
- python3 examples/offline_inference/basic/chat.py - python3 examples/offline_inference/basic/chat.py
@ -717,10 +726,10 @@ steps:
- pytest -v -s distributed/test_sequence_parallel.py - pytest -v -s distributed/test_sequence_parallel.py
# this test fails consistently. # this test fails consistently.
# TODO: investigate and fix # TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
- pytest -v -s models/multimodal/generation/test_maverick.py
- label: Plugin Tests (2 GPUs) # 40min - label: Plugin Tests (2 GPUs) # 40min
mirror_hardwares: [amdexperimental] mirror_hardwares: [amdexperimental]

View File

@ -1,6 +0,0 @@
# https://developers.google.com/gemini-code-assist/docs/customize-gemini-behavior-github
have_fun: false # Just review the code
code_review:
comment_severity_threshold: HIGH # Reduce quantity of comments
pull_request_opened:
summary: false # Don't summarize the PR in a separate comment

15
.github/CODEOWNERS vendored
View File

@ -16,7 +16,7 @@
/vllm/lora @jeejeelee
/vllm/reasoning @aarnphm
/vllm/entrypoints @aarnphm
- /vllm/compilation @zou3519 @youkaichao @ProExpertProg
+ /vllm/compilation @zou3519 @youkaichao
CMakeLists.txt @tlrmchlsmth @LucasWilkinson
# Any change to the VllmConfig can have a large user-facing impact,
@ -43,6 +43,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/quantization @mgoin @robertgshaw2-redhat
+ /tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
/tests/v1/structured_output @mgoin @russellb @aarnphm
@ -52,15 +53,3 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
# Docs
/docs @hmellor
mkdocs.yaml @hmellor
# CPU
/vllm/v1/worker/^cpu @bigPYJ1151
/csrc/cpu @bigPYJ1151
/vllm/platforms/cpu.py @bigPYJ1151
/cmake/cpu_extension.cmake @bigPYJ1151
/docker/Dockerfile.cpu @bigPYJ1151
# Intel GPU
/vllm/v1/worker/^xpu @jikunshang
/vllm/platforms/xpu.py @jikunshang
/docker/Dockerfile.xpu @jikunshang

View File

@ -46,7 +46,7 @@ body:
- type: markdown
attributes:
value: >
- Thanks for contributing 🎉! The vLLM core team hosts a biweekly RFC review session at 9:30AM Pacific Time, while most RFCs can be discussed online, you can optionally sign up for a slot to discuss your RFC online [here](https://docs.google.com/document/d/1CiLVBZeIVfR7_PNAKVSusxpceywkoOOB78qoWqHvSZc/edit).
+ Thanks for contributing 🎉!
- type: checkboxes
id: askllm
attributes:

3
.github/mergify.yml vendored
View File

@ -164,7 +164,10 @@ pull_request_rules:
description: Automatically apply speculative-decoding label
conditions:
- or:
+ - files~=^vllm/spec_decode/
- files~=^vllm/v1/spec_decode/
+ - files=vllm/model_executor/layers/spec_decode_base_sampler.py
+ - files~=^tests/spec_decode/
- files~=^tests/v1/spec_decode/
- files~=^examples/.*(spec_decode|mlpspeculator|eagle|speculation).*\.py
- files~=^vllm/model_executor/models/.*eagle.*\.py

View File

@ -7,7 +7,7 @@ permissions:
jobs:
lint-and-deploy:
- runs-on: ubuntu-24.04-arm
+ runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -21,7 +21,7 @@ repos:
- id: ruff-format
files: ^(.buildkite|benchmarks|examples)/.*
- repo: https://github.com/crate-ci/typos
- rev: v1.34.0
+ rev: v1.32.0
hooks:
- id: typos
- repo: https://github.com/PyCQA/isort

View File

@ -45,7 +45,7 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from docker/Dockerfile.rocm
#
- set(TORCH_SUPPORTED_VERSION_CUDA "2.7.1")
+ set(TORCH_SUPPORTED_VERSION_CUDA "2.7.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.7.0")
#
@ -296,8 +296,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_entry.cu"
"csrc/cutlass_extensions/common.cpp"
- "csrc/attention/mla/cutlass_mla_entry.cu"
- "csrc/quantization/fp8/per_token_group_quant.cu")
+ "csrc/attention/mla/cutlass_mla_entry.cu")
set_gencode_flags_for_srcs(
SRCS "${VLLM_EXT_SRC}"
@ -578,7 +577,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# if it's possible to compile MoE kernels that use its output.
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
- set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm90.cu")
+ set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
@ -596,26 +595,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif() endif()
endif() endif()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x_sm100.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
message(STATUS "Building grouped_mm_c3x for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
message(STATUS "Not building grouped_mm_c3x kernels as CUDA Compiler version is "
"not >= 12.8, we recommend upgrading to CUDA 12.8 or later "
"if you intend on running FP8 quantized MoE models on Blackwell.")
else()
message(STATUS "Not building grouped_mm_c3x as no compatible archs found "
"in CUDA target architectures.")
endif()
endif()
# moe_data.cu is used by all CUTLASS MoE kernels. # moe_data.cu is used by all CUTLASS MoE kernels.
cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}") cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS) if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS)
@ -635,7 +614,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"in CUDA target architectures.") "in CUDA target architectures.")
endif() endif()
endif() endif()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}") cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS) if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/blockwise_scaled_group_mm_sm100.cu") set(SRCS "csrc/quantization/cutlass_w8a8/moe/blockwise_scaled_group_mm_sm100.cu")
@ -768,14 +747,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu") list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu")
endif() endif()
if(VLLM_GPU_LANG STREQUAL "CUDA")
set(MOE_PERMUTE_SRC
"csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
"csrc/moe/moe_permute_unpermute_op.cu")
list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
endif()
set_gencode_flags_for_srcs( set_gencode_flags_for_srcs(
SRCS "${VLLM_MOE_EXT_SRC}" SRCS "${VLLM_MOE_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}") CUDA_ARCHS "${CUDA_ARCHS}")
@ -844,6 +815,17 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif() endif()
endif() endif()
if(VLLM_GPU_LANG STREQUAL "CUDA")
set(MOE_PERMUTE_SRC
"csrc/moe/permute_unpermute_kernels/moe_permute_unpermute_kernel.cu"
"csrc/moe/moe_permute_unpermute_op.cu")
set_gencode_flags_for_srcs(
SRCS "${MARLIN_PERMUTE_SRC}"
CUDA_ARCHS "${MOE_PERMUTE_ARCHS}")
list(APPEND VLLM_MOE_EXT_SRC "${MOE_PERMUTE_SRC}")
endif()
message(STATUS "Enabling moe extension.") message(STATUS "Enabling moe extension.")
define_gpu_extension_target( define_gpu_extension_target(
_moe_C _moe_C

View File

@ -52,36 +52,3 @@ After branch cut, we approach finalizing the release branch with clear criteria
* Release branch specific changes (e.g. change version identifiers or CI fixes)
Please note: **No feature work allowed for cherry picks**. All PRs that are considered for cherry-picks need to be merged on trunk; the only exceptions are release-branch-specific changes.
## Manual validations
### E2E Performance Validation
Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI.
**Current Coverage:**
* Models: Llama3, Llama4, and Mixtral
* Hardware: NVIDIA H100 and AMD MI300x
* *Note: Coverage may change based on new model releases and hardware availability*
**Performance Validation Process:**
**Step 1: Get Access**
Request write access to the [pytorch/pytorch-integration-testing](https://github.com/pytorch/pytorch-integration-testing) repository to run the benchmark workflow.
**Step 2: Review Benchmark Setup**
Familiarize yourself with the benchmark configurations:
* [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda)
* [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm)
**Step 3: Run the Benchmark**
Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure:
* **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`)
* **vLLM commit**: Set to the RC commit hash
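For those who prefer the command line over the Actions UI, a workflow dispatch along these lines should also work; note that the input names below are hypothetical placeholders and must be checked against the workflow file:

```bash
# Hypothetical sketch: trigger the vllm-benchmark workflow for a release candidate.
# The -f input names (vllm_branch, vllm_commit) are illustrative only.
gh workflow run vllm-benchmark.yml \
  -R pytorch/pytorch-integration-testing \
  -f vllm_branch=releases/v0.9.2 \
  -f vllm_commit=<RC-commit-hash>
```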
**Step 4: Review Results**
Once the workflow completes, benchmark results will be available on the [vLLM benchmark dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) under the corresponding branch and commit.
**Step 5: Performance Comparison**
Compare the current results against the previous release to verify no performance regressions have occurred. Here is an
example of [v0.9.1 vs v0.9.2](https://hud.pytorch.org/benchmark/llms?startTime=Thu%2C%2017%20Apr%202025%2021%3A43%3A50%20GMT&stopTime=Wed%2C%2016%20Jul%202025%2021%3A43%3A50%20GMT&granularity=week&lBranch=releases/v0.9.1&lCommit=b6553be1bc75f046b00046a4ad7576364d03c835&rBranch=releases/v0.9.2&rCommit=a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f&repoName=vllm-project%2Fvllm&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=All%20Devices&archName=All%20Platforms).

View File

@ -98,7 +98,7 @@ Then run the benchmarking script
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- vllm bench serve \
+ python3 vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
@ -111,25 +111,25 @@ If successful, you will see the following output
```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
@ -141,7 +141,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you
{"prompt": "What is the capital of India?"} {"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"} {"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"} {"prompt": "What is the capital of China?"}
``` ```
```bash ```bash
# start server # start server
@ -150,7 +150,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests
```bash ```bash
# run benchmarking script # run benchmarking script
vllm bench serve --port 9001 --save-result --save-detailed \ python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
--backend vllm \ --backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \ --model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/completions \ --endpoint /v1/completions \
@ -174,7 +174,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
``` ```
```bash ```bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \ --backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \ --endpoint /v1/chat/completions \
@ -194,7 +194,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
``` ```
``` bash ``` bash
vllm bench serve \ python3 benchmarks/benchmark_serving.py \
--model meta-llama/Meta-Llama-3-8B-Instruct \ --model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name hf \ --dataset-name hf \
--dataset-path likaixin/InstructCoder \ --dataset-path likaixin/InstructCoder \
@ -210,7 +210,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
**`lmms-lab/LLaVA-OneVision-Data`** **`lmms-lab/LLaVA-OneVision-Data`**
```bash ```bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \ --backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \ --endpoint /v1/chat/completions \
@ -224,7 +224,7 @@ vllm bench serve \
**`Aeala/ShareGPT_Vicuna_unfiltered`** **`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash ```bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \ --backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \ --endpoint /v1/chat/completions \
@ -237,7 +237,7 @@ vllm bench serve \
**`AI-MO/aimo-validation-aime`** **`AI-MO/aimo-validation-aime`**
``` bash ``` bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--model Qwen/QwQ-32B \ --model Qwen/QwQ-32B \
--dataset-name hf \ --dataset-name hf \
--dataset-path AI-MO/aimo-validation-aime \ --dataset-path AI-MO/aimo-validation-aime \
@ -248,7 +248,7 @@ vllm bench serve \
**`philschmid/mt-bench`** **`philschmid/mt-bench`**
``` bash ``` bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--model Qwen/QwQ-32B \ --model Qwen/QwQ-32B \
--dataset-name hf \ --dataset-name hf \
--dataset-path philschmid/mt-bench \ --dataset-path philschmid/mt-bench \
@ -261,7 +261,7 @@ When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command: parameters can be specified. Example client command:
```bash ```bash
vllm bench serve \ python3 vllm/benchmarks/benchmark_serving.py \
--backend vllm \ --backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \ --model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \ --endpoint /v1/completions \
@ -296,7 +296,7 @@ The following arguments can be used to control the ramp-up:
<br/> <br/>
```bash ```bash
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--model NousResearch/Hermes-3-Llama-3.1-8B \ --model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \ --dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \ --dataset-path vllm/benchmarks/sonnet.txt \
@ -314,7 +314,7 @@ Total num output tokens: 1500
**VisionArena Benchmark for Vision Language Models** **VisionArena Benchmark for Vision Language Models**
``` bash ``` bash
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \ --backend vllm-chat \
--dataset-name hf \ --dataset-name hf \
@ -336,7 +336,7 @@ Total num output tokens: 1280
``` bash ``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \ VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \ VLLM_USE_V1=1 \
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--dataset-name=hf \ --dataset-name=hf \
--dataset-path=likaixin/InstructCoder \ --dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \ --model=meta-llama/Meta-Llama-3-8B-Instruct \
@ -360,7 +360,7 @@ Total num output tokens: 204800
**`lmms-lab/LLaVA-OneVision-Data`** **`lmms-lab/LLaVA-OneVision-Data`**
```bash ```bash
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \ --backend vllm-chat \
--dataset-name hf \ --dataset-name hf \
@ -373,7 +373,7 @@ vllm bench throughput \
**`Aeala/ShareGPT_Vicuna_unfiltered`** **`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash ```bash
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \ --model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \ --backend vllm-chat \
--dataset-name hf \ --dataset-name hf \
@ -385,7 +385,7 @@ vllm bench throughput \
**`AI-MO/aimo-validation-aime`** **`AI-MO/aimo-validation-aime`**
```bash ```bash
vllm bench throughput \ python3 benchmarks/benchmark_throughput.py \
--model Qwen/QwQ-32B \ --model Qwen/QwQ-32B \
--backend vllm \ --backend vllm \
--dataset-name hf \ --dataset-name hf \
@ -399,7 +399,7 @@ vllm bench throughput \
``` bash ``` bash
# download dataset # download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput \ python3 vllm/benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-2-7b-hf \ --model meta-llama/Llama-2-7b-hf \
--backend vllm \ --backend vllm \
--dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \ --dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \

View File

@ -1,18 +1,45 @@
#!/bin/bash #!/bin/bash
# This script aims to tune the best server parameter combinations to maximize throughput for given requirement. # This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# See details in README (benchmarks/auto_tune/README.md). # The current server parameter combination is max_num_seqs and max_num_batched_tokens
# It also supports additional requirement: e2e latency and prefix cache.
# Pre-requisite:
# 1. Checkout to your branch, install/ update the correct running env. For TPU, activate conda env and install the corresponding torch, xla version.
# 2. If the model is customized, replace the MODEL's config with the customized config.
# 3. Set variables (ALL REQUIRED)
# BASE: your directory for vllm repo
# MODEL: the model served by vllm
# SYSTEM: the hardware, choice TPU or GPU, for other systems, "get best profile" might not support.
# TP: ways of tensor parallelism
# DOWNLOAD_DIR: directory to download and load model weights.
# INPUT_LEN: request input len
# OUTPUT_LEN: request output len
# MIN_CACHE_HIT_PCT: prefix cache rate
# MAX_LATENCY_ALLOWED_MS: (e2e) latency requirement. If there's no latency requirement, set it to a large number like 1000000000
# NUM_SEQS_LIST: a list of `max-num-seqs` you want to loop with.
# NUM_BATCHED_TOKENS_LIST: a list of `max-num-batched-tokens` you want to loop with.
# Note that the default NUM_SEQS_LIST and NUM_BATCHED_TOKENS_LIST are set for medium size input/output len, for extra short context (such as 20:20), you might need to include larger numbers in NUM_SEQS_LIST.
# 4. Run the script, it might take a long time, you can use tmux to avoid the script stop if disconnection happens.
# 5. The final result will be saved in RESULT file.
# Example use cases
# 1. Given input_len=1800, output_len=20, what's the best max_num_seqs and max_num_batched_tokens to get highest throughput?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=100000000000
# 2. If we have latency requirement to be lower than 500ms, what's the best server parameter?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=500
# 3. If we want to reach 60% prefix cache, what's the best server parameter?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=60, MAX_LATENCY_ALLOWED_MS=500
TAG=$(date +"%Y_%m_%d_%H_%M") TAG=$(date +"%Y_%m_%d_%H_%M")
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) BASE=""
BASE="$SCRIPT_DIR/../../.."
MODEL="meta-llama/Llama-3.1-8B-Instruct" MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU" SYSTEM="TPU"
TP=1 TP=1
DOWNLOAD_DIR="" DOWNLOAD_DIR=""
INPUT_LEN=4000 INPUT_LEN=4000
OUTPUT_LEN=16 OUTPUT_LEN=16
MAX_MODEL_LEN=4096
MIN_CACHE_HIT_PCT=0 MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000 MAX_LATENCY_ALLOWED_MS=100000000000
NUM_SEQS_LIST="128 256" NUM_SEQS_LIST="128 256"
@ -38,13 +65,6 @@ current_hash=$(git rev-parse HEAD)
echo "hash:$current_hash" >> "$RESULT" echo "hash:$current_hash" >> "$RESULT"
echo "current_hash: $current_hash" echo "current_hash: $current_hash"
TOTAL_LEN=$((INPUT_LEN + OUTPUT_LEN))
RED='\033[0;31m'
if (( TOTAL_LEN > MAX_MODEL_LEN )); then
echo -e "${RED}FAILED: INPUT_LEN($INPUT_LEN) + OUTPUT_LEN($OUTPUT_LEN) = $TOTAL_LEN, which is > MAX_MODEL_LEN = $MAX_MODEL_LEN.\033[0m" >&2
exit 1
fi
best_throughput=0 best_throughput=0
best_max_num_seqs=0 best_max_num_seqs=0
best_num_batched_tokens=0 best_num_batched_tokens=0
@ -56,7 +76,7 @@ start_server() {
local max_num_batched_tokens=$3 local max_num_batched_tokens=$3
local vllm_log=$4 local vllm_log=$4
local profile_dir=$5 local profile_dir=$5
pkill -f vllm pkill -f vllm
VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \ VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
@ -69,13 +89,13 @@ start_server() {
--enable-prefix-caching \ --enable-prefix-caching \
--load-format dummy \ --load-format dummy \
--download-dir "$DOWNLOAD_DIR" \ --download-dir "$DOWNLOAD_DIR" \
--max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 & --max-model-len $(( INPUT_LEN+OUTPUT_LEN )) > "$vllm_log" 2>&1 &
# wait for 10 minutes... # wait for 10 minutes...
server_started=0 server_started=0
for i in {1..60}; do for i in {1..60}; do
RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout) RESPONSE=$(curl -s -X GET "http://0.0.0.0:8004/health" -w "%{http_code}" -o /dev/stdout)
STATUS_CODE=$(echo "$RESPONSE" | tail -n 1) STATUS_CODE=$(echo "$RESPONSE" | tail -n 1)
if [[ "$STATUS_CODE" -eq 200 ]]; then if [[ "$STATUS_CODE" -eq 200 ]]; then
server_started=1 server_started=1
break break
@ -98,10 +118,10 @@ update_best_profile() {
selected_profile_file= selected_profile_file=
if [[ "$SYSTEM" == "TPU" ]]; then if [[ "$SYSTEM" == "TPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb" selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
fi fi
if [[ "$SYSTEM" == "GPU" ]]; then if [[ "$SYSTEM" == "GPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}" selected_profile_file="${sorted_paths[$profile_index]}"
fi fi
rm -f $PROFILE_PATH/* rm -f $PROFILE_PATH/*
cp $selected_profile_file $PROFILE_PATH cp $selected_profile_file $PROFILE_PATH
} }
@ -129,18 +149,17 @@ run_benchmark() {
echo "server started." echo "server started."
fi fi
echo echo
echo "run benchmark test..." echo "run benchmark test..."
meet_latency_requirement=0 meet_latency_requirement=0
# get a basic qps by using request-rate inf # get a basic qps by using request-rate inf
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt" bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt"
prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 )) prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 ))
adjusted_input_len=$(( INPUT_LEN - prefix_len )) python benchmarks/benchmark_serving.py \
vllm bench serve \
--backend vllm \ --backend vllm \
--model $MODEL \ --model $MODEL \
--dataset-name random \ --dataset-name random \
--random-input-len $adjusted_input_len \ --random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \ --random-output-len $OUTPUT_LEN \
--ignore-eos \ --ignore-eos \
--disable-tqdm \ --disable-tqdm \
@ -169,11 +188,11 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len ))
curl -X POST http://0.0.0.0:8004/reset_prefix_cache curl -X POST http://0.0.0.0:8004/reset_prefix_cache
sleep 5 sleep 5
bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt" bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_${request_rate}.txt"
vllm bench serve \ python benchmarks/benchmark_serving.py \
--backend vllm \ --backend vllm \
--model $MODEL \ --model $MODEL \
--dataset-name random \ --dataset-name random \
--random-input-len $adjusted_input_len \ --random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \ --random-output-len $OUTPUT_LEN \
--ignore-eos \ --ignore-eos \
--disable-tqdm \ --disable-tqdm \
@ -254,3 +273,4 @@ done
echo "finish permutations" echo "finish permutations"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT" echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"

View File

@ -1,141 +0,0 @@
# Automated vLLM Server Parameter Tuning
This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [How to Run](#how-to-run)
- [Example Use Cases](#example-use-cases)
- [Output](#output)
- [How It Works](#how-it-works)
## Prerequisites
Before running the script, please ensure the following steps are completed:
1. **Clone vLLM & Set Up Branch**: Clone the vLLM repository and check out to your desired branch.
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# git checkout <your-branch>
```
2. **Install Environment**: Install or update the correct running environment. For TPU usage, activate your `conda` environment and install the corresponding `torch` and `torch_xla` versions.
3. **Model Configuration**: If you are using a customized model, ensure its configuration files are correctly placed and accessible.
## Configuration
You must set the following variables at the top of the script before execution.
| Variable | Description | Example Value |
| --- | --- | --- |
| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
| `MODEL` | **Required.** The Hugging Face model identifier to be served by vllm. | `"meta-llama/Llama-3.1-8B-Instruct"` |
| `SYSTEM` | **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (Profile saving may not be supported on other systems.) | `"TPU"` |
| `TP` | **Required.** The tensor-parallelism size. | `1` |
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MAX_MODEL_LEN` | **Required.** Max model length. | `4096` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
| `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` |
**Note**: The default `NUM_SEQS_LIST` and `NUM_BATCHED_TOKENS_LIST` are set for medium-sized inputs/outputs. For very short contexts (e.g., 20 input, 20 output tokens), you may need to test larger values for `max-num-seqs`.
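Putting the table together, a complete configuration block at the top of the script might look like this (the values are illustrative only):

```bash
# Illustrative configuration; adjust every value for your own run.
BASE="$HOME"                           # parent directory of your vLLM repo
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""                        # empty string = default download path
INPUT_LEN=4000
OUTPUT_LEN=16
MAX_MODEL_LEN=4096
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000    # effectively no latency constraint
NUM_SEQS_LIST="128 256"
NUM_BATCHED_TOKENS_LIST="1024 2048 4096"
```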
## How to Run
1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.
```
cd <FOLDER_OF_THIS_SCRIPT>
bash auto_tune.sh
```
Please note that the command line used to invoke `bash auto_tune.sh` (including any full or partial path) must not contain the keyword `vllm`; otherwise, the script's own `pkill -f vllm` command will also kill the script itself. See the illustration below.
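For instance (paths made up for illustration):

```bash
# BAD: the command line contains "vllm", so the script's own
# `pkill -f vllm` will also match this process and kill the tuning run.
bash ~/vllm/benchmarks/auto_tune/auto_tune.sh

# OK: cd into the script's directory first so "vllm" does not appear
# on the command line of the process running auto_tune.sh.
cd ~/vllm/benchmarks/auto_tune && bash auto_tune.sh
```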
## Example Use Cases
Here are a few examples of how to configure the script for different goals:
### 1. Maximize Throughput (No Latency Constraint)
- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
- **Configuration**:
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
```
### 2. Maximize Throughput with a Latency Requirement
- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms.
- **Configuration**:
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=500
```
### 3. Maximize Throughput with Prefix Caching and Latency Requirements
- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms.
- **Configuration**:
```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MAX_MODEL_LEN=2048
MIN_CACHE_HIT_PCT=60
MAX_LATENCY_ALLOWED_MS=500
```
## Output
After the script finishes, you will find the results in a new, timestamped directory created inside `$BASE/auto-benchmark/`.
- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
- `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
- `bm_log_...txt`: The log output from the `vllm bench serve` command for each benchmark run.
- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.
```
# Example result.txt content
hash:a1b2c3d4...
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
max_num_seqs: 128, max_num_batched_tokens: 4096 does not meet latency requirement 500
...
best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile
```
If it cannot find the best parameters, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can be due to either the server not starting properly, or the latency requirement being too strict.
- **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
## How It Works
The script follows a systematic process to find the optimal parameters:
1. **Find Max GPU Memory Utilization**: The script first determines the highest safe `gpu-memory-utilization` (starting from 0.98 and decreasing) that does not cause an Out-Of-Memory (OOM) error when launching the server. This ensures the benchmark runs use the maximum available memory without crashing.
2. **Iterate and Benchmark**: It then enters a nested loop, iterating through every combination of `max-num-seqs` and `max-num-batched-tokens` provided in the configuration lists.
3. **Latency-Aware Throughput Search**: For each parameter combination (see the sketch after this list):
- The vLLM server is started.
- A benchmark is first run with an infinite request rate (`--request-rate inf`).
- If the resulting P99 E2E latency is within the `MAX_LATENCY_ALLOWED_MS` limit, this throughput is considered the maximum for this configuration.
- If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.
4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.
5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
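The sweep in steps 2-4 can be summarized with the rough sketch below. It is not the actual script: `start_server`, `run_bench`, and `next_lower_rate` are hypothetical stand-ins for the shell functions in `auto_tune.sh`, and `run_bench` is assumed to set `p99_e2el` (ms) and `throughput` (req/s).

```bash
# Rough sketch of the latency-aware throughput search described above.
best_throughput=0
for max_num_seqs in $NUM_SEQS_LIST; do
  for max_num_batched_tokens in $NUM_BATCHED_TOKENS_LIST; do
    start_server "$max_num_seqs" "$max_num_batched_tokens"
    request_rate=inf
    while true; do
      run_bench "$request_rate"   # assumed to set p99_e2el and throughput
      if (( $(echo "$p99_e2el <= $MAX_LATENCY_ALLOWED_MS" | bc -l) )); then
        # Latency constraint met: keep the best throughput seen so far.
        if (( $(echo "$throughput > $best_throughput" | bc -l) )); then
          best_throughput=$throughput
          best_max_num_seqs=$max_num_seqs
          best_num_batched_tokens=$max_num_batched_tokens
        fi
        break
      fi
      # Latency too high: back off the request rate, or give up on this
      # combination when no lower rate is left to try.
      request_rate=$(next_lower_rate "$request_rate") || break
    done
  done
done
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput"
```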

View File

@ -30,11 +30,17 @@ from datasets import load_dataset
from PIL import Image from PIL import Image
from transformers import PreTrainedTokenizerBase from transformers import PreTrainedTokenizerBase
from vllm.lora.request import LoRARequest try:
from vllm.lora.utils import get_adapter_absolute_path from vllm.lora.request import LoRARequest
from vllm.multimodal import MultiModalDataDict from vllm.lora.utils import get_adapter_absolute_path
from vllm.multimodal.image import convert_image_mode from vllm.multimodal import MultiModalDataDict
from vllm.transformers_utils.tokenizer import AnyTokenizer, get_lora_tokenizer from vllm.multimodal.image import convert_image_mode
from vllm.transformers_utils.tokenizer import AnyTokenizer, get_lora_tokenizer
except:
MultiModalDataDict = None
AnyTokenizer = None
LoRARequest = None
print("Install vLLM to use LoRA or Multimodal benchmarking.")
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)

View File

@ -11,7 +11,6 @@ from typing import Any, Optional
import numpy as np import numpy as np
from tqdm import tqdm from tqdm import tqdm
from typing_extensions import deprecated
import vllm.envs as envs import vllm.envs as envs
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
@ -35,10 +34,6 @@ def save_to_pytorch_benchmark_format(
write_to_json(pt_file, pt_records) write_to_json(pt_file, pt_records)
@deprecated(
"benchmark_latency.py is deprecated and will be removed in a "
"future version. Please use 'vllm bench latency' instead.",
)
def main(args: argparse.Namespace): def main(args: argparse.Namespace):
print(args) print(args)

View File

@ -30,7 +30,7 @@ import os
import random import random
import time import time
import warnings import warnings
from collections.abc import Iterable from collections.abc import AsyncGenerator, Iterable
from dataclasses import dataclass from dataclasses import dataclass
from datetime import datetime from datetime import datetime
from typing import Any, Literal, Optional from typing import Any, Literal, Optional
@ -38,7 +38,6 @@ from typing import Any, Literal, Optional
import numpy as np import numpy as np
from tqdm.asyncio import tqdm from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase from transformers import PreTrainedTokenizerBase
from typing_extensions import deprecated
from backend_request_func import ( from backend_request_func import (
ASYNC_REQUEST_FUNCS, ASYNC_REQUEST_FUNCS,
@ -74,7 +73,6 @@ from benchmark_dataset import (
VisionArenaDataset, VisionArenaDataset,
) )
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
from vllm.benchmarks.serve import get_request
MILLISECONDS_TO_SECONDS_CONVERSION = 1000 MILLISECONDS_TO_SECONDS_CONVERSION = 1000
@ -109,6 +107,101 @@ class BenchmarkMetrics:
percentiles_e2el_ms: list[tuple[float, float]] percentiles_e2el_ms: list[tuple[float, float]]
def _get_current_request_rate(
ramp_up_strategy: Optional[Literal["linear", "exponential"]],
ramp_up_start_rps: Optional[int],
ramp_up_end_rps: Optional[int],
request_index: int,
total_requests: int,
request_rate: float,
) -> float:
if (
ramp_up_strategy
and ramp_up_start_rps is not None
and ramp_up_end_rps is not None
):
progress = request_index / max(total_requests - 1, 1)
if ramp_up_strategy == "linear":
increase = (ramp_up_end_rps - ramp_up_start_rps) * progress
return ramp_up_start_rps + increase
elif ramp_up_strategy == "exponential":
ratio = ramp_up_end_rps / ramp_up_start_rps
return ramp_up_start_rps * (ratio**progress)
else:
raise ValueError(f"Unknown ramp-up strategy: {ramp_up_strategy}")
return request_rate
async def get_request(
input_requests: list[SampleRequest],
request_rate: float,
burstiness: float = 1.0,
ramp_up_strategy: Optional[Literal["linear", "exponential"]] = None,
ramp_up_start_rps: Optional[int] = None,
ramp_up_end_rps: Optional[int] = None,
) -> AsyncGenerator[tuple[SampleRequest, float], None]:
"""
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness and OPTIONAL ramp-up strategy.
Args:
input_requests:
A list of input requests, each represented as a SampleRequest.
request_rate:
The rate at which requests are generated (requests/s).
burstiness (optional):
The burstiness factor of the request generation.
Only takes effect when request_rate is not inf.
Default value is 1, which follows a Poisson process.
Otherwise, the request intervals follow a gamma distribution.
A lower burstiness value (0 < burstiness < 1) results
in more bursty requests, while a higher burstiness value
(burstiness > 1) results in a more uniform arrival of requests.
ramp_up_strategy (optional):
The ramp-up strategy. Can be "linear" or "exponential".
If None, uses constant request rate (specified by request_rate).
ramp_up_start_rps (optional):
The starting request rate for ramp-up.
ramp_up_end_rps (optional):
The ending request rate for ramp-up.
"""
assert burstiness > 0, (
f"A positive burstiness factor is expected, but given {burstiness}."
)
# Convert to list to get length for ramp-up calculations
if isinstance(input_requests, Iterable) and not isinstance(input_requests, list):
input_requests = list(input_requests)
total_requests = len(input_requests)
request_index = 0
for request in input_requests:
current_request_rate = _get_current_request_rate(
ramp_up_strategy,
ramp_up_start_rps,
ramp_up_end_rps,
request_index,
total_requests,
request_rate,
)
yield request, current_request_rate
request_index += 1
if current_request_rate == float("inf"):
# If the request rate is infinity, then we don't need to wait.
continue
theta = 1.0 / (current_request_rate * burstiness)
# Sample the request interval from the gamma distribution.
# If burstiness is 1, it follows exponential distribution.
interval = np.random.gamma(shape=burstiness, scale=theta)
# The next request will be sent after the interval.
await asyncio.sleep(interval)
def calculate_metrics( def calculate_metrics(
input_requests: list[SampleRequest], input_requests: list[SampleRequest],
outputs: list[RequestFuncOutput], outputs: list[RequestFuncOutput],
@ -594,10 +687,6 @@ def save_to_pytorch_benchmark_format(
write_to_json(pt_file, pt_records) write_to_json(pt_file, pt_records)
@deprecated(
"benchmark_serving.py is deprecated and will be removed in a future "
"version. Please use 'vllm bench serve' instead.",
)
def main(args: argparse.Namespace): def main(args: argparse.Namespace):
print(args) print(args)
random.seed(args.seed) random.seed(args.seed)

View File

@ -15,7 +15,6 @@ import torch
import uvloop import uvloop
from tqdm import tqdm from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerBase from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerBase
from typing_extensions import deprecated
from benchmark_dataset import ( from benchmark_dataset import (
AIMODataset, AIMODataset,
@ -168,8 +167,7 @@ async def run_vllm_async(
from vllm import SamplingParams from vllm import SamplingParams
async with build_async_engine_client_from_engine_args( async with build_async_engine_client_from_engine_args(
engine_args, engine_args, disable_frontend_multiprocessing
disable_frontend_multiprocessing=disable_frontend_multiprocessing,
) as llm: ) as llm:
model_config = await llm.get_model_config() model_config = await llm.get_model_config()
assert all( assert all(
@ -383,10 +381,6 @@ def get_requests(args, tokenizer):
return dataset_cls(**common_kwargs).sample(**sample_kwargs) return dataset_cls(**common_kwargs).sample(**sample_kwargs)
@deprecated(
"benchmark_throughput.py is deprecated and will be removed in a "
"future version. Please use 'vllm bench throughput' instead.",
)
def main(args: argparse.Namespace): def main(args: argparse.Namespace):
if args.seed is None: if args.seed is None:
args.seed = 0 args.seed = 0

View File

@ -3,7 +3,7 @@
# benchmark the overhead of disaggregated prefill. # benchmark the overhead of disaggregated prefill.
# methodology: # methodology:
# - send all request to prefill vLLM instance. It will buffer KV cache. # - send all request to prefill vLLM instance. It will buffer KV cache.
# - then send all request to decode instance. # - then send all request to decode instance.
# - The TTFT of decode instance is the overhead. # - The TTFT of decode instance is the overhead.
set -ex set -ex
@ -12,8 +12,6 @@ kill_gpu_processes() {
# kill all processes on GPU. # kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9 pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9 pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
sleep 10 sleep 10
# remove vllm config file # remove vllm config file
@ -63,7 +61,7 @@ benchmark() {
--gpu-memory-utilization 0.6 \ --gpu-memory-utilization 0.6 \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' & '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \ CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \ -m vllm.entrypoints.openai.api_server \
@ -78,38 +76,38 @@ benchmark() {
wait_for_server 8200 wait_for_server 8200
# let the prefill instance finish prefill # let the prefill instance finish prefill
vllm bench serve \ python3 ../benchmark_serving.py \
--backend vllm \ --backend vllm \
--model $model \ --model $model \
--dataset-name $dataset_name \ --dataset-name $dataset_name \
--dataset-path $dataset_path \ --dataset-path $dataset_path \
--sonnet-input-len $input_len \ --sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \ --sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \ --sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \ --num-prompts $num_prompts \
--port 8100 \ --port 8100 \
--save-result \ --save-result \
--result-dir $results_folder \ --result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \ --result-filename disagg_prefill_tp1.json \
--request-rate "inf" --request-rate "inf"
# send the request to decode. # send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl. # The TTFT of this command will be the overhead of disagg prefill impl.
vllm bench serve \ python3 ../benchmark_serving.py \
--backend vllm \ --backend vllm \
--model $model \ --model $model \
--dataset-name $dataset_name \ --dataset-name $dataset_name \
--dataset-path $dataset_path \ --dataset-path $dataset_path \
--sonnet-input-len $input_len \ --sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \ --sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \ --sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \ --num-prompts $num_prompts \
--port 8200 \ --port 8200 \
--save-result \ --save-result \
--result-dir $results_folder \ --result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \ --result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps" --request-rate "$qps"
kill_gpu_processes kill_gpu_processes
} }

View File

@ -18,8 +18,6 @@ kill_gpu_processes() {
# kill all processes on GPU. # kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9 pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9 pgrep python3 | xargs -r kill -9
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1 sleep 1
} }
@ -60,7 +58,7 @@ launch_chunked_prefill() {
launch_disagg_prefill() { launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct" model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill # disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \ CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \ -m vllm.entrypoints.openai.api_server \
@ -99,20 +97,20 @@ benchmark() {
output_len=$2 output_len=$2
tag=$3 tag=$3
vllm bench serve \ python3 ../benchmark_serving.py \
--backend vllm \ --backend vllm \
--model $model \ --model $model \
--dataset-name $dataset_name \ --dataset-name $dataset_name \
--dataset-path $dataset_path \ --dataset-path $dataset_path \
--sonnet-input-len $input_len \ --sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \ --sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \ --sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \ --num-prompts $num_prompts \
--port 8000 \ --port 8000 \
--save-result \ --save-result \
--result-dir $results_folder \ --result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \ --result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps" --request-rate "$qps"
sleep 2 sleep 2
} }

View File

@ -576,11 +576,7 @@ def main(args: argparse.Namespace):
topk = config.num_experts_per_tok topk = config.num_experts_per_tok
intermediate_size = config.intermediate_size intermediate_size = config.intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size shard_intermediate_size = 2 * intermediate_size // args.tp_size
elif config.architectures[0] in ( elif config.architectures[0] in ("DeepseekV3ForCausalLM", "DeepseekV2ForCausalLM"):
"DeepseekV3ForCausalLM",
"DeepseekV2ForCausalLM",
"Glm4MoeForCausalLM",
):
E = config.n_routed_experts E = config.n_routed_experts
topk = config.num_experts_per_tok topk = config.num_experts_per_tok
intermediate_size = config.moe_intermediate_size intermediate_size = config.moe_intermediate_size
@ -590,11 +586,6 @@ def main(args: argparse.Namespace):
topk = config.num_experts_per_tok topk = config.num_experts_per_tok
intermediate_size = config.moe_intermediate_size intermediate_size = config.moe_intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size shard_intermediate_size = 2 * intermediate_size // args.tp_size
elif config.architectures[0] in ("HunYuanMoEV1ForCausalLM"):
E = config.num_experts
topk = config.moe_topk[0]
intermediate_size = config.moe_intermediate_size[0]
shard_intermediate_size = 2 * intermediate_size // args.tp_size
else: else:
# Support for llama4 # Support for llama4
config = config.get_text_config() config = config.get_text_config()

View File

@ -5,8 +5,9 @@ import itertools
import torch import torch
from vllm import _custom_ops as ops
from vllm.model_executor.layers.fused_moe.moe_align_block_size import ( from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
moe_align_block_size, moe_align_block_size_triton,
) )
from vllm.triton_utils import triton from vllm.triton_utils import triton
@ -20,6 +21,62 @@ def get_topk_ids(num_tokens: int, num_experts: int, topk: int) -> torch.Tensor:
) )
def check_correctness(num_tokens, num_experts=256, block_size=256, topk=8):
"""
Verifies vllm vs. Triton
"""
topk_ids = get_topk_ids(num_tokens, num_experts, topk)
# 1. malloc space for triton and vllm
# malloc enough space (max_num_tokens_padded) for the sorted ids
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids_triton = torch.empty(
(max_num_tokens_padded,), dtype=torch.int32, device="cuda"
)
sorted_ids_triton.fill_(topk_ids.numel()) # fill with sentinel value
expert_ids_triton = torch.zeros(
(max_num_tokens_padded // block_size,), dtype=torch.int32, device="cuda"
)
num_tokens_post_pad_triton = torch.empty((1,), dtype=torch.int32, device="cuda")
sorted_ids_vllm = torch.empty_like(sorted_ids_triton)
sorted_ids_vllm.fill_(topk_ids.numel())
expert_ids_vllm = torch.zeros_like(expert_ids_triton)
num_tokens_post_pad_vllm = torch.empty_like(num_tokens_post_pad_triton)
# 2. run implementations
moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids_triton,
expert_ids_triton,
num_tokens_post_pad_triton,
)
ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids_vllm,
expert_ids_vllm,
num_tokens_post_pad_vllm,
)
print(f"✅ VLLM implementation works with {num_experts} experts!")
# 3. compare results
if torch.allclose(expert_ids_triton, expert_ids_vllm) and torch.allclose(
num_tokens_post_pad_triton, num_tokens_post_pad_vllm
):
print("✅ Triton and VLLM implementations match.")
else:
print("❌ Triton and VLLM implementations DO NOT match.")
print("Triton expert_ids:", expert_ids_triton)
print("VLLM expert_ids:", expert_ids_vllm)
print("Triton num_tokens_post_pad:", num_tokens_post_pad_triton)
print("VLLM num_tokens_post_pad:", num_tokens_post_pad_vllm)
# test configurations # test configurations
num_tokens_range = [1, 16, 256, 4096] num_tokens_range = [1, 16, 256, 4096]
num_experts_range = [16, 64, 224, 256, 280, 512] num_experts_range = [16, 64, 224, 256, 280, 512]
@ -32,8 +89,8 @@ configs = list(itertools.product(num_tokens_range, num_experts_range, topk_range
x_names=["num_tokens", "num_experts", "topk"], x_names=["num_tokens", "num_experts", "topk"],
x_vals=configs, x_vals=configs,
line_arg="provider", line_arg="provider",
line_vals=["vllm"], line_vals=["vllm", "triton"], # "triton"
line_names=["vLLM"], line_names=["VLLM", "Triton"], # "Triton"
plot_name="moe-align-block-size-performance", plot_name="moe-align-block-size-performance",
args={}, args={},
) )
@ -43,11 +100,37 @@ def benchmark(num_tokens, num_experts, topk, provider):
block_size = 256 block_size = 256
topk_ids = get_topk_ids(num_tokens, num_experts, topk) topk_ids = get_topk_ids(num_tokens, num_experts, topk)
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids = torch.empty((max_num_tokens_padded,), dtype=torch.int32, device="cuda")
sorted_ids.fill_(topk_ids.numel())
max_num_m_blocks = max_num_tokens_padded // block_size
expert_ids = torch.empty((max_num_m_blocks,), dtype=torch.int32, device="cuda")
num_tokens_post_pad = torch.empty((1,), dtype=torch.int32, device="cuda")
quantiles = [0.5, 0.2, 0.8] quantiles = [0.5, 0.2, 0.8]
if provider == "vllm": if provider == "vllm":
ms, min_ms, max_ms = triton.testing.do_bench( ms, min_ms, max_ms = triton.testing.do_bench(
lambda: moe_align_block_size(topk_ids, block_size, num_experts), lambda: ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
quantiles=quantiles,
)
elif provider == "triton":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
quantiles=quantiles, quantiles=quantiles,
) )
@ -71,4 +154,6 @@ if __name__ == "__main__":
) )
args = parser.parse_args() args = parser.parse_args()
print("Running correctness check...")
check_correctness(num_tokens=1024, num_experts=args.num_experts, topk=args.topk)
benchmark.run(print_data=True, show_plots=True) benchmark.run(print_data=True, show_plots=True)

View File

@ -8,13 +8,12 @@ import ray
import torch import torch
from transformers import AutoConfig from transformers import AutoConfig
from vllm.model_executor.layers.fused_moe.fused_moe import * from vllm.model_executor.layers.fused_moe.deep_gemm_moe import (
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
_moe_permute, _moe_permute,
_moe_unpermute_and_reduce, _moe_unpermute_and_reduce,
moe_permute,
moe_unpermute,
) )
from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import *
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser from vllm.utils import FlexibleArgumentParser
@ -64,19 +63,18 @@ def benchmark_permute(
def run(): def run():
if use_customized_permute: if use_customized_permute:
( (permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = (
permuted_hidden_states, moe_permute(
a1q_scale, qhidden_states,
first_token_off, topk_weights=topk_weights,
inv_perm_idx, topk_ids=topk_ids,
m_indices, token_expert_indices=token_expert_indices,
) = moe_permute( topk=topk,
qhidden_states, n_expert=num_experts,
a1q_scale=None, n_local_expert=num_experts,
topk_ids=topk_ids, expert_map=None,
n_expert=num_experts, align_block_size=align_block_size,
expert_map=None, )
align_block_size=align_block_size,
) )
else: else:
( (
@ -152,19 +150,18 @@ def benchmark_unpermute(
def prepare(): def prepare():
if use_customized_permute: if use_customized_permute:
( (permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = (
permuted_hidden_states, moe_permute(
a1q_scale, qhidden_states,
first_token_off, topk_weights=topk_weights,
inv_perm_idx, topk_ids=topk_ids,
m_indices, token_expert_indices=token_expert_indices,
) = moe_permute( topk=topk,
qhidden_states, n_expert=num_experts,
a1q_scale=None, n_local_expert=num_experts,
topk_ids=topk_ids, expert_map=None,
n_expert=num_experts, align_block_size=align_block_size,
expert_map=None, )
align_block_size=align_block_size,
) )
# convert to fp16/bf16 as gemm output # convert to fp16/bf16 as gemm output
return ( return (
@ -194,19 +191,16 @@ def benchmark_unpermute(
def run(input: tuple): def run(input: tuple):
if use_customized_permute: if use_customized_permute:
( (permuted_hidden_states, first_token_off, inv_perm_idx, m_indices) = input
permuted_hidden_states,
first_token_off,
inv_perm_idx,
m_indices,
) = input
output = torch.empty_like(hidden_states)
moe_unpermute( moe_unpermute(
output,
permuted_hidden_states, permuted_hidden_states,
topk_weights, topk_weights,
topk_ids,
inv_perm_idx, inv_perm_idx,
first_token_off, first_token_off,
topk,
num_experts,
num_experts,
) )
else: else:
( (
@ -217,11 +211,7 @@ def benchmark_unpermute(
inv_perm, inv_perm,
) = input ) = input
_moe_unpermute_and_reduce( _moe_unpermute_and_reduce(
output_hidden_states, output_hidden_states, permuted_hidden_states, inv_perm, topk_weights
permuted_hidden_states,
inv_perm,
topk_weights,
True,
) )
# JIT compilation & warmup # JIT compilation & warmup
@ -328,7 +318,6 @@ def main(args: argparse.Namespace):
elif ( elif (
config.architectures[0] == "DeepseekV3ForCausalLM" config.architectures[0] == "DeepseekV3ForCausalLM"
or config.architectures[0] == "DeepseekV2ForCausalLM" or config.architectures[0] == "DeepseekV2ForCausalLM"
or config.architectures[0] == "Glm4MoeForCausalLM"
): ):
E = config.n_routed_experts E = config.n_routed_experts
topk = config.num_experts_per_tok topk = config.num_experts_per_tok

View File

@ -1,108 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import gc
import time
from typing import Optional
from tabulate import tabulate
from vllm.utils import FlexibleArgumentParser
from vllm.v1.core.block_pool import BlockPool
class Metric:
def __init__(self) -> None:
self.cnt: int = 0
self.sum_v: int = 0
self.max_v: Optional[int] = None
def update(self, v: int) -> None:
self.cnt += 1
self.sum_v += v
if self.max_v is None:
self.max_v = v
else:
self.max_v = max(self.max_v, v)
def avg_v(self) -> float:
return self.sum_v * 1.0 / self.cnt
def main(args):
rows = []
for allocate_block in args.allocate_blocks:
# Enforce a GC collect ahead to minimize the impact among runs
gc.collect()
block_pool = BlockPool(num_gpu_blocks=args.num_gpu_blocks, enable_caching=True)
get_blocks_metric: Metric = Metric()
free_blocks_metric: Metric = Metric()
for _ in range(args.num_iteration):
t1 = time.monotonic_ns()
blocks = block_pool.get_new_blocks(allocate_block)
t2 = time.monotonic_ns()
block_pool.free_blocks(blocks)
t3 = time.monotonic_ns()
get_blocks_metric.update(t2 - t1)
free_blocks_metric.update(t3 - t2)
if get_blocks_metric.max_v is not None and free_blocks_metric.max_v is not None:
rows.append(
[
get_blocks_metric.cnt,
args.num_gpu_blocks,
allocate_block,
get_blocks_metric.avg_v() / 1000000,
get_blocks_metric.max_v / 1000000.0,
free_blocks_metric.avg_v() / 1000000,
free_blocks_metric.max_v / 1000000.0,
]
)
else:
print(
"No valid metrics found."
f" {get_blocks_metric.max_v=} {free_blocks_metric.max_v=}"
)
print(
tabulate(
rows,
headers=[
"Iterations",
"Total\nBlocks",
"Allocated\nBlocks",
"Get Blocks\nAvg (ms)",
"Get Blocks\nMax (ms)",
"Free Blocks\nAvg (ms)",
"Free Blocks\nMax (ms)",
],
tablefmt="grid",
floatfmt=".6f",
)
)
def invoke_main() -> None:
parser = FlexibleArgumentParser(
description="Benchmark the performance of BlockPool for KV Cache."
)
parser.add_argument("--num-gpu-blocks", type=int, default=100000)
parser.add_argument(
"--num-iteration",
type=int,
default=1000,
help="Number of iterations to run to stablize final data readings",
)
parser.add_argument(
"--allocate-blocks",
type=int,
nargs="*",
default=[10, 50, 100, 500, 1000],
help="Number of blocks to allocate",
)
args = parser.parse_args()
main(args)
if __name__ == "__main__":
invoke_main() # pragma: no cover

View File

@ -58,22 +58,6 @@ function (find_isa CPUINFO TARGET OUT)
endif() endif()
endfunction() endfunction()
function(check_sysctl TARGET OUT)
execute_process(COMMAND sysctl -n "${TARGET}"
RESULT_VARIABLE SYSCTL_RET
OUTPUT_VARIABLE SYSCTL_INFO
ERROR_QUIET
OUTPUT_STRIP_TRAILING_WHITESPACE)
if(SYSCTL_RET EQUAL 0 AND
(SYSCTL_INFO STREQUAL "1" OR SYSCTL_INFO GREATER 0))
set(${OUT} ON PARENT_SCOPE)
else()
set(${OUT} OFF PARENT_SCOPE)
endif()
endfunction()
function (is_avx512_disabled OUT) function (is_avx512_disabled OUT)
set(DISABLE_AVX512 $ENV{VLLM_CPU_DISABLE_AVX512}) set(DISABLE_AVX512 $ENV{VLLM_CPU_DISABLE_AVX512})
if(DISABLE_AVX512 AND DISABLE_AVX512 STREQUAL "true") if(DISABLE_AVX512 AND DISABLE_AVX512 STREQUAL "true")
@ -86,10 +70,7 @@ endfunction()
is_avx512_disabled(AVX512_DISABLED) is_avx512_disabled(AVX512_DISABLED)
if (MACOSX_FOUND AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64") if (MACOSX_FOUND AND CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
message(STATUS "Apple Silicon Detected") set(APPLE_SILICON_FOUND TRUE)
set(ENABLE_NUMA OFF)
check_sysctl(hw.optional.neon ASIMD_FOUND)
check_sysctl(hw.optional.arm.FEAT_BF16 ARM_BF16_FOUND)
else() else()
find_isa(${CPUINFO} "avx2" AVX2_FOUND) find_isa(${CPUINFO} "avx2" AVX2_FOUND)
find_isa(${CPUINFO} "avx512f" AVX512_FOUND) find_isa(${CPUINFO} "avx512f" AVX512_FOUND)
@ -101,6 +82,7 @@ else()
find_isa(${CPUINFO} "S390" S390_FOUND) find_isa(${CPUINFO} "S390" S390_FOUND)
endif() endif()
if (AVX512_FOUND AND NOT AVX512_DISABLED) if (AVX512_FOUND AND NOT AVX512_DISABLED)
list(APPEND CXX_COMPILE_FLAGS list(APPEND CXX_COMPILE_FLAGS
"-mavx512f" "-mavx512f"
@ -167,6 +149,9 @@ elseif (ASIMD_FOUND)
set(MARCH_FLAGS "-march=armv8.2-a+dotprod+fp16") set(MARCH_FLAGS "-march=armv8.2-a+dotprod+fp16")
endif() endif()
list(APPEND CXX_COMPILE_FLAGS ${MARCH_FLAGS}) list(APPEND CXX_COMPILE_FLAGS ${MARCH_FLAGS})
elseif(APPLE_SILICON_FOUND)
message(STATUS "Apple Silicon Detected")
set(ENABLE_NUMA OFF)
elseif (S390_FOUND) elseif (S390_FOUND)
message(STATUS "S390 detected") message(STATUS "S390 detected")
# Check for S390 VXE support # Check for S390 VXE support

View File

@ -24,7 +24,6 @@
#include "attention_dtypes.h" #include "attention_dtypes.h"
#include "attention_utils.cuh" #include "attention_utils.cuh"
#include "../cuda_compat.h"
#ifdef USE_ROCM #ifdef USE_ROCM
#include <hip/hip_bf16.h> #include <hip/hip_bf16.h>
@ -34,6 +33,12 @@ typedef __hip_bfloat16 __nv_bfloat16;
#include "../quantization/fp8/nvidia/quant_utils.cuh" #include "../quantization/fp8/nvidia/quant_utils.cuh"
#endif #endif
#ifndef USE_ROCM
#define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif
#define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b))
#define DIVIDE_ROUND_UP(a, b) (((a) + (b) - 1) / (b)) #define DIVIDE_ROUND_UP(a, b) (((a) + (b) - 1) / (b))
@ -665,6 +670,7 @@ __global__ void paged_attention_v2_reduce_kernel(
} // namespace vllm } // namespace vllm
#undef WARP_SIZE
#undef MAX #undef MAX
#undef MIN #undef MIN
#undef DIVIDE_ROUND_UP #undef DIVIDE_ROUND_UP

View File

@ -18,7 +18,6 @@ limitations under the License.
* Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929 * Taken from SGLANG PR https://github.com/sgl-project/sglang/pull/6929
* by Alcanderian JieXin Liang * by Alcanderian JieXin Liang
*/ */
#include "core/registration.h"
#include <ATen/cuda/CUDAContext.h> #include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h> #include <c10/cuda/CUDAGuard.h>
@ -271,13 +270,4 @@ int64_t sm100_cutlass_mla_get_workspace_size(int64_t max_seq_len, int64_t num_ba
} }
#endif #endif
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("sm100_cutlass_mla_decode", &sm100_cutlass_mla_decode);
}
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CatchAll, m) {
m.impl("sm100_cutlass_mla_get_workspace_size", &sm100_cutlass_mla_get_workspace_size);
}
// clang-format on // clang-format on

View File

@ -16,8 +16,14 @@
* See the License for the specific language governing permissions and * See the License for the specific language governing permissions and
* limitations under the License. * limitations under the License.
*/ */
#include "attention_kernels.cuh" #include "attention_kernels.cuh"
#include "../cuda_compat.h"
#ifndef USE_ROCM
#define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif
#define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b))
@ -74,7 +80,7 @@ void paged_attention_v1_launcher(
const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr()); const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr()); const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
const int NUM_WARPS = NUM_THREADS / WARP_SIZE; constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int padded_max_seq_len = int padded_max_seq_len =
DIVIDE_ROUND_UP(max_seq_len, BLOCK_SIZE) * BLOCK_SIZE; DIVIDE_ROUND_UP(max_seq_len, BLOCK_SIZE) * BLOCK_SIZE;
int logits_size = padded_max_seq_len * sizeof(float); int logits_size = padded_max_seq_len * sizeof(float);
@ -181,6 +187,7 @@ void paged_attention_v1(
CALL_V1_LAUNCHER_BLOCK_SIZE) CALL_V1_LAUNCHER_BLOCK_SIZE)
} }
#undef WARP_SIZE
#undef MAX #undef MAX
#undef MIN #undef MIN
#undef DIVIDE_ROUND_UP #undef DIVIDE_ROUND_UP

View File

@ -16,8 +16,14 @@
* See the License for the specific language governing permissions and * See the License for the specific language governing permissions and
* limitations under the License. * limitations under the License.
*/ */
#include "attention_kernels.cuh" #include "attention_kernels.cuh"
#include "../cuda_compat.h"
#ifndef USE_ROCM
#define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif
#define MAX(a, b) ((a) > (b) ? (a) : (b)) #define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b)) #define MIN(a, b) ((a) < (b) ? (a) : (b))
@ -78,7 +84,7 @@ void paged_attention_v2_launcher(
const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr()); const float* k_scale_ptr = reinterpret_cast<const float*>(k_scale.data_ptr());
const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr()); const float* v_scale_ptr = reinterpret_cast<const float*>(v_scale.data_ptr());
const int NUM_WARPS = NUM_THREADS / WARP_SIZE; constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE;
int max_num_partitions = DIVIDE_ROUND_UP(max_seq_len, PARTITION_SIZE); int max_num_partitions = DIVIDE_ROUND_UP(max_seq_len, PARTITION_SIZE);
int logits_size = PARTITION_SIZE * sizeof(float); int logits_size = PARTITION_SIZE * sizeof(float);
int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float); int outputs_size = (NUM_WARPS / 2) * head_size * sizeof(float);
@ -191,6 +197,7 @@ void paged_attention_v2(
CALL_V2_LAUNCHER_BLOCK_SIZE) CALL_V2_LAUNCHER_BLOCK_SIZE)
} }
#undef WARP_SIZE
#undef MAX #undef MAX
#undef MIN #undef MIN
#undef DIVIDE_ROUND_UP #undef DIVIDE_ROUND_UP

View File

@ -58,7 +58,7 @@ namespace {
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous") #define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_LAST_DIM_CONTIGUOUS(x) \ #define CHECK_LAST_DIM_CONTIGUOUS(x) \
TORCH_CHECK(x.strides()[x.strides().size() - 1] == 1, #x "must be contiguous at last dimension") TORCH_CHECK(x.strides()[x.strides().size() - 1] == 1, #x "must be contiguous at last dimention")
#define CHECK_INPUT(x) \ #define CHECK_INPUT(x) \
CHECK_CPU(x); \ CHECK_CPU(x); \

View File

@ -126,7 +126,7 @@ void fused_experts_int4_w4a16_kernel_impl(
int64_t topk, int64_t topk,
int64_t num_tokens_post_pad); int64_t num_tokens_post_pad);
// shared expert implementation for int8 w8a8 // shared expert implememntation for int8 w8a8
template <typename scalar_t> template <typename scalar_t>
void shared_expert_int8_kernel_impl( void shared_expert_int8_kernel_impl(
scalar_t* __restrict__ output, scalar_t* __restrict__ output,

View File

@ -41,7 +41,7 @@ struct tinygemm_kernel_nn<at::BFloat16, has_bias, BLOCK_M, BLOCK_N> {
__m512 vd0; __m512 vd0;
__m512 vd1[COLS]; __m512 vd1[COLS];
// oops! 4x4 spills but luckily we use 4x2 // oops! 4x4 spills but luckly we use 4x2
__m512 vbias[COLS]; __m512 vbias[COLS];
// [NOTE]: s8s8 igemm compensation in avx512-vnni // [NOTE]: s8s8 igemm compensation in avx512-vnni

View File

@ -37,7 +37,7 @@ inline Vectorized<at::BFloat16> convert_from_float_ext<at::BFloat16>(const Vecto
#define CVT_FP16_TO_FP32(a) \ #define CVT_FP16_TO_FP32(a) \
_mm512_cvtps_ph(a, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)) _mm512_cvtps_ph(a, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC))
// this doesn't handle NaN. // this doesn't hanel NaN.
inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec) { inline __m512bh cvt_e4m3_bf16_intrinsic_no_nan(__m256i fp8_vec) {
const __m512i x = _mm512_cvtepu8_epi16(fp8_vec); const __m512i x = _mm512_cvtepu8_epi16(fp8_vec);

View File

@ -7,7 +7,7 @@
namespace { namespace {
#define MAX_SHM_RANK_NUM 8 #define MAX_SHM_RANK_NUM 8
#define PER_THREAD_SHM_BUFFER_BYTES (4 * 1024 * 1024) #define PER_THREAD_SHM_BUFFER_BYTES (2 * 1024 * 1024)
static_assert(PER_THREAD_SHM_BUFFER_BYTES % 2 == 0); static_assert(PER_THREAD_SHM_BUFFER_BYTES % 2 == 0);
#define PER_THREAD_SHM_BUFFER_OFFSET (PER_THREAD_SHM_BUFFER_BYTES >> 1) #define PER_THREAD_SHM_BUFFER_OFFSET (PER_THREAD_SHM_BUFFER_BYTES >> 1)
#define MIN_THREAD_PROCESS_SIZE (256) #define MIN_THREAD_PROCESS_SIZE (256)
@ -34,10 +34,9 @@ struct KernelVecType<c10::Half> {
}; };
struct ThreadSHMContext { struct ThreadSHMContext {
volatile char _curr_thread_stamp[2]; volatile char _curr_thread_stamp;
volatile char _ready_thread_stamp[2]; volatile char _ready_thread_stamp;
int local_stamp_buffer_idx; char _padding1[6];
int remote_stamp_buffer_idx;
int thread_id; int thread_id;
int thread_num; int thread_num;
int rank; int rank;
@ -46,28 +45,23 @@ struct ThreadSHMContext {
int swizzled_ranks[MAX_SHM_RANK_NUM]; int swizzled_ranks[MAX_SHM_RANK_NUM];
void* thread_shm_ptrs[MAX_SHM_RANK_NUM]; void* thread_shm_ptrs[MAX_SHM_RANK_NUM];
ThreadSHMContext* shm_contexts[MAX_SHM_RANK_NUM]; ThreadSHMContext* shm_contexts[MAX_SHM_RANK_NUM];
size_t _thread_buffer_mask[2]; size_t _thread_buffer_mask;
char _padding2[40]; char _padding2[56];
ThreadSHMContext(const int thread_id, const int thread_num, const int rank, ThreadSHMContext(const int thread_id, const int thread_num, const int rank,
const int group_size, void* thread_shm_ptr) const int group_size, void* thread_shm_ptr)
: local_stamp_buffer_idx(0), : _curr_thread_stamp(1),
remote_stamp_buffer_idx(0), _ready_thread_stamp(0),
thread_id(thread_id), thread_id(thread_id),
thread_num(thread_num), thread_num(thread_num),
rank(rank), rank(rank),
group_size(group_size), group_size(group_size),
_spinning_count(0) { _spinning_count(0),
_thread_buffer_mask(0) {
static_assert(sizeof(ThreadSHMContext) % 64 == 0); static_assert(sizeof(ThreadSHMContext) % 64 == 0);
TORCH_CHECK(group_size <= MAX_SHM_RANK_NUM); TORCH_CHECK(group_size <= MAX_SHM_RANK_NUM);
TORCH_CHECK((size_t)this % 64 == 0); TORCH_CHECK((size_t)this % 64 == 0);
TORCH_CHECK((size_t)thread_shm_ptr % 64 == 0); TORCH_CHECK((size_t)thread_shm_ptr % 64 == 0);
_curr_thread_stamp[0] = 1;
_curr_thread_stamp[1] = 1;
_ready_thread_stamp[0] = 0;
_ready_thread_stamp[1] = 0;
_thread_buffer_mask[0] = 0;
_thread_buffer_mask[1] = 0;
for (int i = 0; i < MAX_SHM_RANK_NUM; ++i) { for (int i = 0; i < MAX_SHM_RANK_NUM; ++i) {
shm_contexts[i] = nullptr; shm_contexts[i] = nullptr;
thread_shm_ptrs[i] = nullptr; thread_shm_ptrs[i] = nullptr;
@ -76,11 +70,6 @@ struct ThreadSHMContext {
set_context(rank, this, thread_shm_ptr); set_context(rank, this, thread_shm_ptr);
} }
void set_stamp_buffer_idx(int local, int remote) {
local_stamp_buffer_idx = local;
remote_stamp_buffer_idx = remote;
}
void set_context(int rank, ThreadSHMContext* ptr, void* thread_shm_ptr) { void set_context(int rank, ThreadSHMContext* ptr, void* thread_shm_ptr) {
TORCH_CHECK(rank < MAX_SHM_RANK_NUM); TORCH_CHECK(rank < MAX_SHM_RANK_NUM);
TORCH_CHECK(ptr); TORCH_CHECK(ptr);
@ -95,27 +84,23 @@ struct ThreadSHMContext {
T* get_thread_shm_ptr(int rank) { T* get_thread_shm_ptr(int rank) {
return reinterpret_cast<T*>( return reinterpret_cast<T*>(
reinterpret_cast<int8_t*>(thread_shm_ptrs[rank]) + reinterpret_cast<int8_t*>(thread_shm_ptrs[rank]) +
(PER_THREAD_SHM_BUFFER_OFFSET & (PER_THREAD_SHM_BUFFER_OFFSET & _thread_buffer_mask));
_thread_buffer_mask[local_stamp_buffer_idx]));
} }
void next_buffer() { void next_buffer() { _thread_buffer_mask ^= 0xFFFFFFFFFFFFFFFF; }
_thread_buffer_mask[local_stamp_buffer_idx] ^= 0xFFFFFFFFFFFFFFFF;
}
char get_curr_stamp(int idx) const { return _curr_thread_stamp[idx]; } char get_curr_stamp() const { return _curr_thread_stamp; }
char get_ready_stamp(int idx) const { return _ready_thread_stamp[idx]; } char get_ready_stamp() const { return _ready_thread_stamp; }
void next_stamp() { void next_stamp() {
_mm_mfence(); _mm_mfence();
_curr_thread_stamp[local_stamp_buffer_idx] += 1; _curr_thread_stamp += 1;
} }
void commit_ready_stamp() { void commit_ready_stamp() {
_mm_mfence(); _mm_mfence();
_ready_thread_stamp[local_stamp_buffer_idx] = _ready_thread_stamp = _curr_thread_stamp;
_curr_thread_stamp[local_stamp_buffer_idx];
} }
int get_swizzled_rank(int idx) { return swizzled_ranks[idx]; } int get_swizzled_rank(int idx) { return swizzled_ranks[idx]; }
@ -132,11 +117,10 @@ struct ThreadSHMContext {
void wait_for_one(int rank, Cond&& cond) { void wait_for_one(int rank, Cond&& cond) {
ThreadSHMContext* rank_ctx = shm_contexts[rank]; ThreadSHMContext* rank_ctx = shm_contexts[rank];
for (;;) { for (;;) {
char local_curr_stamp = get_curr_stamp(local_stamp_buffer_idx); char local_curr_stamp = get_curr_stamp();
char local_ready_stamp = get_ready_stamp(local_stamp_buffer_idx); char local_ready_stamp = get_ready_stamp();
char rank_curr_stamp = rank_ctx->get_curr_stamp(remote_stamp_buffer_idx); char rank_curr_stamp = rank_ctx->get_curr_stamp();
char rank_ready_stamp = char rank_ready_stamp = rank_ctx->get_ready_stamp();
rank_ctx->get_ready_stamp(remote_stamp_buffer_idx);
if (cond(local_curr_stamp, local_ready_stamp, rank_curr_stamp, if (cond(local_curr_stamp, local_ready_stamp, rank_curr_stamp,
rank_ready_stamp)) { rank_ready_stamp)) {
break; break;
@ -377,15 +361,6 @@ void shm_cc_loop(ThreadSHMContext* ctx, int64_t elem_num, F&& inner_func) {
} }
} }
} }
void reset_threads_stamp_buffer_idx(ThreadSHMContext* ctx, int local,
int remote) {
int thread_num = ctx->thread_num;
for (int i = 0; i < thread_num; ++i) {
ThreadSHMContext* thread_ctx = ctx + i;
thread_ctx->set_stamp_buffer_idx(local, remote);
}
}
}; // namespace shm_cc_ops }; // namespace shm_cc_ops
namespace shm_cc_ops { namespace shm_cc_ops {
@ -657,7 +632,6 @@ void shm_send_tensor_list_impl(ThreadSHMContext* ctx, int64_t dst,
TensorListMeta* metadata = new (metadata_tensor.data_ptr()) TensorListMeta(); TensorListMeta* metadata = new (metadata_tensor.data_ptr()) TensorListMeta();
metadata->bind_tensor_list(tensor_list_with_metadata); metadata->bind_tensor_list(tensor_list_with_metadata);
shm_cc_ops::reset_threads_stamp_buffer_idx(ctx, 0, 1);
shm_cc_ops::shm_cc_loop<int8_t>( shm_cc_ops::shm_cc_loop<int8_t>(
ctx, metadata->total_bytes, ctx, metadata->total_bytes,
[&](ThreadSHMContext* thread_ctx, int64_t data_offset, [&](ThreadSHMContext* thread_ctx, int64_t data_offset,
@ -685,7 +659,6 @@ std::vector<torch::Tensor> shm_recv_tensor_list_impl(ThreadSHMContext* ctx,
torch::Tensor metadata_tensor = torch::Tensor metadata_tensor =
torch::empty({sizeof(TensorListMeta)}, options); torch::empty({sizeof(TensorListMeta)}, options);
shm_cc_ops::reset_threads_stamp_buffer_idx(ctx, 1, 0);
ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready); ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready);
shm_cc_ops::memcpy(metadata_tensor.data_ptr(), shm_cc_ops::memcpy(metadata_tensor.data_ptr(),
ctx->get_thread_shm_ptr<void>(src), ctx->get_thread_shm_ptr<void>(src),
@ -704,7 +677,7 @@ std::vector<torch::Tensor> shm_recv_tensor_list_impl(ThreadSHMContext* ctx,
ctx, metadata.total_bytes, ctx, metadata.total_bytes,
[&](ThreadSHMContext* thread_ctx, int64_t data_offset, [&](ThreadSHMContext* thread_ctx, int64_t data_offset,
int64_t data_elem_num, bool fast_mode) { int64_t data_elem_num, bool fast_mode) {
thread_ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready); ctx->wait_for_one(src, ThreadSHMContext::check_stamp_ready);
int64_t curr_shm_offset = 0; int64_t curr_shm_offset = 0;
while (curr_shm_offset < data_elem_num) { while (curr_shm_offset < data_elem_num) {
MemPiece frag = metadata.get_data(data_offset + curr_shm_offset); MemPiece frag = metadata.get_data(data_offset + curr_shm_offset);
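The send/receive paths in this file synchronize through per-thread stamps plus a per-thread buffer whose active half is selected by a mask that flips after each exchange (one side of the hunk keeps a mask per direction, the other a single mask). A minimal sketch of that ping-pong selection, assuming a toy single-process context with a made-up buffer size:

```cpp
#include <cstddef>
#include <cstdio>

// Toy version of the SHM context's buffer flip: the active half is
// base + (offset & mask), and next_buffer() flips the mask so the next
// exchange uses the other half. Illustrative only.
constexpr std::size_t kBufferBytes = 1024;  // made-up size
constexpr std::size_t kBufferOffset = kBufferBytes >> 1;

struct ToyShmContext {
  char* base = nullptr;
  std::size_t buffer_mask = 0;  // 0 -> first half, ~0 -> second half

  char* current_half() { return base + (kBufferOffset & buffer_mask); }
  void next_buffer() { buffer_mask ^= ~static_cast<std::size_t>(0); }
};

int main() {
  static char storage[kBufferBytes];
  ToyShmContext ctx{storage};
  std::printf("offset before flip: %zu\n", static_cast<std::size_t>(ctx.current_half() - ctx.base));
  ctx.next_buffer();
  std::printf("offset after flip:  %zu\n", static_cast<std::size_t>(ctx.current_half() - ctx.base));
}
```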

View File

@ -151,7 +151,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("rotary_embedding", torch::kCPU, &rotary_embedding); ops.impl("rotary_embedding", torch::kCPU, &rotary_embedding);
// Quantization // Quantization
#if defined(__AVX512F__) || (defined(__aarch64__) && !defined(__APPLE__)) #if defined(__AVX512F__) || defined(__aarch64__)
at::Tag stride_tag = at::Tag::needs_fixed_stride_order; at::Tag stride_tag = at::Tag::needs_fixed_stride_order;
// Compute int8 quantized tensor for given scaling factor. // Compute int8 quantized tensor for given scaling factor.

View File

@ -4,37 +4,10 @@
#include <hip/hip_runtime.h> #include <hip/hip_runtime.h>
#endif #endif
#ifdef USE_ROCM #ifndef USE_ROCM
struct Utils {
static __host__ int get_warp_size() {
static bool is_cached = false;
static int result;
if (!is_cached) {
int device_id;
cudaDeviceProp deviceProp;
cudaGetDevice(&device_id);
cudaGetDeviceProperties(&deviceProp, device_id);
result = deviceProp.warpSize;
is_cached = true;
}
return result;
}
static __device__ constexpr int get_warp_size() {
#ifdef __GFX9__
return 64;
#else
return 32;
#endif
}
};
#define WARP_SIZE Utils::get_warp_size()
#else
#define WARP_SIZE 32 #define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif #endif
#ifndef USE_ROCM #ifndef USE_ROCM
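One side of this hunk replaces the fixed `WARP_SIZE` macro with a host helper that queries the device's warp size once and caches the result. A small sketch of that cache-on-first-use pattern, with the device query stubbed out (the stub value is an assumption so the sketch builds without a GPU toolkit):

```cpp
#include <cstdio>

// Stand-in for reading cudaDeviceProp::warpSize; assumed constant here.
static int query_device_warp_size() { return 64; }

// Cache-on-first-use helper mirroring the Utils::get_warp_size() idea from the
// diff: the potentially expensive device query runs only on the first call.
static int get_warp_size() {
  static const int cached = query_device_warp_size();
  return cached;
}

int main() {
  std::printf("warp size: %d\n", get_warp_size());
  std::printf("warp size again (cached): %d\n", get_warp_size());
}
```

A function-local static is the idiomatic C++ spelling of the explicit `is_cached` flag used in the hunk.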

View File

@ -15,16 +15,15 @@ namespace vllm {
// TODO(woosuk): Further optimize this kernel. // TODO(woosuk): Further optimize this kernel.
template <typename scalar_t> template <typename scalar_t>
__global__ void rms_norm_kernel( __global__ void rms_norm_kernel(
scalar_t* __restrict__ out, // [..., hidden_size] scalar_t* __restrict__ out, // [..., hidden_size]
const scalar_t* __restrict__ input, // [..., hidden_size] const scalar_t* __restrict__ input, // [..., hidden_size]
const int64_t input_stride,
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float epsilon, const int num_tokens, const int hidden_size) { const float epsilon, const int num_tokens, const int hidden_size) {
__shared__ float s_variance; __shared__ float s_variance;
float variance = 0.0f; float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
const float x = (float)input[blockIdx.x * input_stride + idx]; const float x = (float)input[blockIdx.x * hidden_size + idx];
variance += x * x; variance += x * x;
} }
@ -38,7 +37,7 @@ __global__ void rms_norm_kernel(
__syncthreads(); __syncthreads();
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float)input[blockIdx.x * input_stride + idx]; float x = (float)input[blockIdx.x * hidden_size + idx];
out[blockIdx.x * hidden_size + idx] = out[blockIdx.x * hidden_size + idx] =
((scalar_t)(x * s_variance)) * weight[idx]; ((scalar_t)(x * s_variance)) * weight[idx];
} }
@ -51,8 +50,7 @@ __global__ void rms_norm_kernel(
template <typename scalar_t, int width> template <typename scalar_t, int width>
__global__ std::enable_if_t<(width > 0) && _typeConvert<scalar_t>::exists> __global__ std::enable_if_t<(width > 0) && _typeConvert<scalar_t>::exists>
fused_add_rms_norm_kernel( fused_add_rms_norm_kernel(
scalar_t* __restrict__ input, // [..., hidden_size] scalar_t* __restrict__ input, // [..., hidden_size]
const int64_t input_stride,
scalar_t* __restrict__ residual, // [..., hidden_size] scalar_t* __restrict__ residual, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float epsilon, const int num_tokens, const int hidden_size) { const float epsilon, const int num_tokens, const int hidden_size) {
@ -61,7 +59,6 @@ fused_add_rms_norm_kernel(
static_assert(sizeof(_f16Vec<scalar_t, width>) == sizeof(scalar_t) * width); static_assert(sizeof(_f16Vec<scalar_t, width>) == sizeof(scalar_t) * width);
const int vec_hidden_size = hidden_size / width; const int vec_hidden_size = hidden_size / width;
const int64_t vec_input_stride = input_stride / width;
__shared__ float s_variance; __shared__ float s_variance;
float variance = 0.0f; float variance = 0.0f;
/* These and the argument pointers are all declared `restrict` as they are /* These and the argument pointers are all declared `restrict` as they are
@ -76,8 +73,7 @@ fused_add_rms_norm_kernel(
for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) {
int id = blockIdx.x * vec_hidden_size + idx; int id = blockIdx.x * vec_hidden_size + idx;
int64_t strided_id = blockIdx.x * vec_input_stride + idx; _f16Vec<scalar_t, width> temp = input_v[id];
_f16Vec<scalar_t, width> temp = input_v[strided_id];
temp += residual_v[id]; temp += residual_v[id];
variance += temp.sum_squares(); variance += temp.sum_squares();
residual_v[id] = temp; residual_v[id] = temp;
@ -94,11 +90,10 @@ fused_add_rms_norm_kernel(
for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) {
int id = blockIdx.x * vec_hidden_size + idx; int id = blockIdx.x * vec_hidden_size + idx;
int64_t strided_id = blockIdx.x * vec_input_stride + idx;
_f16Vec<scalar_t, width> temp = residual_v[id]; _f16Vec<scalar_t, width> temp = residual_v[id];
temp *= s_variance; temp *= s_variance;
temp *= weight_v[idx]; temp *= weight_v[idx];
input_v[strided_id] = temp; input_v[id] = temp;
} }
} }
@ -108,8 +103,7 @@ fused_add_rms_norm_kernel(
template <typename scalar_t, int width> template <typename scalar_t, int width>
__global__ std::enable_if_t<(width == 0) || !_typeConvert<scalar_t>::exists> __global__ std::enable_if_t<(width == 0) || !_typeConvert<scalar_t>::exists>
fused_add_rms_norm_kernel( fused_add_rms_norm_kernel(
scalar_t* __restrict__ input, // [..., hidden_size] scalar_t* __restrict__ input, // [..., hidden_size]
const int64_t input_stride,
scalar_t* __restrict__ residual, // [..., hidden_size] scalar_t* __restrict__ residual, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float epsilon, const int num_tokens, const int hidden_size) { const float epsilon, const int num_tokens, const int hidden_size) {
@ -117,7 +111,7 @@ fused_add_rms_norm_kernel(
float variance = 0.0f; float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
scalar_t z = input[blockIdx.x * input_stride + idx]; scalar_t z = input[blockIdx.x * hidden_size + idx];
z += residual[blockIdx.x * hidden_size + idx]; z += residual[blockIdx.x * hidden_size + idx];
float x = (float)z; float x = (float)z;
variance += x * x; variance += x * x;
@ -135,7 +129,7 @@ fused_add_rms_norm_kernel(
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float)residual[blockIdx.x * hidden_size + idx]; float x = (float)residual[blockIdx.x * hidden_size + idx];
input[blockIdx.x * input_stride + idx] = input[blockIdx.x * hidden_size + idx] =
((scalar_t)(x * s_variance)) * weight[idx]; ((scalar_t)(x * s_variance)) * weight[idx];
} }
} }
@ -147,12 +141,11 @@ void rms_norm(torch::Tensor& out, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size] torch::Tensor& weight, // [hidden_size]
double epsilon) { double epsilon) {
TORCH_CHECK(out.is_contiguous()); TORCH_CHECK(out.is_contiguous());
TORCH_CHECK(input.stride(-1) == 1); TORCH_CHECK(input.is_contiguous());
TORCH_CHECK(weight.is_contiguous()); TORCH_CHECK(weight.is_contiguous());
int hidden_size = input.size(-1); int hidden_size = input.size(-1);
int num_tokens = input.numel() / hidden_size; int num_tokens = input.numel() / hidden_size;
int64_t input_stride = input.stride(-2);
dim3 grid(num_tokens); dim3 grid(num_tokens);
dim3 block(std::min(hidden_size, 1024)); dim3 block(std::min(hidden_size, 1024));
@ -160,29 +153,26 @@ void rms_norm(torch::Tensor& out, // [..., hidden_size]
const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] { VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] {
vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>( vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), input_stride, out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),
weight.data_ptr<scalar_t>(), epsilon, num_tokens, hidden_size); weight.data_ptr<scalar_t>(), epsilon, num_tokens, hidden_size);
}); });
} }
#define LAUNCH_FUSED_ADD_RMS_NORM(width) \ #define LAUNCH_FUSED_ADD_RMS_NORM(width) \
VLLM_DISPATCH_FLOATING_TYPES( \ VLLM_DISPATCH_FLOATING_TYPES( \
input.scalar_type(), "fused_add_rms_norm_kernel", [&] { \ input.scalar_type(), "fused_add_rms_norm_kernel", [&] { \
vllm::fused_add_rms_norm_kernel<scalar_t, width> \ vllm::fused_add_rms_norm_kernel<scalar_t, width> \
<<<grid, block, 0, stream>>>( \ <<<grid, block, 0, stream>>>(input.data_ptr<scalar_t>(), \
input.data_ptr<scalar_t>(), input_stride, \ residual.data_ptr<scalar_t>(), \
residual.data_ptr<scalar_t>(), weight.data_ptr<scalar_t>(), \ weight.data_ptr<scalar_t>(), epsilon, \
epsilon, num_tokens, hidden_size); \ num_tokens, hidden_size); \
}); });
void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size] void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
torch::Tensor& residual, // [..., hidden_size] torch::Tensor& residual, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size] torch::Tensor& weight, // [hidden_size]
double epsilon) { double epsilon) {
TORCH_CHECK(residual.is_contiguous());
TORCH_CHECK(weight.is_contiguous());
int hidden_size = input.size(-1); int hidden_size = input.size(-1);
int64_t input_stride = input.stride(-2);
int num_tokens = input.numel() / hidden_size; int num_tokens = input.numel() / hidden_size;
dim3 grid(num_tokens); dim3 grid(num_tokens);
@ -204,16 +194,9 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
auto inp_ptr = reinterpret_cast<std::uintptr_t>(input.data_ptr()); auto inp_ptr = reinterpret_cast<std::uintptr_t>(input.data_ptr());
auto res_ptr = reinterpret_cast<std::uintptr_t>(residual.data_ptr()); auto res_ptr = reinterpret_cast<std::uintptr_t>(residual.data_ptr());
auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr()); auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr());
constexpr int vector_width = 8; bool ptrs_are_aligned =
constexpr int req_alignment_bytes = inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0;
vector_width * 2; // vector_width * sizeof(bfloat16 or float16) (float32 if (ptrs_are_aligned && hidden_size % 8 == 0) {
// falls back to non-vectorized version anyway)
bool ptrs_are_aligned = inp_ptr % req_alignment_bytes == 0 &&
res_ptr % req_alignment_bytes == 0 &&
wt_ptr % req_alignment_bytes == 0;
bool offsets_are_multiple_of_vector_width =
hidden_size % vector_width == 0 && input_stride % vector_width == 0;
if (ptrs_are_aligned && offsets_are_multiple_of_vector_width) {
LAUNCH_FUSED_ADD_RMS_NORM(8); LAUNCH_FUSED_ADD_RMS_NORM(8);
} else { } else {
LAUNCH_FUSED_ADD_RMS_NORM(0); LAUNCH_FUSED_ADD_RMS_NORM(0);
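The layernorm hunks thread an `input_stride` through the kernels so each row of a non-contiguous input is addressed as `row * stride + col` rather than `row * hidden_size + col`, and the vectorized path additionally requires the stride to be a multiple of the vector width. A single-threaded C++ reference of RMS norm with an explicit row stride (sizes and epsilon are illustrative):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Reference RMS norm over rows of a possibly padded matrix: each row starts at
// input[row * input_stride], while the output is written contiguously.
// Mirrors the indexing change in the diff; illustrative only.
void rms_norm_ref(float* out, const float* input, const float* weight,
                  int num_tokens, int hidden_size, long input_stride,
                  float epsilon) {
  for (int row = 0; row < num_tokens; ++row) {
    const float* in_row = input + row * input_stride;
    float variance = 0.0f;
    for (int i = 0; i < hidden_size; ++i) variance += in_row[i] * in_row[i];
    const float inv_rms = 1.0f / std::sqrt(variance / hidden_size + epsilon);
    for (int i = 0; i < hidden_size; ++i)
      out[row * hidden_size + i] = in_row[i] * inv_rms * weight[i];
  }
}

int main() {
  const int tokens = 2, hidden = 4;
  const long stride = 6;  // padded rows: stride > hidden_size
  std::vector<float> in(tokens * stride, 1.0f), w(hidden, 1.0f), out(tokens * hidden);
  rms_norm_ref(out.data(), in.data(), w.data(), tokens, hidden, stride, 1e-6f);
  std::printf("out[0] = %f\n", out[0]);
}
```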

View File

@ -23,9 +23,8 @@ namespace vllm {
// TODO(woosuk): Further optimize this kernel. // TODO(woosuk): Further optimize this kernel.
template <typename scalar_t, typename fp8_type> template <typename scalar_t, typename fp8_type>
__global__ void rms_norm_static_fp8_quant_kernel( __global__ void rms_norm_static_fp8_quant_kernel(
fp8_type* __restrict__ out, // [..., hidden_size] fp8_type* __restrict__ out, // [..., hidden_size]
const scalar_t* __restrict__ input, // [..., hidden_size] const scalar_t* __restrict__ input, // [..., hidden_size]
const int input_stride,
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float* __restrict__ scale, // [1] const float* __restrict__ scale, // [1]
const float epsilon, const int num_tokens, const int hidden_size) { const float epsilon, const int num_tokens, const int hidden_size) {
@ -33,7 +32,7 @@ __global__ void rms_norm_static_fp8_quant_kernel(
float variance = 0.0f; float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
const float x = (float)input[blockIdx.x * input_stride + idx]; const float x = (float)input[blockIdx.x * hidden_size + idx];
variance += x * x; variance += x * x;
} }
@ -50,7 +49,7 @@ __global__ void rms_norm_static_fp8_quant_kernel(
float const scale_inv = 1.0f / *scale; float const scale_inv = 1.0f / *scale;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
float x = (float)input[blockIdx.x * input_stride + idx]; float x = (float)input[blockIdx.x * hidden_size + idx];
float const out_norm = ((scalar_t)(x * s_variance)) * weight[idx]; float const out_norm = ((scalar_t)(x * s_variance)) * weight[idx];
out[blockIdx.x * hidden_size + idx] = out[blockIdx.x * hidden_size + idx] =
scaled_fp8_conversion<true, fp8_type>(out_norm, scale_inv); scaled_fp8_conversion<true, fp8_type>(out_norm, scale_inv);
@ -64,9 +63,8 @@ __global__ void rms_norm_static_fp8_quant_kernel(
template <typename scalar_t, int width, typename fp8_type> template <typename scalar_t, int width, typename fp8_type>
__global__ std::enable_if_t<(width > 0) && _typeConvert<scalar_t>::exists> __global__ std::enable_if_t<(width > 0) && _typeConvert<scalar_t>::exists>
fused_add_rms_norm_static_fp8_quant_kernel( fused_add_rms_norm_static_fp8_quant_kernel(
fp8_type* __restrict__ out, // [..., hidden_size] fp8_type* __restrict__ out, // [..., hidden_size]
scalar_t* __restrict__ input, // [..., hidden_size] scalar_t* __restrict__ input, // [..., hidden_size]
const int input_stride,
scalar_t* __restrict__ residual, // [..., hidden_size] scalar_t* __restrict__ residual, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float* __restrict__ scale, // [1] const float* __restrict__ scale, // [1]
@ -76,7 +74,6 @@ fused_add_rms_norm_static_fp8_quant_kernel(
static_assert(sizeof(_f16Vec<scalar_t, width>) == sizeof(scalar_t) * width); static_assert(sizeof(_f16Vec<scalar_t, width>) == sizeof(scalar_t) * width);
const int vec_hidden_size = hidden_size / width; const int vec_hidden_size = hidden_size / width;
const int vec_input_stride = input_stride / width;
__shared__ float s_variance; __shared__ float s_variance;
float variance = 0.0f; float variance = 0.0f;
/* These and the argument pointers are all declared `restrict` as they are /* These and the argument pointers are all declared `restrict` as they are
@ -90,9 +87,8 @@ fused_add_rms_norm_static_fp8_quant_kernel(
reinterpret_cast<const _f16Vec<scalar_t, width>*>(weight); reinterpret_cast<const _f16Vec<scalar_t, width>*>(weight);
for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) {
int stride_id = blockIdx.x * vec_input_stride + idx;
int id = blockIdx.x * vec_hidden_size + idx; int id = blockIdx.x * vec_hidden_size + idx;
_f16Vec<scalar_t, width> temp = input_v[stride_id]; _f16Vec<scalar_t, width> temp = input_v[id];
temp += residual_v[id]; temp += residual_v[id];
variance += temp.sum_squares(); variance += temp.sum_squares();
residual_v[id] = temp; residual_v[id] = temp;
@ -129,9 +125,8 @@ fused_add_rms_norm_static_fp8_quant_kernel(
template <typename scalar_t, int width, typename fp8_type> template <typename scalar_t, int width, typename fp8_type>
__global__ std::enable_if_t<(width == 0) || !_typeConvert<scalar_t>::exists> __global__ std::enable_if_t<(width == 0) || !_typeConvert<scalar_t>::exists>
fused_add_rms_norm_static_fp8_quant_kernel( fused_add_rms_norm_static_fp8_quant_kernel(
fp8_type* __restrict__ out, // [..., hidden_size] fp8_type* __restrict__ out, // [..., hidden_size]
scalar_t* __restrict__ input, // [..., hidden_size] scalar_t* __restrict__ input, // [..., hidden_size]
const int input_stride,
scalar_t* __restrict__ residual, // [..., hidden_size] scalar_t* __restrict__ residual, // [..., hidden_size]
const scalar_t* __restrict__ weight, // [hidden_size] const scalar_t* __restrict__ weight, // [hidden_size]
const float* __restrict__ scale, // [1] const float* __restrict__ scale, // [1]
@ -140,7 +135,7 @@ fused_add_rms_norm_static_fp8_quant_kernel(
float variance = 0.0f; float variance = 0.0f;
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) { for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
scalar_t z = input[blockIdx.x * input_stride + idx]; scalar_t z = input[blockIdx.x * hidden_size + idx];
z += residual[blockIdx.x * hidden_size + idx]; z += residual[blockIdx.x * hidden_size + idx];
float x = (float)z; float x = (float)z;
variance += x * x; variance += x * x;
@ -174,9 +169,7 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., hidden_size]
torch::Tensor& weight, // [hidden_size] torch::Tensor& weight, // [hidden_size]
torch::Tensor& scale, // [1] torch::Tensor& scale, // [1]
double epsilon) { double epsilon) {
TORCH_CHECK(out.is_contiguous());
int hidden_size = input.size(-1); int hidden_size = input.size(-1);
int input_stride = input.stride(-2);
int num_tokens = input.numel() / hidden_size; int num_tokens = input.numel() / hidden_size;
dim3 grid(num_tokens); dim3 grid(num_tokens);
@ -190,9 +183,8 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., hidden_size]
vllm::rms_norm_static_fp8_quant_kernel<scalar_t, fp8_t> vllm::rms_norm_static_fp8_quant_kernel<scalar_t, fp8_t>
<<<grid, block, 0, stream>>>( <<<grid, block, 0, stream>>>(
out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(), out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(),
input_stride, weight.data_ptr<scalar_t>(), weight.data_ptr<scalar_t>(), scale.data_ptr<float>(),
scale.data_ptr<float>(), epsilon, num_tokens, epsilon, num_tokens, hidden_size);
hidden_size);
}); });
}); });
} }
@ -206,7 +198,7 @@ void rms_norm_static_fp8_quant(torch::Tensor& out, // [..., hidden_size]
width, fp8_t> \ width, fp8_t> \
<<<grid, block, 0, stream>>>( \ <<<grid, block, 0, stream>>>( \
out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(), \ out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(), \
input_stride, residual.data_ptr<scalar_t>(), \ residual.data_ptr<scalar_t>(), \
weight.data_ptr<scalar_t>(), scale.data_ptr<float>(), \ weight.data_ptr<scalar_t>(), scale.data_ptr<float>(), \
epsilon, num_tokens, hidden_size); \ epsilon, num_tokens, hidden_size); \
}); \ }); \
@ -218,10 +210,7 @@ void fused_add_rms_norm_static_fp8_quant(
torch::Tensor& weight, // [hidden_size] torch::Tensor& weight, // [hidden_size]
torch::Tensor& scale, // [1] torch::Tensor& scale, // [1]
double epsilon) { double epsilon) {
TORCH_CHECK(out.is_contiguous());
TORCH_CHECK(residual.is_contiguous());
int hidden_size = input.size(-1); int hidden_size = input.size(-1);
int input_stride = input.stride(-2);
int num_tokens = input.numel() / hidden_size; int num_tokens = input.numel() / hidden_size;
dim3 grid(num_tokens); dim3 grid(num_tokens);
@ -245,7 +234,7 @@ void fused_add_rms_norm_static_fp8_quant(
auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr()); auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr());
bool ptrs_are_aligned = bool ptrs_are_aligned =
inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0; inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0;
if (ptrs_are_aligned && hidden_size % 8 == 0 && input_stride % 8 == 0) { if (ptrs_are_aligned && hidden_size % 8 == 0) {
LAUNCH_FUSED_ADD_RMS_NORM(8); LAUNCH_FUSED_ADD_RMS_NORM(8);
} else { } else {
LAUNCH_FUSED_ADD_RMS_NORM(0); LAUNCH_FUSED_ADD_RMS_NORM(0);
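The quantized variant applies the same normalization and then converts with a precomputed reciprocal of a static scale. A rough scalar sketch of that final scaled conversion step, clamping to the E4M3 range before the narrowing cast (the 448 bound and the skipped rounding details are assumptions for illustration):

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative scalar stand-in for a static-scale FP8 conversion: multiply by
// the precomputed reciprocal of the scale and clamp to the representable
// range. The actual narrowing cast to the fp8 storage type is stubbed out.
float scaled_fp8_conversion_ref(float x, float scale_inv) {
  const float fp8_max = 448.0f;  // assumed E4M3FN max magnitude
  float r = x * scale_inv;
  r = std::min(std::max(r, -fp8_max), fp8_max);
  return r;  // a real kernel would now cast to the fp8 type
}

int main() {
  const float scale = 0.25f;
  const float scale_inv = 1.0f / scale;
  std::printf("%f -> %f\n", 2.0f, scaled_fp8_conversion_ref(2.0f, scale_inv));
}
```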

View File

@ -1,7 +1,6 @@
#include <torch/all.h> #include <torch/all.h>
#include <ATen/cuda/CUDAContext.h> #include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h> #include <c10/cuda/CUDAGuard.h>
#include <cub/cub.cuh>
#include <ATen/ATen.h> #include <ATen/ATen.h>
#include <ATen/cuda/Atomic.cuh> #include <ATen/cuda/Atomic.cuh>
@ -20,14 +19,9 @@ __global__ void moe_align_block_size_kernel(
int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids, int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids,
int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts, int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts,
int32_t padded_num_experts, int32_t experts_per_warp, int32_t block_size, int32_t padded_num_experts, int32_t experts_per_warp, int32_t block_size,
size_t numel, int32_t* __restrict__ cumsum, int32_t max_num_tokens_padded) { size_t numel, int32_t* __restrict__ cumsum) {
extern __shared__ int32_t shared_counts[]; extern __shared__ int32_t shared_counts[];
// Initialize sorted_token_ids with numel
for (size_t it = threadIdx.x; it < max_num_tokens_padded; it += blockDim.x) {
sorted_token_ids[it] = numel;
}
const int warp_id = threadIdx.x / WARP_SIZE; const int warp_id = threadIdx.x / WARP_SIZE;
const int my_expert_start = warp_id * experts_per_warp; const int my_expert_start = warp_id * experts_per_warp;
@ -51,27 +45,18 @@ __global__ void moe_align_block_size_kernel(
__syncthreads(); __syncthreads();
// Compute prefix sum over token counts per expert if (threadIdx.x == 0) {
using BlockScan = cub::BlockScan<int32_t, 1024>; cumsum[0] = 0;
__shared__ typename BlockScan::TempStorage temp_storage; for (int i = 1; i <= num_experts; ++i) {
int expert_count = 0;
int warp_idx = (i - 1) / experts_per_warp;
int expert_offset = (i - 1) % experts_per_warp;
expert_count = shared_counts[warp_idx * experts_per_warp + expert_offset];
int expert_count = 0; cumsum[i] =
int expert_id = threadIdx.x; cumsum[i - 1] + CEILDIV(expert_count, block_size) * block_size;
if (expert_id < num_experts) { }
int warp_idx = expert_id / experts_per_warp; *total_tokens_post_pad = cumsum[num_experts];
int expert_offset = expert_id % experts_per_warp;
expert_count = shared_counts[warp_idx * experts_per_warp + expert_offset];
expert_count = CEILDIV(expert_count, block_size) * block_size;
}
int cumsum_val;
BlockScan(temp_storage).ExclusiveSum(expert_count, cumsum_val);
if (expert_id <= num_experts) {
cumsum[expert_id] = cumsum_val;
}
if (expert_id == num_experts) {
*total_tokens_post_pad = cumsum_val;
} }
__syncthreads(); __syncthreads();
@ -82,13 +67,6 @@ __global__ void moe_align_block_size_kernel(
expert_ids[i / block_size] = threadIdx.x; expert_ids[i / block_size] = threadIdx.x;
} }
} }
// Fill remaining expert_ids with 0
const size_t fill_start_idx = cumsum[num_experts] / block_size + threadIdx.x;
const size_t expert_ids_size = CEILDIV(max_num_tokens_padded, block_size);
for (size_t i = fill_start_idx; i < expert_ids_size; i += blockDim.x) {
expert_ids[i] = 0;
}
} }
template <typename scalar_t> template <typename scalar_t>
@ -127,12 +105,7 @@ __global__ void moe_align_block_size_small_batch_expert_kernel(
const scalar_t* __restrict__ topk_ids, const scalar_t* __restrict__ topk_ids,
int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids, int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids,
int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts, int32_t* __restrict__ total_tokens_post_pad, int32_t num_experts,
int32_t block_size, size_t numel, int32_t max_num_tokens_padded) { int32_t block_size, size_t numel) {
// Initialize sorted_token_ids with numel
for (size_t it = threadIdx.x; it < max_num_tokens_padded; it += blockDim.x) {
sorted_token_ids[it] = numel;
}
const size_t tid = threadIdx.x; const size_t tid = threadIdx.x;
const size_t stride = blockDim.x; const size_t stride = blockDim.x;
@ -180,13 +153,6 @@ __global__ void moe_align_block_size_small_batch_expert_kernel(
} }
} }
// Fill remaining expert_ids with 0
const size_t fill_start_idx = cumsum[num_experts] / block_size + threadIdx.x;
const size_t expert_ids_size = CEILDIV(max_num_tokens_padded, block_size);
for (size_t i = fill_start_idx; i < expert_ids_size; i += blockDim.x) {
expert_ids[i] = 0;
}
for (size_t i = tid; i < numel; i += stride) { for (size_t i = tid; i < numel; i += stride) {
int32_t expert_id = topk_ids[i]; int32_t expert_id = topk_ids[i];
int32_t rank_post_pad = int32_t rank_post_pad =
@ -213,17 +179,13 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
int threads = 1024; int threads = 1024;
threads = ((threads + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE; threads = ((threads + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;
// BlockScan uses 1024 threads and assigns one thread per expert.
TORCH_CHECK(padded_num_experts < 1024,
"padded_num_experts must be less than 1024");
VLLM_DISPATCH_INTEGRAL_AND_UNSIGNED_TYPES( VLLM_DISPATCH_INTEGRAL_AND_UNSIGNED_TYPES(
topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] { topk_ids.scalar_type(), "moe_align_block_size_kernel", [&] {
// calc needed amount of shared mem for `cumsum` tensors // calc needed amount of shared mem for `cumsum` tensors
auto options_int = auto options_int =
torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device()); torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device());
torch::Tensor cumsum_buffer = torch::Tensor cumsum_buffer =
torch::empty({num_experts + 1}, options_int); torch::zeros({num_experts + 1}, options_int);
bool small_batch_expert_mode = bool small_batch_expert_mode =
(topk_ids.numel() < 1024) && (num_experts <= 64); (topk_ids.numel() < 1024) && (num_experts <= 64);
@ -241,7 +203,7 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
sorted_token_ids.data_ptr<int32_t>(), sorted_token_ids.data_ptr<int32_t>(),
experts_ids.data_ptr<int32_t>(), experts_ids.data_ptr<int32_t>(),
num_tokens_post_pad.data_ptr<int32_t>(), num_experts, block_size, num_tokens_post_pad.data_ptr<int32_t>(), num_experts, block_size,
topk_ids.numel(), sorted_token_ids.size(0)); topk_ids.numel());
} else { } else {
auto align_kernel = vllm::moe::moe_align_block_size_kernel<scalar_t>; auto align_kernel = vllm::moe::moe_align_block_size_kernel<scalar_t>;
@ -255,8 +217,7 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
experts_ids.data_ptr<int32_t>(), experts_ids.data_ptr<int32_t>(),
num_tokens_post_pad.data_ptr<int32_t>(), num_experts, num_tokens_post_pad.data_ptr<int32_t>(), num_experts,
padded_num_experts, experts_per_warp, block_size, padded_num_experts, experts_per_warp, block_size,
topk_ids.numel(), cumsum_buffer.data_ptr<int32_t>(), topk_ids.numel(), cumsum_buffer.data_ptr<int32_t>());
sorted_token_ids.size(0));
const int block_threads = std::min(256, (int)threads); const int block_threads = std::min(256, (int)threads);
const int num_blocks = const int num_blocks =
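Both sides of this hunk compute the same thing: a prefix sum over per-expert token counts, with each count first rounded up to a multiple of `block_size` (serially in thread 0 on one side, via `cub::BlockScan` on the other). A host-side reference of that padding plus prefix sum, with made-up counts:

```cpp
#include <cstdio>
#include <vector>

// Reference for the padded-count prefix sum in moe_align_block_size: each
// expert's count is rounded up to a multiple of block_size, and cumsum[e]
// gives the first padded slot owned by expert e. Illustrative only.
int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main() {
  const int block_size = 4;
  const std::vector<int> tokens_per_expert = {3, 0, 9, 5};  // made-up counts

  std::vector<int> cumsum(tokens_per_expert.size() + 1, 0);
  for (std::size_t e = 1; e <= tokens_per_expert.size(); ++e) {
    const int padded = ceil_div(tokens_per_expert[e - 1], block_size) * block_size;
    cumsum[e] = cumsum[e - 1] + padded;
  }
  const int total_tokens_post_pad = cumsum.back();

  for (std::size_t e = 0; e < tokens_per_expert.size(); ++e)
    std::printf("expert %zu starts at padded slot %d\n", e, cumsum[e]);
  std::printf("total padded tokens: %d\n", total_tokens_post_pad);
}
```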

View File

@ -10,28 +10,32 @@
void moe_permute( void moe_permute(
const torch::Tensor& input, // [n_token, hidden] const torch::Tensor& input, // [n_token, hidden]
const torch::Tensor& topk_ids, // [n_token, topk] const torch::Tensor& topk_weights, //[n_token, topk]
torch::Tensor& topk_ids, // [n_token, topk]
const torch::Tensor& token_expert_indices, // [n_token, topk] const torch::Tensor& token_expert_indices, // [n_token, topk]
const std::optional<torch::Tensor>& expert_map, // [n_expert] const std::optional<torch::Tensor>& expert_map, // [n_expert]
int64_t n_expert, int64_t n_local_expert, int64_t topk, int64_t n_expert, int64_t n_local_expert, int64_t topk,
const std::optional<int64_t>& align_block_size, const std::optional<int64_t>& align_block_size,
torch::Tensor& permuted_input, // [permuted_size, hidden] torch::Tensor&
permuted_input, // [topk * n_token/align_block_size_m, hidden]
torch::Tensor& expert_first_token_offset, // [n_local_expert + 1] torch::Tensor& expert_first_token_offset, // [n_local_expert + 1]
torch::Tensor& inv_permuted_idx, // [n_token, topk] torch::Tensor& src_row_id2dst_row_id_map, // [n_token, topk]
torch::Tensor& permuted_idx, // [permute_size]
torch::Tensor& m_indices) { // [align_expand_m] torch::Tensor& m_indices) { // [align_expand_m]
TORCH_CHECK(topk_weights.scalar_type() == at::ScalarType::Float,
"topk_weights must be float32");
TORCH_CHECK(expert_first_token_offset.scalar_type() == at::ScalarType::Long, TORCH_CHECK(expert_first_token_offset.scalar_type() == at::ScalarType::Long,
"expert_first_token_offset must be int64"); "expert_first_token_offset must be int64");
TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int, TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
"topk_ids must be int32"); "topk_ids must be int32");
TORCH_CHECK(token_expert_indices.scalar_type() == at::ScalarType::Int, TORCH_CHECK(token_expert_indices.scalar_type() == at::ScalarType::Int,
"token_expert_indices must be int32"); "token_expert_indices must be int32");
TORCH_CHECK(inv_permuted_idx.scalar_type() == at::ScalarType::Int, TORCH_CHECK(src_row_id2dst_row_id_map.scalar_type() == at::ScalarType::Int,
"inv_permuted_idx must be int32"); "src_row_id2dst_row_id_map must be int32");
TORCH_CHECK(expert_first_token_offset.size(0) == n_local_expert + 1, TORCH_CHECK(expert_first_token_offset.size(0) == n_local_expert + 1,
"expert_first_token_offset shape != n_local_expert+1") "expert_first_token_offset shape != n_local_expert+1")
TORCH_CHECK(inv_permuted_idx.sizes() == token_expert_indices.sizes(), TORCH_CHECK(
"token_expert_indices shape must be same as inv_permuted_idx"); src_row_id2dst_row_id_map.sizes() == token_expert_indices.sizes(),
"token_expert_indices shape must be same as src_row_id2dst_row_id_map");
auto n_token = input.sizes()[0]; auto n_token = input.sizes()[0];
auto n_hidden = input.sizes()[1]; auto n_hidden = input.sizes()[1];
auto align_block_size_value = auto align_block_size_value =
@ -42,9 +46,8 @@ void moe_permute(
auto sort_workspace = torch::empty( auto sort_workspace = torch::empty(
{sorter_size}, {sorter_size},
torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false)); torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
auto copy_topk_ids = topk_ids.clone(); // copy topk_ids for preprocess
auto permuted_experts_id = torch::empty_like(topk_ids); auto permuted_experts_id = torch::empty_like(topk_ids);
auto sorted_row_idx = torch::empty_like(inv_permuted_idx); auto dst_row_id2src_row_id_map = torch::empty_like(src_row_id2dst_row_id_map);
auto align_expert_first_token_offset = auto align_expert_first_token_offset =
torch::zeros_like(expert_first_token_offset); torch::zeros_like(expert_first_token_offset);
@ -64,22 +67,24 @@ void moe_permute(
const int* expert_map_ptr = get_ptr<int>(expert_map.value()); const int* expert_map_ptr = get_ptr<int>(expert_map.value());
valid_num_ptr = valid_num_ptr =
get_ptr<int64_t>(expert_first_token_offset) + n_local_expert; get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
preprocessTopkIdLauncher(get_ptr<int>(copy_topk_ids), n_token * topk, preprocessTopkIdLauncher(get_ptr<int>(topk_ids), n_token * topk,
expert_map_ptr, n_expert, stream); expert_map_ptr, n_expert, stream);
} }
// expert sort topk expert id and scan expert id get expert_first_token_offset // expert sort topk expert id and scan expert id get expert_first_token_offset
sortAndScanExpert( sortAndScanExpert(get_ptr<int>(topk_ids), get_ptr<int>(token_expert_indices),
get_ptr<int>(copy_topk_ids), get_ptr<int>(token_expert_indices), get_ptr<int>(permuted_experts_id),
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx), get_ptr<int>(dst_row_id2src_row_id_map),
get_ptr<int64_t>(expert_first_token_offset), n_token, n_expert, get_ptr<int64_t>(expert_first_token_offset), n_token,
n_local_expert, topk, sorter, get_ptr<int>(sort_workspace), stream); n_expert, n_local_expert, topk, sorter,
get_ptr<int>(sort_workspace), stream);
// dispatch expandInputRowsKernelLauncher // dispatch expandInputRowsKernelLauncher
MOE_DISPATCH(input.scalar_type(), [&] { MOE_DISPATCH(input.scalar_type(), [&] {
expandInputRowsKernelLauncher<scalar_t>( expandInputRowsKernelLauncher<scalar_t>(
get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input), get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input),
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx), get_ptr<float>(topk_weights), get_ptr<int>(permuted_experts_id),
get_ptr<int>(inv_permuted_idx), get_ptr<int>(permuted_idx), get_ptr<int>(dst_row_id2src_row_id_map),
get_ptr<int>(src_row_id2dst_row_id_map),
get_ptr<int64_t>(expert_first_token_offset), n_token, valid_num_ptr, get_ptr<int64_t>(expert_first_token_offset), n_token, valid_num_ptr,
n_hidden, topk, n_local_expert, align_block_size_value, stream); n_hidden, topk, n_local_expert, align_block_size_value, stream);
}); });
@ -96,34 +101,32 @@ void moe_permute(
} }
void moe_unpermute( void moe_unpermute(
const torch::Tensor& permuted_hidden_states, // [n_token * topk, hidden] const torch::Tensor& permuted_hidden_states, // [n_token * topk, hidden]
const torch::Tensor& topk_weights, // [n_token, topk] const torch::Tensor& topk_weights, //[n_token, topk]
const torch::Tensor& inv_permuted_idx, // [n_token, topk] const torch::Tensor& topk_ids, // [n_token, topk]
const std::optional<torch::Tensor>& const torch::Tensor& src_row_id2dst_row_id_map, // [n_token, topk]
expert_first_token_offset, // [n_local_expert+1] const torch::Tensor& expert_first_token_offset, // [n_local_expert+1]
int64_t topk, int64_t n_expert, int64_t n_local_expert, int64_t topk,
torch::Tensor& hidden_states // [n_token, hidden] torch::Tensor& hidden_states // [n_token, hidden]
) { ) {
TORCH_CHECK(src_row_id2dst_row_id_map.sizes() == topk_ids.sizes(),
"topk_ids shape must be same as src_row_id2dst_row_id_map");
TORCH_CHECK(topk_ids.scalar_type() == at::ScalarType::Int,
"topk_ids must be int32");
TORCH_CHECK( TORCH_CHECK(
permuted_hidden_states.scalar_type() == hidden_states.scalar_type(), permuted_hidden_states.scalar_type() == hidden_states.scalar_type(),
"permuted_hidden_states dtype must be same as hidden_states"); "topk_ids dtype must be same as src_row_id2dst_row_id_map");
auto n_token = hidden_states.size(0); auto n_token = hidden_states.size(0);
auto n_hidden = hidden_states.size(1); auto n_hidden = hidden_states.size(1);
auto stream = at::cuda::getCurrentCUDAStream().stream(); auto stream = at::cuda::getCurrentCUDAStream().stream();
const int64_t* valid_ptr =
int64_t const* valid_ptr = nullptr; get_ptr<int64_t>(expert_first_token_offset) + n_local_expert;
if (expert_first_token_offset.has_value()) {
int n_local_expert = expert_first_token_offset.value().size(0) - 1;
valid_ptr =
get_ptr<int64_t>(expert_first_token_offset.value()) + n_local_expert;
}
MOE_DISPATCH(hidden_states.scalar_type(), [&] { MOE_DISPATCH(hidden_states.scalar_type(), [&] {
finalizeMoeRoutingKernelLauncher<scalar_t, scalar_t>( finalizeMoeRoutingKernelLauncher<scalar_t, scalar_t>(
get_ptr<scalar_t>(permuted_hidden_states), get_ptr<scalar_t>(permuted_hidden_states),
get_ptr<scalar_t>(hidden_states), get_ptr<float>(topk_weights), get_ptr<scalar_t>(hidden_states), get_ptr<float>(topk_weights),
get_ptr<int>(inv_permuted_idx), n_token, n_hidden, topk, valid_ptr, get_ptr<int>(src_row_id2dst_row_id_map), get_ptr<int>(topk_ids),
stream); n_token, n_hidden, topk, valid_ptr, stream);
}); });
} }
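The permute path sorts the flattened `(token, topk)` entries by expert id and keeps both directions of the resulting mapping: the sort order itself (permuted row -> source row) and its inverse (source row -> permuted row), which the unpermute step later consumes. A compact host-side sketch of that bookkeeping with made-up ids (names here are illustrative, not the kernel's buffers):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  // Flattened topk expert ids for 3 tokens with topk = 2 (made-up values).
  const std::vector<int> topk_ids = {2, 0, 1, 2, 0, 1};

  // Stable-sort row indices by expert id: dst_to_src[d] is the source row
  // that lands in permuted position d.
  std::vector<int> dst_to_src(topk_ids.size());
  std::iota(dst_to_src.begin(), dst_to_src.end(), 0);
  std::stable_sort(dst_to_src.begin(), dst_to_src.end(),
                   [&](int a, int b) { return topk_ids[a] < topk_ids[b]; });

  // Invert it: src_to_dst[s] tells each original (token, k) entry where its
  // permuted copy lives, which is what unpermute/finalize consumes.
  std::vector<int> src_to_dst(topk_ids.size());
  for (std::size_t d = 0; d < dst_to_src.size(); ++d)
    src_to_dst[dst_to_src[d]] = static_cast<int>(d);

  for (std::size_t s = 0; s < src_to_dst.size(); ++s)
    std::printf("source row %zu -> permuted row %d\n", s, src_to_dst[s]);
}
```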

View File

@ -177,7 +177,7 @@ __global__ void getMIndicesKernel(int64_t* expert_first_token_offset,
int tidx = threadIdx.x; int tidx = threadIdx.x;
extern __shared__ int64_t smem_expert_first_token_offset[]; extern __shared__ int64_t smem_expert_first_token_offset[];
for (int i = tidx; i <= num_local_expert; i += blockDim.x) { for (int i = tidx; i <= num_local_expert; i += blockDim.x) {
smem_expert_first_token_offset[i] = __ldg(expert_first_token_offset + i); smem_expert_first_token_offset[tidx] = __ldg(expert_first_token_offset + i);
} }
__syncthreads(); __syncthreads();
auto last_token_offset = smem_expert_first_token_offset[eidx + 1]; auto last_token_offset = smem_expert_first_token_offset[eidx + 1];
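This one-line hunk is the usual cooperative fill of a shared array: each thread strides over the indices and must store through the loop index `i` (storing through `tidx`, as on one side of the hunk, leaves slots unwritten whenever the array is longer than the block). A trivial single-process rendering of the intended stride loop:

```cpp
#include <cstdio>
#include <vector>

// Stride-loop copy as done by one thread block: thread `tid` of `num_threads`
// copies elements tid, tid + num_threads, ... Writing dst[i] (not dst[tid])
// is what makes every slot get filled. Illustrative, single process.
void cooperative_copy(std::vector<long>& dst, const std::vector<long>& src,
                      int tid, int num_threads) {
  for (std::size_t i = tid; i < src.size(); i += num_threads) dst[i] = src[i];
}

int main() {
  const std::vector<long> offsets = {0, 4, 9, 16, 20};
  std::vector<long> smem(offsets.size(), -1);
  const int num_threads = 2;
  for (int tid = 0; tid < num_threads; ++tid) cooperative_copy(smem, offsets, tid, num_threads);
  for (long v : smem) std::printf("%ld ", v);
  std::printf("\n");
}
```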

View File

@ -57,19 +57,31 @@ void sortAndScanExpert(int* expert_for_source_row, const int* source_rows,
template <typename T> template <typename T>
void expandInputRowsKernelLauncher( void expandInputRowsKernelLauncher(
T const* unpermuted_input, T* permuted_output, int* sorted_experts, T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row, int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx, int* expanded_source_row_to_expanded_dest_row,
int64_t* expert_first_token_offset, int64_t const num_rows, int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k, int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
int num_local_experts, const int& align_block_size, cudaStream_t stream); int num_local_experts, const int& align_block_size, cudaStream_t stream);
// Final kernel to unpermute and scale
// This kernel unpermutes the original data, does the k-way reduction and
// performs the final skip connection.
template <typename T, typename OutputType, bool CHECK_SKIPPED>
__global__ void finalizeMoeRoutingKernel(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
int64_t const* num_valid_ptr);
template <class T, class OutputType> template <class T, class OutputType>
void finalizeMoeRoutingKernelLauncher( void finalizeMoeRoutingKernelLauncher(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output, T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row, float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int64_t const num_rows, int64_t const cols, int64_t const k, int const* expert_for_source_row, int64_t const num_rows,
int64_t const* num_valid_ptr, cudaStream_t stream); int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
cudaStream_t stream);
void preprocessTopkIdLauncher(int* topk_id_ptr, int size, void preprocessTopkIdLauncher(int* topk_id_ptr, int size,
const int* expert_map_ptr, int num_experts, const int* expert_map_ptr, int num_experts,

View File

@ -2,9 +2,10 @@
template <typename T, bool CHECK_SKIPPED, bool ALIGN_BLOCK_SIZE> template <typename T, bool CHECK_SKIPPED, bool ALIGN_BLOCK_SIZE>
__global__ void expandInputRowsKernel( __global__ void expandInputRowsKernel(
T const* unpermuted_input, T* permuted_output, int* sorted_experts, T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row, int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx, int* expanded_source_row_to_expanded_dest_row,
int64_t* expert_first_token_offset, int64_t const num_rows, int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_dest_rows, int64_t const cols, int64_t k, int64_t const* num_dest_rows, int64_t const cols, int64_t k,
int num_local_experts, int align_block_size) { int num_local_experts, int align_block_size) {
@ -53,10 +54,6 @@ __global__ void expandInputRowsKernel(
assert(expanded_dest_row <= INT32_MAX); assert(expanded_dest_row <= INT32_MAX);
expanded_source_row_to_expanded_dest_row[expanded_source_row] = expanded_source_row_to_expanded_dest_row[expanded_source_row] =
static_cast<int>(expanded_dest_row); static_cast<int>(expanded_dest_row);
// skip non local expert token
if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
permuted_idx[expanded_dest_row] = expanded_source_row;
}
} }
if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) { if (!CHECK_SKIPPED || blockIdx.x < *num_dest_rows) {
@ -65,7 +62,7 @@ __global__ void expandInputRowsKernel(
using DataElem = cutlass::Array<T, ELEM_PER_THREAD>; using DataElem = cutlass::Array<T, ELEM_PER_THREAD>;
// Duplicate and permute rows // Duplicate and permute rows
int64_t const source_row = expanded_source_row / k; int64_t const source_row = expanded_source_row % num_rows;
auto const* source_row_ptr = auto const* source_row_ptr =
reinterpret_cast<DataElem const*>(unpermuted_input + source_row * cols); reinterpret_cast<DataElem const*>(unpermuted_input + source_row * cols);
@ -85,9 +82,10 @@ __global__ void expandInputRowsKernel(
template <typename T> template <typename T>
void expandInputRowsKernelLauncher( void expandInputRowsKernelLauncher(
T const* unpermuted_input, T* permuted_output, int* sorted_experts, T const* unpermuted_input, T* permuted_output,
const float* unpermuted_scales, int* sorted_experts,
int const* expanded_dest_row_to_expanded_source_row, int const* expanded_dest_row_to_expanded_source_row,
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx, int* expanded_source_row_to_expanded_dest_row,
int64_t* expert_first_token_offset, int64_t const num_rows, int64_t* expert_first_token_offset, int64_t const num_rows,
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k, int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
int num_local_experts, const int& align_block_size, cudaStream_t stream) { int num_local_experts, const int& align_block_size, cudaStream_t stream) {
@ -107,11 +105,11 @@ void expandInputRowsKernelLauncher(
int64_t smem_size = sizeof(int64_t) * (num_local_experts + 1); int64_t smem_size = sizeof(int64_t) * (num_local_experts + 1);
func<<<blocks, threads, smem_size, stream>>>( func<<<blocks, threads, smem_size, stream>>>(
unpermuted_input, permuted_output, sorted_experts, unpermuted_input, permuted_output, unpermuted_scales, sorted_experts,
expanded_dest_row_to_expanded_source_row, expanded_dest_row_to_expanded_source_row,
expanded_source_row_to_expanded_dest_row, permuted_idx, expanded_source_row_to_expanded_dest_row, expert_first_token_offset,
expert_first_token_offset, num_rows, num_valid_tokens_ptr, cols, k, num_rows, num_valid_tokens_ptr, cols, k, num_local_experts,
num_local_experts, align_block_size); align_block_size);
} }
template <class T, class U> template <class T, class U>
@ -130,9 +128,11 @@ template <typename T, typename OutputType, bool CHECK_SKIPPED>
__global__ void finalizeMoeRoutingKernel( __global__ void finalizeMoeRoutingKernel(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output, T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row, float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int64_t const orig_cols, int64_t const k, int64_t const* num_valid_ptr) { int const* expert_for_source_row, int64_t const orig_cols, int64_t const k,
int64_t const* num_valid_ptr) {
assert(orig_cols % 4 == 0); assert(orig_cols % 4 == 0);
int64_t const original_row = blockIdx.x; int64_t const original_row = blockIdx.x;
int64_t const num_rows = gridDim.x;
auto const offset = original_row * orig_cols; auto const offset = original_row * orig_cols;
OutputType* reduced_row_ptr = reduced_unpermuted_output + offset; OutputType* reduced_row_ptr = reduced_unpermuted_output + offset;
int64_t const num_valid = *num_valid_ptr; int64_t const num_valid = *num_valid_ptr;
@ -159,13 +159,14 @@ __global__ void finalizeMoeRoutingKernel(
ComputeElem thread_output; ComputeElem thread_output;
thread_output.fill(0); thread_output.fill(0);
for (int k_idx = 0; k_idx < k; ++k_idx) { for (int k_idx = 0; k_idx < k; ++k_idx) {
int64_t const expanded_original_row = original_row * k + k_idx; int64_t const expanded_original_row = original_row + k_idx * num_rows;
int64_t const expanded_permuted_row = int64_t const expanded_permuted_row =
expanded_source_row_to_expanded_dest_row[expanded_original_row]; expanded_source_row_to_expanded_dest_row[expanded_original_row];
int64_t const k_offset = original_row * k + k_idx; int64_t const k_offset = original_row * k + k_idx;
float const row_scale = scales[k_offset]; float const row_scale = scales[k_offset];
// Check after row_rescale has accumulated
if (CHECK_SKIPPED && expanded_permuted_row >= num_valid) { if (CHECK_SKIPPED && expanded_permuted_row >= num_valid) {
continue; continue;
} }
@ -188,8 +189,9 @@ template <class T, class OutputType>
void finalizeMoeRoutingKernelLauncher( void finalizeMoeRoutingKernelLauncher(
T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output, T const* expanded_permuted_rows, OutputType* reduced_unpermuted_output,
float const* scales, int const* expanded_source_row_to_expanded_dest_row, float const* scales, int const* expanded_source_row_to_expanded_dest_row,
int64_t const num_rows, int64_t const cols, int64_t const k, int const* expert_for_source_row, int64_t const num_rows,
int64_t const* num_valid_ptr, cudaStream_t stream) { int64_t const cols, int64_t const k, int64_t const* num_valid_ptr,
cudaStream_t stream) {
int64_t const blocks = num_rows; int64_t const blocks = num_rows;
int64_t const threads = 256; int64_t const threads = 256;
bool const check_finished = num_valid_ptr != nullptr; bool const check_finished = num_valid_ptr != nullptr;
@ -199,5 +201,6 @@ void finalizeMoeRoutingKernelLauncher(
auto* const kernel = func_map[check_finished]; auto* const kernel = func_map[check_finished];
kernel<<<blocks, threads, 0, stream>>>( kernel<<<blocks, threads, 0, stream>>>(
expanded_permuted_rows, reduced_unpermuted_output, scales, expanded_permuted_rows, reduced_unpermuted_output, scales,
expanded_source_row_to_expanded_dest_row, cols, k, num_valid_ptr); expanded_source_row_to_expanded_dest_row, expert_for_source_row, cols, k,
num_valid_ptr);
} }
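`finalizeMoeRoutingKernel` gathers a token's k expert outputs back out of the permuted buffer, scales each by its routing weight, and accumulates them; the two sides of the hunk differ mainly in how the expanded row index is formed (`row * k + k_idx` versus `row + k_idx * num_rows`). A single-threaded reference of the reduction itself, with made-up data and the `num_valid` masking omitted:

```cpp
#include <cstdio>
#include <vector>

// Reference k-way weighted reduction for MoE unpermute: for each token, gather
// its k rows from the permuted activations via src_to_dst and accumulate with
// the routing weights. Illustrative layout and values only.
int main() {
  const int num_tokens = 2, k = 2, cols = 3;
  const std::vector<std::vector<float>> permuted_rows = {
      {1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}};     // made-up expert outputs
  const std::vector<int> src_to_dst = {0, 2, 1, 3};     // (token * k + k_idx) -> permuted row
  const std::vector<float> scales = {0.7f, 0.3f, 0.5f, 0.5f};

  std::vector<std::vector<float>> out(num_tokens, std::vector<float>(cols, 0.0f));
  for (int t = 0; t < num_tokens; ++t)
    for (int k_idx = 0; k_idx < k; ++k_idx) {
      const int expanded = t * k + k_idx;               // indexing scheme from one side of the hunk
      const int permuted_row = src_to_dst[expanded];
      for (int c = 0; c < cols; ++c)
        out[t][c] += scales[expanded] * permuted_rows[permuted_row][c];
    }

  std::printf("token 0, col 0: %f\n", out[0][0]);
}
```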

View File

@ -190,8 +190,8 @@ __launch_bounds__(TPB) __global__ void moeTopK(
2) This implementation assumes k is small, but will work for any k. 2) This implementation assumes k is small, but will work for any k.
*/ */
template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename IndType> template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, typename IndType>
__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE) __global__
void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices, void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices,
int* source_rows, const int k, const int start_expert, const int end_expert) int* source_rows, const int k, const int start_expert, const int end_expert)
{ {
@ -209,12 +209,12 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
// Restrictions based on previous section. // Restrictions based on previous section.
static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg"); static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg");
static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp"); static_assert(WARP_SIZE % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
static_assert(THREADS_PER_ROW == (THREADS_PER_ROW & -THREADS_PER_ROW), "THREADS_PER_ROW must be power of 2"); static_assert(THREADS_PER_ROW == (THREADS_PER_ROW & -THREADS_PER_ROW), "THREADS_PER_ROW must be power of 2");
static_assert(THREADS_PER_ROW <= WARP_SIZE_PARAM, "THREADS_PER_ROW can be at most warp size"); static_assert(THREADS_PER_ROW <= WARP_SIZE, "THREADS_PER_ROW can be at most warp size");
// We have NUM_EXPERTS elements per row. We specialize for small #experts // We have NUM_EXPERTS elements per row. We specialize for small #experts
static constexpr int ELTS_PER_WARP = WARP_SIZE_PARAM * VPT; static constexpr int ELTS_PER_WARP = WARP_SIZE * VPT;
static constexpr int ROWS_PER_WARP = ELTS_PER_WARP / ELTS_PER_ROW; static constexpr int ROWS_PER_WARP = ELTS_PER_WARP / ELTS_PER_ROW;
static constexpr int ROWS_PER_CTA = WARPS_PER_CTA * ROWS_PER_WARP; static constexpr int ROWS_PER_CTA = WARPS_PER_CTA * ROWS_PER_WARP;
@ -393,51 +393,41 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
namespace detail namespace detail
{ {
// Constructs some constants needed to partition the work across threads at compile time. // Constructs some constants needed to partition the work across threads at compile time.
template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM> template <int EXPERTS, int BYTES_PER_LDG>
struct TopkConstants struct TopkConstants
{ {
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float); static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float);
static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, ""); static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE) == 0, "");
static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM)); static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE));
static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG; static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
static constexpr int THREADS_PER_ROW = EXPERTS / VPT; static constexpr int THREADS_PER_ROW = EXPERTS / VPT;
static const int ROWS_PER_WARP = WARP_SIZE_PARAM / THREADS_PER_ROW; static constexpr int ROWS_PER_WARP = WARP_SIZE / THREADS_PER_ROW;
}; };
} // namespace detail } // namespace detail
template <int EXPERTS, int WARPS_PER_TB, int WARP_SIZE_PARAM, typename IndType> template <int EXPERTS, int WARPS_PER_TB, typename IndType>
void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices, void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices,
int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream) int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream)
{ {
static constexpr std::size_t MAX_BYTES_PER_LDG = 16; static constexpr std::size_t MAX_BYTES_PER_LDG = 16;
static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS); static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS);
using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG, WARP_SIZE_PARAM>; using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG>;
static constexpr int VPT = Constants::VPT; static constexpr int VPT = Constants::VPT;
static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP; static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP;
const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP; const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP;
const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB; const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB;
dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB); dim3 block_dim(WARP_SIZE, WARPS_PER_TB);
topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG, WARP_SIZE_PARAM><<<num_blocks, block_dim, 0, stream>>>( topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG><<<num_blocks, block_dim, 0, stream>>>(
input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert); input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert);
} }
#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \ #define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB) \
switch (warpSize) { \ topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB>( \
case 32: \ gating_output, nullptr, topk_weights, topk_indices, \
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 32>( \ token_expert_indices, num_tokens, topk, 0, num_experts, \
gating_output, nullptr, topk_weights, topk_indices, \ stream);
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
break; \
case 64: \
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 64>( \
gating_output, nullptr, topk_weights, topk_indices, \
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
break; \
default: \
TORCH_CHECK(false, "Unsupported warp size: ", warpSize); \
}
template <typename IndType> template <typename IndType>
void topkGatingSoftmaxKernelLauncher( void topkGatingSoftmaxKernelLauncher(
@ -451,7 +441,6 @@ void topkGatingSoftmaxKernelLauncher(
const int topk, const int topk,
cudaStream_t stream) { cudaStream_t stream) {
static constexpr int WARPS_PER_TB = 4; static constexpr int WARPS_PER_TB = 4;
auto warpSize = WARP_SIZE;
switch (num_experts) { switch (num_experts) {
case 1: case 1:
LAUNCH_SOFTMAX(1, WARPS_PER_TB); LAUNCH_SOFTMAX(1, WARPS_PER_TB);
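
Note on the hunk above: the left-hand `LAUNCH_SOFTMAX` variant turns the runtime `warpSize` into the compile-time `WARP_SIZE_PARAM` template argument by switching over the only two widths that occur in practice (32-wide NVIDIA warps, 64-wide AMD wavefronts), so the kernel's `static_assert`s and block dimensions can treat the warp width as a constant. A minimal sketch of that dispatch pattern, with a hypothetical `run_kernel<WS>()` standing in for `topkGatingSoftmaxLauncherHelper`:

```cpp
#include <stdexcept>

// Hypothetical launcher: WS plays the role of WARP_SIZE_PARAM, so block
// shapes and static_asserts inside can use the warp width as a constant.
template <int WS>
void run_kernel(/* kernel arguments */) {
  // dim3 block_dim(WS, WARPS_PER_TB); kernel<..., WS><<<...>>>(...);
}

// Runtime-to-compile-time dispatch: only the two real warp widths are
// instantiated; anything else is rejected up front.
void launch_for_warp_size(int warp_size) {
  switch (warp_size) {
    case 32: run_kernel<32>(); break;
    case 64: run_kernel<64>(); break;
    default: throw std::invalid_argument("Unsupported warp size");
  }
}
```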

View File

@ -56,17 +56,18 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
" -> Tensor"); " -> Tensor");
m.def( m.def(
"moe_permute(Tensor input, Tensor topk_ids," "moe_permute(Tensor input, Tensor topk_weight, Tensor! topk_ids,"
"Tensor token_expert_indices, Tensor? expert_map, int n_expert," "Tensor token_expert_indices, Tensor? expert_map, int n_expert,"
"int n_local_expert," "int n_local_expert,"
"int topk, int? align_block_size,Tensor! permuted_input, Tensor! " "int topk, int? align_block_size,Tensor! permuted_input, Tensor! "
"expert_first_token_offset, Tensor! inv_permuted_idx, Tensor! " "expert_first_token_offset, Tensor! src_row_id2dst_row_id_map, Tensor! "
"permuted_idx, Tensor! m_indices)->()"); "m_indices)->()");
m.def( m.def(
"moe_unpermute(Tensor permuted_hidden_states, Tensor topk_weights," "moe_unpermute(Tensor permuted_hidden_states, Tensor topk_weights,"
"Tensor inv_permuted_idx, Tensor? expert_first_token_offset, " "Tensor topk_ids,Tensor src_row_id2dst_row_id_map, Tensor "
"int topk, Tensor! hidden_states)->()"); "expert_first_token_offset, int n_expert, int n_local_expert,int "
"topk, Tensor! hidden_states)->()");
m.def("moe_permute_unpermute_supported() -> bool"); m.def("moe_permute_unpermute_supported() -> bool");
m.impl("moe_permute_unpermute_supported", &moe_permute_unpermute_supported); m.impl("moe_permute_unpermute_supported", &moe_permute_unpermute_supported);

View File

@ -167,6 +167,19 @@ void cutlass_mla_decode(torch::Tensor const& out, torch::Tensor const& q_nope,
torch::Tensor const& seq_lens, torch::Tensor const& seq_lens,
torch::Tensor const& page_table, double scale); torch::Tensor const& page_table, double scale);
void sm100_cutlass_mla_decode(
torch::Tensor const& out, torch::Tensor const& q_nope,
torch::Tensor const& q_pe, torch::Tensor const& kv_c_and_k_pe_cache,
torch::Tensor const& seq_lens, torch::Tensor const& page_table,
torch::Tensor const& workspace, double sm_scale,
int64_t num_kv_splits =
1 /* Set to 1 to avoid cuda_graph issue by default. */);
int64_t sm100_cutlass_mla_get_workspace_size(
int64_t max_seq_len, int64_t num_batches, int64_t sm_count = 0,
int64_t num_kv_splits =
1 /* Set to 1 to avoid cuda_graph issue by default. */);
torch::Tensor get_cuda_view_from_cpu_tensor(torch::Tensor& cpu_tensor); torch::Tensor get_cuda_view_from_cpu_tensor(torch::Tensor& cpu_tensor);
#ifndef USE_ROCM #ifndef USE_ROCM
@ -287,16 +300,6 @@ void scaled_fp4_experts_quant(
torch::Tensor const& input, torch::Tensor const& input_global_scale, torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts, torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts); torch::Tensor const& output_scale_offset_by_experts);
void per_token_group_quant_fp8(const torch::Tensor& input,
torch::Tensor& output_q, torch::Tensor& output_s,
int64_t group_size, double eps, double fp8_min,
double fp8_max, bool scale_ue8m0);
void per_token_group_quant_int8(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double int8_min, double int8_max);
#endif #endif
void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input,

View File

@ -4,7 +4,7 @@
#include <cmath> #include <cmath>
#include "core/math.hpp" #include "core/math.hpp"
#include "../cuda_compat.h" #include "cuda_compat.h"
#include "dispatch_utils.h" #include "dispatch_utils.h"
#include "quantization/fp8/common.cuh" #include "quantization/fp8/common.cuh"

View File

@ -1,8 +1,6 @@
#include <ATen/cuda/CUDAContext.h> #include <ATen/cuda/CUDAContext.h>
#include <torch/all.h> #include <torch/all.h>
#include "../per_token_group_quant_8bit.h"
#include <cmath> #include <cmath>
#include "../../dispatch_utils.h" #include "../../dispatch_utils.h"
@ -338,11 +336,3 @@ void dynamic_scaled_int8_quant(
} }
}); });
} }
void per_token_group_quant_int8(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double int8_min, double int8_max) {
per_token_group_quant_8bit(input, output_q, output_s, group_size, eps,
int8_min, int8_max);
}
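
For reference, the wrapper removed here forwards to `per_token_group_quant_8bit`, whose kernel (the deleted `per_token_group_quant_8bit.cu` later in this diff) computes one scale per group from the group's absolute maximum and then clamps the rescaled values into the 8-bit range. Reading that kernel, the per-group math is

$$ s_g = \frac{\max\left(\varepsilon,\ \max_{i \in g}\lvert x_i\rvert\right)}{q_{\max}}, \qquad \hat{x}_i = \mathrm{cast}\big(\mathrm{clamp}(x_i / s_g,\ q_{\min},\ q_{\max})\big), $$

with `q_min`/`q_max` being `int8_min`/`int8_max` here, or the FP8 limits in the FP8 wrapper.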

View File

@ -21,49 +21,27 @@ struct sm90_fp8_config_default {
cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
using TileShape = cute::Shape<cute::_64, cute::_256, cute::_128>; using TileShape = cute::Shape<cute::_64, cute::_256, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>; using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;
using ArchTag = cutlass::arch::Sm90;
using Cutlass3xGemm = using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape, cutlass_3x_group_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
ClusterShape, KernelSchedule, EpilogueSchedule>; KernelSchedule, EpilogueSchedule>;
}; };
template <typename InType, typename OutType, template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue> template <typename, typename, typename> typename Epilogue>
struct sm90_fp8_config_M4 { struct sm90_fp8_config_M16 {
// M in [1, 4] // M in [1, 16]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>()); static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule = using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum; cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum;
using EpilogueSchedule = using EpilogueSchedule =
cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
using TileShape = cute::Shape<cute::_128, cute::_16, cute::_128>; using TileShape = cute::Shape<cute::_64, cute::_64, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>; using ClusterShape = cute::Shape<cute::_1, cute::_4, cute::_1>;
using ArchTag = cutlass::arch::Sm90;
using Cutlass3xGemm = using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape, cutlass_3x_group_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
ClusterShape, KernelSchedule, EpilogueSchedule, KernelSchedule, EpilogueSchedule>;
true>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm90_fp8_config_M64 {
// M in (4, 64]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum;
using EpilogueSchedule =
cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
using TileShape = cute::Shape<cute::_128, cute::_16, cute::_256>;
using ClusterShape = cute::Shape<cute::_2, cute::_1, cute::_1>;
using ArchTag = cutlass::arch::Sm90;
using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>;
}; };
template <typename InType, typename OutType, template <typename InType, typename OutType,
@ -77,11 +55,10 @@ struct sm90_fp8_config_K8192 {
cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
using TileShape = cute::Shape<cute::_128, cute::_128, cute::_128>; using TileShape = cute::Shape<cute::_128, cute::_128, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_8, cute::_1>; using ClusterShape = cute::Shape<cute::_1, cute::_8, cute::_1>;
using ArchTag = cutlass::arch::Sm90;
using Cutlass3xGemm = using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape, cutlass_3x_group_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
ClusterShape, KernelSchedule, EpilogueSchedule>; KernelSchedule, EpilogueSchedule>;
}; };
template <typename InType, typename OutType, template <typename InType, typename OutType,
@ -95,11 +72,10 @@ struct sm90_fp8_config_N8192 {
cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
using TileShape = cute::Shape<cute::_64, cute::_128, cute::_256>; using TileShape = cute::Shape<cute::_64, cute::_128, cute::_256>;
using ClusterShape = cute::Shape<cute::_1, cute::_8, cute::_1>; using ClusterShape = cute::Shape<cute::_1, cute::_8, cute::_1>;
using ArchTag = cutlass::arch::Sm90;
using Cutlass3xGemm = using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape, cutlass_3x_group_gemm<InType, OutType, Epilogue, TileShape, ClusterShape,
ClusterShape, KernelSchedule, EpilogueSchedule>; KernelSchedule, EpilogueSchedule>;
}; };
template <typename InType, typename OutType> template <typename InType, typename OutType>
@ -119,13 +95,14 @@ void run_cutlass_moe_mm_sm90(
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn, TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn,
"B tensors must be of type float8_e4m3fn."); "B tensors must be of type float8_e4m3fn.");
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn);
using Cutlass3xGemmN8192 = typename sm90_fp8_config_N8192< using Cutlass3xGemmN8192 = typename sm90_fp8_config_N8192<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmK8192 = typename sm90_fp8_config_K8192< using Cutlass3xGemmK8192 = typename sm90_fp8_config_K8192<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmM4 = typename sm90_fp8_config_M4< using Cutlass3xGemmM16 = typename sm90_fp8_config_M16<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmM64 = typename sm90_fp8_config_M64<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmDefault = typename sm90_fp8_config_default< using Cutlass3xGemmDefault = typename sm90_fp8_config_default<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm; InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
@ -134,18 +111,7 @@ void run_cutlass_moe_mm_sm90(
uint32_t const n = out_tensors.size(1); uint32_t const n = out_tensors.size(1);
uint32_t const k = a_tensors.size(1); uint32_t const k = a_tensors.size(1);
// Use swap_ab for M <= 64 by default to reduce padding if (n >= 8192) {
if (m <= 4) {
cutlass_group_gemm_caller<Cutlass3xGemmM4>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else if (m <= 64) {
cutlass_group_gemm_caller<Cutlass3xGemmM64>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else if (n >= 8192) {
cutlass_group_gemm_caller<Cutlass3xGemmN8192>( cutlass_group_gemm_caller<Cutlass3xGemmN8192>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token, problem_sizes, a_strides, b_strides, c_strides, per_act_token,
@ -155,6 +121,11 @@ void run_cutlass_moe_mm_sm90(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token, problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch); per_out_ch);
} else if (m <= 16) {
cutlass_group_gemm_caller<Cutlass3xGemmM16>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else { } else {
cutlass_group_gemm_caller<Cutlass3xGemmDefault>( cutlass_group_gemm_caller<Cutlass3xGemmDefault>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
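
A note on the `m <= 4` / `m <= 64` branches on the left-hand side of this file: those configs are built with `swap_ab` enabled ("Use swap_ab for M <= 64 by default to reduce padding"). The idea is the standard transpose trick: for very small M, computing the transposed problem lets the tiny dimension land on N instead of being padded up to the tile's M extent,

$$ D = \alpha\, A B \;\Longleftrightarrow\; D^{\top} = \alpha\, B^{\top} A^{\top}, $$

which is why the `swap_ab` paths later in this diff exchange the A/B pointer arrays, strides, and scale broadcasting, and use transposed layouts.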

View File

@ -18,34 +18,28 @@ using ProblemShape =
cutlass::gemm::GroupProblemShape<cute::Shape<int, int, int>>; cutlass::gemm::GroupProblemShape<cute::Shape<int, int, int>>;
using ElementAccumulator = float; using ElementAccumulator = float;
using ArchTag = cutlass::arch::Sm90;
using OperatorClass = cutlass::arch::OpClassTensorOp; using OperatorClass = cutlass::arch::OpClassTensorOp;
using LayoutA = cutlass::layout::RowMajor; using LayoutA = cutlass::layout::RowMajor;
using LayoutA_Transpose =
typename cutlass::layout::LayoutTranspose<LayoutA>::type;
using LayoutB = cutlass::layout::ColumnMajor; using LayoutB = cutlass::layout::ColumnMajor;
using LayoutB_Transpose = using LayoutC = cutlass::layout::RowMajor;
typename cutlass::layout::LayoutTranspose<LayoutB>::type;
using LayoutD = cutlass::layout::RowMajor;
using LayoutD_Transpose =
typename cutlass::layout::LayoutTranspose<LayoutD>::type;
using LayoutC = LayoutD;
using LayoutC_Transpose = LayoutD_Transpose;
template <typename ElementAB_, typename ElementC_, typename ArchTag_, template <typename ElementAB_, typename ElementC_,
template <typename, typename, typename> typename Epilogue_, template <typename, typename, typename> typename Epilogue_,
typename TileShape, typename ClusterShape, typename KernelSchedule, typename TileShape, typename ClusterShape, typename KernelSchedule,
typename EpilogueSchedule, bool swap_ab_ = false> typename EpilogueSchedule>
struct cutlass_3x_group_gemm { struct cutlass_3x_group_gemm {
static constexpr bool swap_ab = swap_ab_;
using ElementAB = ElementAB_; using ElementAB = ElementAB_;
using ElementC = void; using ElementC = void;
using ElementD = ElementC_; using ElementD = ElementC_;
using ElementAccumulator = float; using ElementAccumulator = float;
using ArchTag = ArchTag_;
using Epilogue = Epilogue_<ElementAccumulator, ElementD, TileShape>; using Epilogue = Epilogue_<ElementAccumulator, ElementD, TileShape>;
using StrideC =
cute::remove_pointer_t<cute::Stride<int64_t, cute::Int<1>, cute::Int<0>>>;
static constexpr int AlignmentAB = static constexpr int AlignmentAB =
128 / cutlass::sizeof_bits<ElementAB>::value; 128 / cutlass::sizeof_bits<ElementAB>::value;
static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementD>::value; static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementD>::value;
@ -56,28 +50,21 @@ struct cutlass_3x_group_gemm {
typename cutlass::epilogue::collective::CollectiveBuilder< typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass, TileShape, ClusterShape, ArchTag, OperatorClass, TileShape, ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator, cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator,
ElementAccumulator, ElementC, ElementAccumulator, ElementC, LayoutC*, AlignmentC, ElementD,
conditional_t<swap_ab, LayoutC_Transpose*, LayoutC*>, AlignmentC, LayoutC*, AlignmentC, EpilogueSchedule, EVTCompute>::CollectiveOp;
ElementD, conditional_t<swap_ab, LayoutD_Transpose*, LayoutD*>,
AlignmentC, EpilogueSchedule, EVTCompute>::CollectiveOp;
static constexpr size_t CEStorageSize = static constexpr size_t CEStorageSize =
sizeof(typename CollectiveEpilogue::SharedStorage); sizeof(typename CollectiveEpilogue::SharedStorage);
using Stages = typename cutlass::gemm::collective::StageCountAutoCarveout< using Stages = typename cutlass::gemm::collective::StageCountAutoCarveout<
static_cast<int>(CEStorageSize)>; static_cast<int>(CEStorageSize)>;
using CollectiveMainloop = conditional_t< using CollectiveMainloop =
swap_ab,
typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass, ElementAB, LayoutB_Transpose*, AlignmentAB,
ElementAB, LayoutA_Transpose*, AlignmentAB, ElementAccumulator,
TileShape, ClusterShape, Stages, KernelSchedule>::CollectiveOp,
typename cutlass::gemm::collective::CollectiveBuilder< typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass, ElementAB, LayoutA*, AlignmentAB, ElementAB, ArchTag, OperatorClass, ElementAB, LayoutA*, AlignmentAB, ElementAB,
LayoutB*, AlignmentAB, ElementAccumulator, TileShape, ClusterShape, LayoutB*, AlignmentAB, ElementAccumulator, TileShape, ClusterShape,
Stages, KernelSchedule>::CollectiveOp>; Stages, KernelSchedule>::CollectiveOp;
using KernelType = enable_sm90_or_later<cutlass::gemm::kernel::GemmUniversal< using KernelType = enable_sm90_only<cutlass::gemm::kernel::GemmUniversal<
ProblemShape, CollectiveMainloop, CollectiveEpilogue>>; ProblemShape, CollectiveMainloop, CollectiveEpilogue>>;
struct GemmKernel : public KernelType {}; struct GemmKernel : public KernelType {};
@ -91,12 +78,12 @@ void cutlass_group_gemm_caller(
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides, torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides, torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch) { bool per_act_token, bool per_out_ch) {
static constexpr bool swap_ab = Gemm::swap_ab;
using ElementAB = typename Gemm::ElementAB; using ElementAB = typename Gemm::ElementAB;
using ElementD = typename Gemm::ElementD; using ElementD = typename Gemm::ElementD;
int num_experts = static_cast<int>(expert_offsets.size(0)); int num_experts = static_cast<int>(expert_offsets.size(0));
int k_size = a_tensors.size(1);
int n_size = out_tensors.size(1);
auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index()); auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index());
@ -123,47 +110,26 @@ void cutlass_group_gemm_caller(
problem_sizes.data_ptr()); problem_sizes.data_ptr());
ProblemShape prob_shape{num_experts, problem_sizes_as_shapes, nullptr}; ProblemShape prob_shape{num_experts, problem_sizes_as_shapes, nullptr};
typename GemmKernel::MainloopArguments mainloop_args; typename GemmKernel::MainloopArguments mainloop_args{
if constexpr (swap_ab) { static_cast<const ElementAB**>(a_ptrs.data_ptr()),
mainloop_args = typename GemmKernel::MainloopArguments{ static_cast<StrideA*>(a_strides.data_ptr()),
static_cast<const ElementAB**>(b_ptrs.data_ptr()), static_cast<const ElementAB**>(b_ptrs.data_ptr()),
static_cast<StrideB*>(b_strides.data_ptr()), static_cast<StrideB*>(b_strides.data_ptr())};
static_cast<const ElementAB**>(a_ptrs.data_ptr()),
static_cast<StrideA*>(a_strides.data_ptr())};
} else {
mainloop_args = typename GemmKernel::MainloopArguments{
static_cast<const ElementAB**>(a_ptrs.data_ptr()),
static_cast<StrideA*>(a_strides.data_ptr()),
static_cast<const ElementAB**>(b_ptrs.data_ptr()),
static_cast<StrideB*>(b_strides.data_ptr())};
}
// Currently, we are only able to do broadcast on either all or none a_scales // Currently, we are only able to do broadcast on either all or none a_scales
// and on either all or none b_scales // and on either all or none b_scales
typename GemmKernel::EpilogueArguments epilogue_args{ typename GemmKernel::EpilogueArguments epilogue_args{
Gemm::Epilogue::prepare_args( Gemm::Epilogue::prepare_args(
swap_ab ? static_cast<const ElementAccumulator**>( static_cast<const ElementAccumulator**>(a_scales_ptrs.data_ptr()),
b_scales_ptrs.data_ptr()) static_cast<const ElementAccumulator**>(b_scales_ptrs.data_ptr()),
: static_cast<const ElementAccumulator**>( per_act_token, per_out_ch),
a_scales_ptrs.data_ptr()),
swap_ab ? static_cast<const ElementAccumulator**>(
a_scales_ptrs.data_ptr())
: static_cast<const ElementAccumulator**>(
b_scales_ptrs.data_ptr()),
swap_ab ? per_out_ch : per_act_token,
swap_ab ? per_act_token : per_out_ch),
nullptr, static_cast<StrideC*>(c_strides.data_ptr()), nullptr, static_cast<StrideC*>(c_strides.data_ptr()),
static_cast<ElementD**>(out_ptrs.data_ptr()), static_cast<ElementD**>(out_ptrs.data_ptr()),
static_cast<StrideC*>(c_strides.data_ptr())}; static_cast<StrideC*>(c_strides.data_ptr())};
int device_id = a_tensors.device().index();
static const cutlass::KernelHardwareInfo hw_info{
device_id, cutlass::KernelHardwareInfo::query_device_multiprocessor_count(
device_id)};
typename GemmKernel::Arguments args{ typename GemmKernel::Arguments args{
cutlass::gemm::GemmUniversalMode::kGrouped, prob_shape, mainloop_args, cutlass::gemm::GemmUniversalMode::kGrouped, prob_shape, mainloop_args,
epilogue_args, hw_info}; epilogue_args};
using GemmOp = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>; using GemmOp = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
GemmOp gemm_op; GemmOp gemm_op;
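
For readers unfamiliar with the CUTLASS 3.x device adapter constructed at the end of this hunk: once the `Arguments` above are assembled, the usual driver sequence is to query `can_implement`, allocate the reported workspace, then `initialize` and `run` on the current stream. A rough sketch of that sequence under those assumptions (simplified; `args`, `stream`, and `a_tensors` are taken from the surrounding caller, and error handling is reduced to `TORCH_CHECK`):

```cpp
using GemmOp = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
GemmOp gemm_op;

// Reject unsupported shapes/alignments before launching anything.
TORCH_CHECK(gemm_op.can_implement(args) == cutlass::Status::kSuccess);

// Grouped GEMMs need a small device workspace whose size depends on the args.
size_t workspace_size = gemm_op.get_workspace_size(args);
auto workspace = torch::empty(
    {static_cast<int64_t>(workspace_size)},
    torch::TensorOptions().dtype(torch::kUInt8).device(a_tensors.device()));

TORCH_CHECK(gemm_op.initialize(args, workspace.data_ptr(), stream) ==
            cutlass::Status::kSuccess);
TORCH_CHECK(gemm_op.run(stream) == cutlass::Status::kSuccess);
```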

View File

@ -1,140 +0,0 @@
#include <cudaTypedefs.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>
#include "cutlass/cutlass.h"
#include "grouped_mm_c3x.cuh"
using namespace cute;
namespace {
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm100_fp8_config_default {
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
using TileShape = cute::Shape<cute::_128, cute::_256, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;
using ArchTag = cutlass::arch::Sm100;
using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm100_fp8_config_M64 {
// M in [1,64]
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecialized1SmSm100;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
using TileShape = cute::Shape<cute::_128, cute::_16, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;
using ArchTag = cutlass::arch::Sm100;
using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule,
true>;
};
template <typename InType, typename OutType,
template <typename, typename, typename> typename Epilogue>
struct sm100_fp8_config_N8192 {
// N in [8192, inf)
static_assert(std::is_same<InType, cutlass::float_e4m3_t>());
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecialized2SmSm100;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized2Sm;
using TileShape = cute::Shape<cute::_128, cute::_256, cute::_128>;
using ClusterShape = cute::Shape<cute::_2, cute::_1, cute::_1>;
using ArchTag = cutlass::arch::Sm100;
using Cutlass3xGemm =
cutlass_3x_group_gemm<InType, OutType, ArchTag, Epilogue, TileShape,
ClusterShape, KernelSchedule, EpilogueSchedule>;
};
template <typename InType, typename OutType>
void run_cutlass_moe_mm_sm100(
torch::Tensor& out_tensors, torch::Tensor const& a_tensors,
torch::Tensor const& b_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch) {
TORCH_CHECK(a_tensors.size(0) > 0, "No input A tensors provided.");
TORCH_CHECK(b_tensors.size(0) > 0, "No input B tensors provided.");
TORCH_CHECK(out_tensors.size(0) > 0, "No output tensors provided.");
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn,
"A tensors must be of type float8_e4m3fn.");
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn,
"B tensors must be of type float8_e4m3fn.");
using Cutlass3xGemmDefault = typename sm100_fp8_config_default<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmN8192 = typename sm100_fp8_config_N8192<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
using Cutlass3xGemmM64 = typename sm100_fp8_config_M64<
InType, OutType, vllm::c3x::ScaledEpilogueArray>::Cutlass3xGemm;
uint32_t const m = a_tensors.size(0);
uint32_t const n = out_tensors.size(1);
if (m <= 64) {
cutlass_group_gemm_caller<Cutlass3xGemmM64>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else if (n >= 8192) {
cutlass_group_gemm_caller<Cutlass3xGemmN8192>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else {
cutlass_group_gemm_caller<Cutlass3xGemmDefault>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
}
}
} // namespace
void dispatch_moe_mm_sm100(
torch::Tensor& out_tensors, torch::Tensor const& a_tensors,
torch::Tensor const& b_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch) {
if (out_tensors.dtype() == torch::kBFloat16) {
run_cutlass_moe_mm_sm100<cutlass::float_e4m3_t, cutlass::bfloat16_t>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
} else {
run_cutlass_moe_mm_sm100<cutlass::float_e4m3_t, cutlass::half_t>(
out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets,
problem_sizes, a_strides, b_strides, c_strides, per_act_token,
per_out_ch);
}
}
void cutlass_moe_mm_sm100(
torch::Tensor& out_tensors, torch::Tensor const& a_tensors,
torch::Tensor const& b_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch) {
dispatch_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales,
expert_offsets, problem_sizes, a_strides, b_strides,
c_strides, per_act_token, per_out_ch);
}

View File

@ -6,10 +6,7 @@
#include <iostream> #include <iostream>
constexpr uint64_t THREADS_PER_EXPERT = 512; constexpr uint64_t THREADS_PER_EXPERT = 512;
// threshold must match the dispatch logic in run_cutlass_moe_mm_sm90()
constexpr int SWAP_AB_THRESHOLD = 64;
template <bool SWAP_AB>
__global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids, __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids,
int32_t* problem_sizes1, int32_t* problem_sizes1,
int32_t* problem_sizes2, int32_t* problem_sizes2,
@ -27,51 +24,40 @@ __global__ void compute_problem_sizes(const int32_t* __restrict__ topk_ids,
if (threadIdx.x == 0) { if (threadIdx.x == 0) {
int final_occurrences = atomic_buffer[expert_id]; int final_occurrences = atomic_buffer[expert_id];
if constexpr (!SWAP_AB) { problem_sizes1[expert_id * 3] = final_occurrences;
problem_sizes1[expert_id * 3] = final_occurrences; problem_sizes1[expert_id * 3 + 1] = 2 * n;
problem_sizes1[expert_id * 3 + 1] = 2 * n; problem_sizes1[expert_id * 3 + 2] = k;
problem_sizes1[expert_id * 3 + 2] = k; problem_sizes2[expert_id * 3] = final_occurrences;
problem_sizes2[expert_id * 3] = final_occurrences; problem_sizes2[expert_id * 3 + 1] = k;
problem_sizes2[expert_id * 3 + 1] = k; problem_sizes2[expert_id * 3 + 2] = n;
problem_sizes2[expert_id * 3 + 2] = n;
} else {
problem_sizes1[expert_id * 3] = 2 * n;
problem_sizes1[expert_id * 3 + 1] = final_occurrences;
problem_sizes1[expert_id * 3 + 2] = k;
problem_sizes2[expert_id * 3] = k;
problem_sizes2[expert_id * 3 + 1] = final_occurrences;
problem_sizes2[expert_id * 3 + 2] = n;
}
} }
} }
__global__ void compute_expert_offsets( __global__ void compute_expert_offsets(
const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
int32_t* atomic_buffer, const int num_experts, const bool swap_ab) { int32_t* atomic_buffer, const int num_experts) {
int32_t tot_offset = 0; int32_t tot_offset = 0;
expert_offsets[0] = 0; expert_offsets[0] = 0;
for (int i = 0; i < num_experts; ++i) { for (int i = 0; i < num_experts; ++i) {
atomic_buffer[i] = tot_offset; atomic_buffer[i] = tot_offset;
tot_offset += swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3]; tot_offset += problem_sizes1[i * 3];
expert_offsets[i + 1] = tot_offset; expert_offsets[i + 1] = tot_offset;
} }
} }
__global__ void compute_expert_blockscale_offsets( __global__ void compute_expert_blockscale_offsets(
const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets, const int32_t* __restrict__ problem_sizes1, int32_t* expert_offsets,
int32_t* blockscale_offsets, int32_t* atomic_buffer, const int num_experts, int32_t* blockscale_offsets, int32_t* atomic_buffer,
const bool swap_ab) { const int num_experts) {
int32_t tot_offset = 0; int32_t tot_offset = 0;
int32_t tot_offset_round = 0; int32_t tot_offset_round = 0;
expert_offsets[0] = 0; expert_offsets[0] = 0;
blockscale_offsets[0] = 0; blockscale_offsets[0] = 0;
for (int i = 0; i < num_experts; ++i) { for (int i = 0; i < num_experts; ++i) {
int32_t cur_offset =
swap_ab ? problem_sizes1[i * 3 + 1] : problem_sizes1[i * 3];
atomic_buffer[i] = tot_offset; atomic_buffer[i] = tot_offset;
tot_offset += cur_offset; tot_offset += problem_sizes1[i * 3];
expert_offsets[i + 1] = tot_offset; expert_offsets[i + 1] = tot_offset;
tot_offset_round += (cur_offset + (128 - 1)) / 128 * 128; tot_offset_round += (problem_sizes1[i * 3] + (128 - 1)) / 128 * 128;
blockscale_offsets[i + 1] = tot_offset_round; blockscale_offsets[i + 1] = tot_offset_round;
} }
} }
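
The only arithmetic subtlety in `compute_expert_blockscale_offsets` above is that the running block-scale offset is rounded up to a multiple of 128 per expert with the ceil-to-multiple idiom `(x + (128 - 1)) / 128 * 128`. A tiny self-contained check of that idiom:

```cpp
// Round x up to the next multiple of m (integer arithmetic, m > 0).
constexpr int round_up(int x, int m) { return (x + (m - 1)) / m * m; }

static_assert(round_up(0, 128) == 0);
static_assert(round_up(1, 128) == 128);
static_assert(round_up(128, 128) == 128);
static_assert(round_up(129, 128) == 256);
```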
@ -116,41 +102,22 @@ void get_cutlass_moe_mm_data_caller(
torch::Tensor atomic_buffer = torch::zeros(num_experts, options_int32); torch::Tensor atomic_buffer = torch::zeros(num_experts, options_int32);
int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel()); int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel());
compute_problem_sizes<<<num_experts, num_threads, 0, stream>>>(
// Swap-AB should be disabled for FP4 path static_cast<const int32_t*>(topk_ids.data_ptr()),
bool may_swap_ab = (!blockscale_offsets.has_value()) && static_cast<int32_t*>(problem_sizes1.data_ptr()),
(topk_ids.numel() <= SWAP_AB_THRESHOLD); static_cast<int32_t*>(problem_sizes2.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), topk_ids.numel(), n, k);
if (may_swap_ab) {
compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()),
static_cast<int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(problem_sizes2.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), topk_ids.numel(), n,
k);
} else {
compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()),
static_cast<int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(problem_sizes2.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), topk_ids.numel(), n,
k);
}
if (blockscale_offsets.has_value()) { if (blockscale_offsets.has_value()) {
// fp4 path
compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>( compute_expert_blockscale_offsets<<<1, 1, 0, stream>>>(
static_cast<const int32_t*>(problem_sizes1.data_ptr()), static_cast<const int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(expert_offsets.data_ptr()), static_cast<int32_t*>(expert_offsets.data_ptr()),
static_cast<int32_t*>(blockscale_offsets.value().data_ptr()), static_cast<int32_t*>(blockscale_offsets.value().data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts, static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts);
may_swap_ab);
} else { } else {
compute_expert_offsets<<<1, 1, 0, stream>>>( compute_expert_offsets<<<1, 1, 0, stream>>>(
static_cast<const int32_t*>(problem_sizes1.data_ptr()), static_cast<const int32_t*>(problem_sizes1.data_ptr()),
static_cast<int32_t*>(expert_offsets.data_ptr()), static_cast<int32_t*>(expert_offsets.data_ptr()),
static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts, static_cast<int32_t*>(atomic_buffer.data_ptr()), num_experts);
may_swap_ab);
} }
compute_arg_sorts<<<num_experts, num_threads, 0, stream>>>( compute_arg_sorts<<<num_experts, num_threads, 0, stream>>>(
static_cast<const int32_t*>(topk_ids.data_ptr()), static_cast<const int32_t*>(topk_ids.data_ptr()),
@ -193,4 +160,4 @@ void get_cutlass_pplx_moe_mm_data_caller(torch::Tensor& expert_offsets,
static_cast<int32_t*>(problem_sizes2.data_ptr()), static_cast<int32_t*>(problem_sizes2.data_ptr()),
static_cast<const int32_t*>(expert_num_tokens.data_ptr()), padded_m, n, static_cast<const int32_t*>(expert_num_tokens.data_ptr()), padded_m, n,
k); k);
} }

View File

@ -41,16 +41,6 @@ void cutlass_moe_mm_sm90(
#endif #endif
#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100
void cutlass_moe_mm_sm100(
torch::Tensor& out_tensors, torch::Tensor const& a_tensors,
torch::Tensor const& b_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& expert_offsets,
torch::Tensor const& problem_sizes, torch::Tensor const& a_strides,
torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch);
#endif
#if defined ENABLE_SCALED_MM_SM120 && ENABLE_SCALED_MM_SM120 #if defined ENABLE_SCALED_MM_SM120 && ENABLE_SCALED_MM_SM120
void cutlass_scaled_mm_sm120(torch::Tensor& c, torch::Tensor const& a, void cutlass_scaled_mm_sm120(torch::Tensor& c, torch::Tensor const& a,
torch::Tensor const& b, torch::Tensor const& b,
@ -140,10 +130,10 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
// and at least SM90 (Hopper) // and at least SM90 (Hopper)
#if defined CUDA_VERSION #if defined CUDA_VERSION
if (cuda_device_capability >= 100) { if (cuda_device_capability >= 90 && cuda_device_capability < 100) {
return CUDA_VERSION >= 12080;
} else if (cuda_device_capability >= 90) {
return CUDA_VERSION >= 12000; return CUDA_VERSION >= 12000;
} else if (cuda_device_capability >= 100) {
return CUDA_VERSION >= 12080;
} }
#endif #endif
@ -151,14 +141,11 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
} }
bool cutlass_group_gemm_supported(int64_t cuda_device_capability) { bool cutlass_group_gemm_supported(int64_t cuda_device_capability) {
// CUTLASS grouped FP8 kernels need at least CUDA 12.3 and SM90 (Hopper) // CUTLASS grouped FP8 kernels need at least CUDA 12.3
// or CUDA 12.8 and SM100 (Blackwell) // and SM90 (Hopper)
#if defined CUDA_VERSION #if defined CUDA_VERSION
if (cuda_device_capability >= 100) { if (cuda_device_capability == 90) {
return CUDA_VERSION >= 12080;
}
if (cuda_device_capability >= 90) {
return CUDA_VERSION >= 12030; return CUDA_VERSION >= 12030;
} }
#endif #endif
@ -247,26 +234,16 @@ void cutlass_moe_mm(
torch::Tensor const& b_strides, torch::Tensor const& c_strides, torch::Tensor const& b_strides, torch::Tensor const& c_strides,
bool per_act_token, bool per_out_ch) { bool per_act_token, bool per_out_ch) {
int32_t version_num = get_sm_version_num(); int32_t version_num = get_sm_version_num();
#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100
if (version_num >= 100) {
cutlass_moe_mm_sm100(out_tensors, a_tensors, b_tensors, a_scales, b_scales,
expert_offsets, problem_sizes, a_strides, b_strides,
c_strides, per_act_token, per_out_ch);
return;
}
#endif
#if defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90 #if defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90
if (version_num >= 90) { cutlass_moe_mm_sm90(out_tensors, a_tensors, b_tensors, a_scales, b_scales,
cutlass_moe_mm_sm90(out_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes, a_strides, b_strides,
expert_offsets, problem_sizes, a_strides, b_strides, c_strides, per_act_token, per_out_ch);
c_strides, per_act_token, per_out_ch); return;
return;
}
#endif #endif
TORCH_CHECK_NOT_IMPLEMENTED( TORCH_CHECK_NOT_IMPLEMENTED(
false, false,
"No compiled cutlass_scaled_mm for CUDA device capability: ", version_num, "No compiled cutlass_scaled_mm for CUDA device capability: ", version_num,
". Required capability: 90 or 100"); ". Required capability: 90");
} }
void get_cutlass_moe_mm_data( void get_cutlass_moe_mm_data(

View File

@ -88,8 +88,6 @@ void static_scaled_fp8_quant(torch::Tensor& out, // [..., d]
torch::Tensor const& input, // [..., d] torch::Tensor const& input, // [..., d]
torch::Tensor const& scale) // [1] torch::Tensor const& scale) // [1]
{ {
TORCH_CHECK(input.is_contiguous());
TORCH_CHECK(out.is_contiguous());
int const block_size = 256; int const block_size = 256;
int const num_tokens = input.numel() / input.size(-1); int const num_tokens = input.numel() / input.size(-1);
int const num_elems = input.numel(); int const num_elems = input.numel();
@ -113,8 +111,6 @@ void dynamic_scaled_fp8_quant(torch::Tensor& out, // [..., d]
torch::Tensor const& input, // [..., d] torch::Tensor const& input, // [..., d]
torch::Tensor& scale) // [1] torch::Tensor& scale) // [1]
{ {
TORCH_CHECK(input.is_contiguous());
TORCH_CHECK(out.is_contiguous());
int const block_size = 256; int const block_size = 256;
int const num_tokens = input.numel() / input.size(-1); int const num_tokens = input.numel() / input.size(-1);
int const num_elems = input.numel(); int const num_elems = input.numel();
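
Context for the `TORCH_CHECK(...is_contiguous())` guards that differ between the two sides of this file: the FP8 quant kernels index `input` and `out` as flat arrays of `numel()` elements, which is only safe when both tensors are contiguous. A small libtorch illustration (written for this note, not taken from the change) of how an ordinary view violates that assumption:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor a = torch::rand({4, 8});
  torch::Tensor t = a.t();                  // transposed view, same storage
  std::cout << a.is_contiguous() << "\n";   // 1
  std::cout << t.is_contiguous() << "\n";   // 0: flat indexing would scramble it
  torch::Tensor ok = t.contiguous();        // materialize a contiguous copy
  std::cout << ok.is_contiguous() << "\n";  // 1
  return 0;
}
```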

View File

@ -1,217 +0,0 @@
#include <ATen/cuda/CUDAContext.h>
#include <c10/util/Float8_e4m3fn.h>
#include "../per_token_group_quant_8bit.h"
#include <cmath>
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <torch/all.h>
#include "../vectorization.cuh"
#include "../vectorization_utils.cuh"
#include "../../dispatch_utils.h"
__device__ __forceinline__ float GroupReduceMax(float val, const int tid) {
unsigned mask = 0xffff;
val = fmaxf(val, __shfl_xor_sync(mask, val, 8));
val = fmaxf(val, __shfl_xor_sync(mask, val, 4));
val = fmaxf(val, __shfl_xor_sync(mask, val, 2));
val = fmaxf(val, __shfl_xor_sync(mask, val, 1));
return val;
}
template <typename T, typename DST_DTYPE, bool IS_COLUMN_MAJOR = false,
bool SCALE_UE8M0 = false, typename scale_packed_t = float>
__global__ void per_token_group_quant_8bit_kernel(
const T* __restrict__ input, void* __restrict__ output_q,
scale_packed_t* __restrict__ output_s, const int group_size,
const int num_groups, const int groups_per_block, const float eps,
const float min_8bit, const float max_8bit, const int scale_num_rows = 0,
const int scale_stride = 0) {
const int threads_per_group = 16;
const int64_t local_group_id = threadIdx.x / threads_per_group;
const int lane_id = threadIdx.x % threads_per_group;
const int64_t block_group_id = blockIdx.x * groups_per_block;
const int64_t global_group_id = block_group_id + local_group_id;
const int64_t block_group_offset = global_group_id * group_size;
float local_absmax = eps;
using scale_element_t = float;
static_assert(sizeof(scale_packed_t) % sizeof(scale_element_t) == 0);
const T* group_input = input + block_group_offset;
DST_DTYPE* group_output =
static_cast<DST_DTYPE*>(output_q) + block_group_offset;
scale_element_t* scale_output;
if constexpr (IS_COLUMN_MAJOR) {
const int num_elems_per_pack =
static_cast<int>(sizeof(scale_packed_t) / sizeof(scale_element_t));
const int scale_num_rows_element = scale_num_rows * num_elems_per_pack;
const int row_idx = global_group_id / scale_num_rows_element;
const int col_idx_raw = global_group_id % scale_num_rows_element;
const int col_idx = col_idx_raw / num_elems_per_pack;
const int pack_idx = col_idx_raw % num_elems_per_pack;
scale_output = reinterpret_cast<scale_element_t*>(output_s) +
(col_idx * scale_stride * num_elems_per_pack +
row_idx * num_elems_per_pack + pack_idx);
} else {
scale_output = output_s + global_group_id;
}
// shared memory to cache each group's data to avoid double DRAM reads.
extern __shared__ __align__(16) char smem_raw[];
T* smem = reinterpret_cast<T*>(smem_raw);
T* smem_group = smem + local_group_id * group_size;
constexpr int vec_size = 16 / sizeof(T);
using vec_t = vllm::vec_n_t<T, vec_size>;
// copy global -> shared & compute absmax
auto scalar_op_cache = [&] __device__(T & dst, const T& src) {
float abs_v = fabsf(static_cast<float>(src));
local_absmax = fmaxf(local_absmax, abs_v);
dst = src;
};
vllm::vectorize_with_alignment<vec_size>(
group_input, // in
smem_group, // out (shared)
group_size, // elements per group
lane_id, // thread id
threads_per_group, // stride in group
scalar_op_cache); // scalar handler
local_absmax = GroupReduceMax(local_absmax, lane_id);
float y_s = local_absmax / max_8bit;
if constexpr (SCALE_UE8M0) {
y_s = exp2f(ceilf(log2f(fmaxf(fabsf(y_s), 1e-10f))));
}
scale_element_t y_s_quant = y_s;
if (lane_id == 0) {
*scale_output = y_s_quant;
}
__syncthreads();
// quantize shared -> global 8-bit
auto scalar_op_quant = [&] __device__(DST_DTYPE & dst, const T& src) {
float q = fminf(fmaxf(static_cast<float>(src) / y_s, min_8bit), max_8bit);
dst = DST_DTYPE(q);
};
vllm::vectorize_with_alignment<vec_size>(
smem_group, // in (shared)
group_output, // out (global quant tensor)
group_size, // elements
lane_id, // tid
threads_per_group, // stride
scalar_op_quant); // scalar handler
}
void per_token_group_quant_8bit(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double min_8bit, double max_8bit,
bool scale_ue8m0) {
TORCH_CHECK(input.is_contiguous());
TORCH_CHECK(output_q.is_contiguous());
const int num_groups = input.numel() / group_size;
TORCH_CHECK(input.numel() % group_size == 0);
TORCH_CHECK(output_s.dim() == 2);
cudaStream_t stream = at::cuda::getCurrentCUDAStream();
constexpr int THREADS_PER_GROUP = 16;
int groups_per_block = 1;
if (num_groups % 16 == 0) {
groups_per_block = 16;
} else if (num_groups % 8 == 0) {
groups_per_block = 8;
} else if (num_groups % 4 == 0) {
groups_per_block = 4;
} else if (num_groups % 2 == 0) {
groups_per_block = 2;
}
auto dst_type = output_q.scalar_type();
const int num_blocks = num_groups / groups_per_block;
const int num_threads = groups_per_block * THREADS_PER_GROUP;
const bool is_column_major = output_s.stride(0) < output_s.stride(1);
const int scale_num_rows = output_s.size(1);
const int scale_stride = output_s.stride(1);
#define LAUNCH_KERNEL(T, DST_DTYPE) \
do { \
dim3 grid(num_blocks); \
dim3 block(num_threads); \
size_t smem_bytes = \
static_cast<size_t>(groups_per_block) * group_size * sizeof(T); \
if (is_column_major) { \
if (scale_ue8m0) { \
per_token_group_quant_8bit_kernel<T, DST_DTYPE, true, true> \
<<<grid, block, smem_bytes, stream>>>( \
static_cast<T*>(input.data_ptr()), output_q.data_ptr(), \
static_cast<float*>(output_s.data_ptr()), group_size, \
num_groups, groups_per_block, (float)eps, (float)min_8bit, \
(float)max_8bit, scale_num_rows, scale_stride); \
} else { \
per_token_group_quant_8bit_kernel<T, DST_DTYPE, true, false> \
<<<grid, block, smem_bytes, stream>>>( \
static_cast<T*>(input.data_ptr()), output_q.data_ptr(), \
static_cast<float*>(output_s.data_ptr()), group_size, \
num_groups, groups_per_block, (float)eps, (float)min_8bit, \
(float)max_8bit, scale_num_rows, scale_stride); \
} \
} else { \
if (scale_ue8m0) { \
per_token_group_quant_8bit_kernel<T, DST_DTYPE, false, true> \
<<<grid, block, smem_bytes, stream>>>( \
static_cast<T*>(input.data_ptr()), output_q.data_ptr(), \
static_cast<float*>(output_s.data_ptr()), group_size, \
num_groups, groups_per_block, (float)eps, (float)min_8bit, \
(float)max_8bit); \
} else { \
per_token_group_quant_8bit_kernel<T, DST_DTYPE, false, false> \
<<<grid, block, smem_bytes, stream>>>( \
static_cast<T*>(input.data_ptr()), output_q.data_ptr(), \
static_cast<float*>(output_s.data_ptr()), group_size, \
num_groups, groups_per_block, (float)eps, (float)min_8bit, \
(float)max_8bit); \
} \
} \
} while (0)
VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(), "per_token_group_quant_8bit", ([&] {
if (dst_type == at::ScalarType::Float8_e4m3fn) {
LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn);
} else if (dst_type == at::ScalarType::Char) {
LAUNCH_KERNEL(scalar_t, int8_t);
}
}));
#undef LAUNCH_KERNEL
}
void per_token_group_quant_fp8(const torch::Tensor& input,
torch::Tensor& output_q, torch::Tensor& output_s,
int64_t group_size, double eps, double fp8_min,
double fp8_max, bool scale_ue8m0) {
per_token_group_quant_8bit(input, output_q, output_s, group_size, eps,
fp8_min, fp8_max, scale_ue8m0);
}
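
To summarize what the deleted file above implements, here is a rough host-side reference written for this note (not repository code): one scale per `group_size` contiguous elements, taken from the group's absolute maximum floored at `eps`, optionally rounded up to a power of two when `scale_ue8m0` is set, then a clamp-and-cast of every element. INT8 output is used below for brevity; the FP8 path only changes the destination type and limits.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Host-side reference for per-token-group 8-bit quantization (row-major scales).
void per_token_group_quant_ref(const std::vector<float>& x, int group_size,
                               float eps, float min_8bit, float max_8bit,
                               bool scale_ue8m0, std::vector<float>& scales,
                               std::vector<int8_t>& q) {
  const int num_groups = static_cast<int>(x.size()) / group_size;
  scales.resize(num_groups);
  q.resize(x.size());
  for (int g = 0; g < num_groups; ++g) {
    float absmax = eps;  // matches the kernel's eps-initialized running max
    for (int i = 0; i < group_size; ++i)
      absmax = std::max(absmax, std::fabs(x[g * group_size + i]));
    float s = absmax / max_8bit;
    if (scale_ue8m0)  // UE8M0: round the scale up to a power of two
      s = std::exp2(std::ceil(std::log2(std::max(std::fabs(s), 1e-10f))));
    scales[g] = s;
    for (int i = 0; i < group_size; ++i) {
      float v = x[g * group_size + i] / s;
      v = std::min(std::max(v, min_8bit), max_8bit);   // clamp to 8-bit range
      q[g * group_size + i] = static_cast<int8_t>(v);  // cast, as the kernel does
    }
  }
}
```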

View File

@ -4,7 +4,7 @@
#include <torch/all.h> #include <torch/all.h>
#include <c10/cuda/CUDAGuard.h> #include <c10/cuda/CUDAGuard.h>
#include "../../cuda_compat.h" #include "cuda_compat.h"
#include "dispatch_utils.h" #include "dispatch_utils.h"
#include "ggml-common.h" #include "ggml-common.h"

View File

@ -187,12 +187,8 @@ struct PrepackedLayoutBTemplate {
CUTE_HOST_DEVICE static constexpr auto TVbNbKL_to_offset_copy( CUTE_HOST_DEVICE static constexpr auto TVbNbKL_to_offset_copy(
Shape_NKL shape_mkl) { Shape_NKL shape_mkl) {
auto layout = TVbNbKL_to_offset(shape_mkl); auto layout = TVbNbKL_to_offset(shape_mkl);
// for 4-bit elements, having >= 64 values per column return make_layout(coalesce(get<0>(layout)), get<1>(layout),
// allows TMA to load full 32-byte sectors get<2>(layout));
auto inner_layout =
make_layout(make_shape(_256{}, size<0>(layout) / _256{}));
return make_layout(inner_layout, get<1>(layout), get<2>(layout));
} }
// ((BlockN, BlockK), (BlocksN, BlocksK), L) -> (storage_idx) // ((BlockN, BlockK), (BlocksN, BlocksK), L) -> (storage_idx)

View File

@ -1,10 +0,0 @@
#pragma once
#include <torch/all.h>
// TODO(wentao): refactor the folder to 8bit, then includes fp8 and int8 folders
// 8-bit per-token-group quantization helper used by both FP8 and INT8
void per_token_group_quant_8bit(const torch::Tensor& input,
torch::Tensor& output_q,
torch::Tensor& output_s, int64_t group_size,
double eps, double min_8bit, double max_8bit,
bool scale_ue8m0 = false);

View File

@ -19,7 +19,7 @@
#include <c10/cuda/CUDAGuard.h> #include <c10/cuda/CUDAGuard.h>
#include <hip/hip_fp8.h> #include <hip/hip_fp8.h>
#include <hip/hip_bf16.h> #include <hip/hip_bf16.h>
#include "../cuda_compat.h" #include "cuda_compat.h"
#include <algorithm> #include <algorithm>
#include "../attention/dtype_fp8.cuh" #include "../attention/dtype_fp8.cuh"

View File

@ -9,7 +9,7 @@
#include <stdexcept> #include <stdexcept>
#include <algorithm> #include <algorithm>
#include "../cuda_compat.h" #include "cuda_compat.h"
#include "dispatch_utils.h" #include "dispatch_utils.h"
#include "quantization/fp8/common.cuh" #include "quantization/fp8/common.cuh"

View File

@ -20,17 +20,13 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
// vLLM custom ops // vLLM custom ops
// //
// The default behavior in PyTorch 2.6 was changed to "requires_contiguous", // The default behavior in PyTorch 2.6 is "requires_contiguous", so we need
// so we need
// to override this for many GEMMs with the following tag. Otherwise, // to override this for many GEMMs with the following tag. Otherwise,
// torch.compile will force all input tensors to be contiguous(), which // torch.compile will force all input tensors to be contiguous(), which
// will break many custom ops that require column-major weight matrices. // will break many custom ops that require column-major weight matrices.
// This was a bug and PyTorch 2.7 has since fixed this. // TODO: remove this for PyTorch 2.8, when the default is planned to switch
#if TORCH_VERSION_MAJOR == 2 && TORCH_VERSION_MINOR == 6 // to match exact eager-mode strides.
#define stride_tag at::Tag::needs_fixed_stride_order at::Tag stride_tag = at::Tag::needs_fixed_stride_order;
#else
#define stride_tag
#endif
ops.def("weak_ref_tensor(Tensor input) -> Tensor"); ops.def("weak_ref_tensor(Tensor input) -> Tensor");
ops.impl("weak_ref_tensor", torch::kCUDA, &weak_ref_tensor); ops.impl("weak_ref_tensor", torch::kCUDA, &weak_ref_tensor);
@ -525,14 +521,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
" Tensor page_table, Tensor workspace, float " " Tensor page_table, Tensor workspace, float "
"scale," "scale,"
" int num_kv_splits) -> ()"); " int num_kv_splits) -> ()");
// conditionally compiled so impl in source file ops.impl("sm100_cutlass_mla_decode", torch::kCUDA, &sm100_cutlass_mla_decode);
// SM100 CUTLASS MLA workspace // SM100 CUTLASS MLA workspace
ops.def( ops.def(
"sm100_cutlass_mla_get_workspace_size(int max_seq_len, int num_batches," "sm100_cutlass_mla_get_workspace_size(int max_seq_len, int num_batches,"
" int sm_count, int num_kv_splits) " " int sm_count, int num_kv_splits) "
"-> int"); "-> int");
// conditionally compiled so impl in source file ops.impl("sm100_cutlass_mla_get_workspace_size",
&sm100_cutlass_mla_get_workspace_size);
// Compute NVFP4 block quantized tensor. // Compute NVFP4 block quantized tensor.
ops.def( ops.def(
@ -615,23 +612,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("selective_scan_fwd", torch::kCUDA, &selective_scan_fwd); ops.impl("selective_scan_fwd", torch::kCUDA, &selective_scan_fwd);
#ifndef USE_ROCM #ifndef USE_ROCM
// Compute per-token-group FP8 quantized tensor and scaling factor.
ops.def(
"per_token_group_fp8_quant(Tensor input, Tensor! output_q, Tensor! "
"output_s, "
"int group_size, float eps, float fp8_min, float fp8_max, bool "
"scale_ue8m0) -> ()");
ops.impl("per_token_group_fp8_quant", torch::kCUDA,
&per_token_group_quant_fp8);
// Compute per-token-group INT8 quantized tensor and scaling factor.
ops.def(
"per_token_group_quant_int8(Tensor input, Tensor! output_q, Tensor! "
"output_s, int group_size, float eps, float int8_min, float int8_max) -> "
"()");
ops.impl("per_token_group_quant_int8", torch::kCUDA,
&per_token_group_quant_int8);
// reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel // reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel
ops.def( ops.def(
"rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, " "rearrange_kn_weight_as_n32k16_order(Tensor b_qweight, Tensor b_scales, "

View File

@ -63,7 +63,7 @@ ARG PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL=https://download.pytorch.org/whl/nightly
ARG PIP_KEYRING_PROVIDER=disabled ARG PIP_KEYRING_PROVIDER=disabled
ARG UV_KEYRING_PROVIDER=${PIP_KEYRING_PROVIDER} ARG UV_KEYRING_PROVIDER=${PIP_KEYRING_PROVIDER}
# Flag enables built-in KV-connector dependency libs into docker images # Flag enables build-in KV-connector dependency libs into docker images
ARG INSTALL_KV_CONNECTORS=false ARG INSTALL_KV_CONNECTORS=false
#################### BASE BUILD IMAGE #################### #################### BASE BUILD IMAGE ####################
@ -207,19 +207,6 @@ ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2 ARG SCCACHE_REGION_NAME=us-west-2
ARG SCCACHE_S3_NO_CREDENTIALS=0 ARG SCCACHE_S3_NO_CREDENTIALS=0
# Flag to control whether to use pre-built vLLM wheels
ARG VLLM_USE_PRECOMPILED
# TODO: in setup.py VLLM_USE_PRECOMPILED is sensitive to truthiness, it will take =0 as "true", this should be fixed
ENV VLLM_USE_PRECOMPILED=""
RUN if [ "${VLLM_USE_PRECOMPILED}" = "1" ]; then \
export VLLM_USE_PRECOMPILED=1 && \
echo "Using precompiled wheels"; \
else \
unset VLLM_USE_PRECOMPILED && \
echo "Leaving VLLM_USE_PRECOMPILED unset to build wheels from source"; \
fi
# if USE_SCCACHE is set, use sccache to speed up compilation # if USE_SCCACHE is set, use sccache to speed up compilation
RUN --mount=type=cache,target=/root/.cache/uv \ RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=.git,target=.git \ --mount=type=bind,source=.git,target=.git \
@ -265,7 +252,7 @@ RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
#################### EXTENSION Build IMAGE #################### #################### EXTENSION Build IMAGE ####################
#################### DEV IMAGE #################### #################### DEV IMAGE ####################
FROM base AS dev FROM base as dev
ARG PIP_INDEX_URL UV_INDEX_URL ARG PIP_INDEX_URL UV_INDEX_URL
ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
@ -276,6 +263,10 @@ ARG PYTORCH_CUDA_INDEX_BASE_URL
ENV UV_HTTP_TIMEOUT=500 ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match" ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Workaround for #17068
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
COPY requirements/lint.txt requirements/lint.txt COPY requirements/lint.txt requirements/lint.txt
COPY requirements/test.txt requirements/test.txt COPY requirements/test.txt requirements/test.txt
COPY requirements/dev.txt requirements/dev.txt COPY requirements/dev.txt requirements/dev.txt
@ -384,33 +375,48 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
# -rw-rw-r-- 1 mgoin mgoin 205M Jun 9 18:03 flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl # -rw-rw-r-- 1 mgoin mgoin 205M Jun 9 18:03 flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl
# $ # upload the wheel to a public location, e.g. https://wheels.vllm.ai/flashinfer/v0.2.6.post1/flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl # $ # upload the wheel to a public location, e.g. https://wheels.vllm.ai/flashinfer/v0.2.6.post1/flashinfer_python-0.2.6.post1-cp39-abi3-linux_x86_64.whl
# Install FlashInfer from source
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
ARG FLASHINFER_GIT_REF="v0.2.9rc1"
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
    . /etc/environment
    git clone --depth 1 --recursive --shallow-submodules \
        --branch ${FLASHINFER_GIT_REF} \
        ${FLASHINFER_GIT_REPO} flashinfer
    # Exclude CUDA arches for older versions (11.x and 12.0-12.7)
    # TODO: Update this to allow setting TORCH_CUDA_ARCH_LIST as a build arg.
    if [[ "${CUDA_VERSION}" == 11.* ]]; then
        FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9"
    elif [[ "${CUDA_VERSION}" == 12.[0-7]* ]]; then
        FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a"
    else
        # CUDA 12.8+ supports 10.0a and 12.0
        FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 12.0"
    fi
    echo "🏗️ Building FlashInfer for arches: ${FI_TORCH_CUDA_ARCH_LIST}"
    # Needed to build AOT kernels
    pushd flashinfer
        TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
            python3 -m flashinfer.aot
        TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
            uv pip install --system --no-build-isolation .
    popd
    rm -rf flashinfer
BASH

# Allow specifying a version, Git revision or local .whl file
ARG FLASHINFER_CUDA128_INDEX_URL="https://download.pytorch.org/whl/cu128/flashinfer"
ARG FLASHINFER_CUDA128_WHEEL="flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl"
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
ARG FLASHINFER_GIT_REF="v0.2.8rc1"
# Flag to control whether to use pre-built FlashInfer wheels (set to false to force build from source)
# TODO: Currently disabled because the pre-built wheels are not available for FLASHINFER_GIT_REF
ARG USE_FLASHINFER_PREBUILT_WHEEL=false
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
    . /etc/environment
    if [ "$TARGETPLATFORM" != "linux/arm64" ]; then
        # FlashInfer already has a wheel for PyTorch 2.7.0 and CUDA 12.8. This is enough for CI use
        if [[ "$CUDA_VERSION" == 12.8* ]] && [[ "$USE_FLASHINFER_PREBUILT_WHEEL" == "true" ]]; then
            uv pip install --system ${FLASHINFER_CUDA128_INDEX_URL}/${FLASHINFER_CUDA128_WHEEL}
        else
            # Exclude CUDA arches for older versions (11.x and 12.0-12.7)
            # TODO: Update this to allow setting TORCH_CUDA_ARCH_LIST as a build arg.
            if [[ "${CUDA_VERSION}" == 11.* ]]; then
                FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9"
            elif [[ "${CUDA_VERSION}" == 12.[0-7]* ]]; then
                FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a"
            else
                # CUDA 12.8+ supports 10.0a and 12.0
                FI_TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 12.0"
            fi
            echo "🏗️ Building FlashInfer for arches: ${FI_TORCH_CUDA_ARCH_LIST}"

            git clone --depth 1 --recursive --shallow-submodules \
                --branch ${FLASHINFER_GIT_REF} \
                ${FLASHINFER_GIT_REPO} flashinfer

            # Needed to build AOT kernels
            pushd flashinfer
                TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
                    python3 -m flashinfer.aot
                TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \
                    uv pip install --system --no-build-isolation .
            popd
            rm -rf flashinfer
        fi \
    fi
BASH
COPY examples examples COPY examples examples
COPY benchmarks benchmarks COPY benchmarks benchmarks
@ -448,6 +454,10 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
ENV UV_HTTP_TIMEOUT=500 ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match" ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Workaround for #17068
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
# install development dependencies (for testing) # install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \ RUN --mount=type=cache,target=/root/.cache/uv \
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \ CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
@ -498,11 +508,10 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/kv_connectors.txt; \ uv pip install --system -r requirements/kv_connectors.txt; \
fi; \ fi; \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \ if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
BITSANDBYTES_VERSION="0.42.0"; \ uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
else \ else \
BITSANDBYTES_VERSION="0.46.1"; \ uv pip install --system accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.46.1' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
fi; \ fi
uv pip install --system accelerate hf_transfer modelscope "bitsandbytes>=${BITSANDBYTES_VERSION}" 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]
ENV VLLM_USAGE_SOURCE production-docker-image ENV VLLM_USAGE_SOURCE production-docker-image

62
docker/Dockerfile.arm Normal file
View File

@ -0,0 +1,62 @@
# This vLLM Dockerfile is used to construct an image that can build and run vLLM on ARM CPU platform.
FROM ubuntu:22.04 AS cpu-test-arm
ENV CCACHE_DIR=/root/.cache/ccache
ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache
RUN --mount=type=cache,target=/var/cache/apt \
apt-get update -y \
&& apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects.
RUN --mount=type=cache,target=/root/.cache/pip \
pip install py-cpuinfo # Use this to gather CPU info and optimize based on ARM Neoverse cores
# Set LD_PRELOAD for tcmalloc on ARM
ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
RUN echo 'ulimit -c 0' >> ~/.bashrc
WORKDIR /workspace
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements/build.txt,target=requirements/build.txt \
pip install --upgrade pip && \
pip install -r requirements/build.txt
FROM cpu-test-arm AS build
WORKDIR /workspace/vllm
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements/common.txt,target=requirements/common.txt \
--mount=type=bind,src=requirements/cpu.txt,target=requirements/cpu.txt \
pip install -v -r requirements/cpu.txt
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Disabling AVX512 specific optimizations for ARM
ARG VLLM_CPU_DISABLE_AVX512="true"
ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/ccache \
--mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \
pip install dist/*.whl && \
rm -rf dist
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,11 +1,4 @@
# This vLLM Dockerfile is used to build images that can run vLLM on both x86_64 and arm64 CPU platforms. # This vLLM Dockerfile is used to construct image that can build and run vLLM on x86 CPU platform.
#
# Supported platforms:
# - linux/amd64 (x86_64)
# - linux/arm64 (aarch64)
#
# Use the `--platform` option with `docker buildx build` to specify the target architecture, e.g.:
# docker buildx build --platform=linux/arm64 -f docker/Dockerfile.cpu .
# #
# Build targets: # Build targets:
# vllm-openai (default): used for serving deployment # vllm-openai (default): used for serving deployment
@ -60,20 +53,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --upgrade pip && \ uv pip install --upgrade pip && \
uv pip install -r requirements/cpu.txt uv pip install -r requirements/cpu.txt
ARG TARGETARCH ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so:$LD_PRELOAD"
ENV TARGETARCH=${TARGETARCH}
RUN if [ "$TARGETARCH" = "arm64" ]; then \
PRELOAD_PATH="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"; \
else \
PRELOAD_PATH="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so"; \
fi && \
echo "export LD_PRELOAD=$PRELOAD_PATH" >> ~/.bashrc
# Ensure that the LD_PRELOAD environment variable for export is in effect.
SHELL ["/bin/bash", "-c"]
ENV LD_PRELOAD=${LD_PRELOAD}
RUN echo 'ulimit -c 0' >> ~/.bashrc RUN echo 'ulimit -c 0' >> ~/.bashrc
@ -115,7 +95,7 @@ WORKDIR /workspace/vllm
RUN --mount=type=bind,src=requirements/test.in,target=requirements/test.in \ RUN --mount=type=bind,src=requirements/test.in,target=requirements/test.in \
cp requirements/test.in requirements/cpu-test.in && \ cp requirements/test.in requirements/cpu-test.in && \
sed -i '/mamba_ssm/d' requirements/cpu-test.in && \ sed -i '/mamba_ssm/d' requirements/cpu-test.in && \
sed -i 's/^torch==.*/torch==2.6.0/g' requirements/cpu-test.in && \ sed -i 's/torch==.*/torch==2.6.0/g' requirements/cpu-test.in && \
sed -i 's/torchaudio.*/torchaudio/g' requirements/cpu-test.in && \ sed -i 's/torchaudio.*/torchaudio/g' requirements/cpu-test.in && \
sed -i 's/torchvision.*/torchvision/g' requirements/cpu-test.in && \ sed -i 's/torchvision.*/torchvision/g' requirements/cpu-test.in && \
uv pip compile requirements/cpu-test.in -o requirements/cpu-test.txt --index-strategy unsafe-best-match --torch-backend cpu uv pip compile requirements/cpu-test.in -o requirements/cpu-test.txt --index-strategy unsafe-best-match --torch-backend cpu

21
docker/Dockerfile.hpu Normal file
View File

@ -0,0 +1,21 @@
FROM vault.habana.ai/gaudi-docker/1.20.1/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
COPY ./ /workspace/vllm
WORKDIR /workspace/vllm
RUN pip install -v -r requirements/hpu.txt
ENV no_proxy=localhost,127.0.0.1
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
RUN VLLM_TARGET_DEVICE=hpu python3 setup.py install
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,5 +1,5 @@
ARG NIGHTLY_DATE="20250724" ARG NIGHTLY_DATE="20250124"
ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE" ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
FROM $BASE_IMAGE FROM $BASE_IMAGE
WORKDIR /workspace/vllm WORKDIR /workspace/vllm

View File

@ -47,7 +47,7 @@ FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server # install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer pytest pytest_asyncio lm_eval[api] modelscope pip install accelerate hf_transfer pytest 'modelscope!=1.15.0'
ENV VLLM_USAGE_SOURCE production-docker-image \ ENV VLLM_USAGE_SOURCE production-docker-image \
TRITON_XPU_PROFILE 1 TRITON_XPU_PROFILE 1

View File

@ -8,12 +8,14 @@ API documentation for vLLM's configuration classes.
- [vllm.config.ModelConfig][] - [vllm.config.ModelConfig][]
- [vllm.config.CacheConfig][] - [vllm.config.CacheConfig][]
- [vllm.config.TokenizerPoolConfig][]
- [vllm.config.LoadConfig][] - [vllm.config.LoadConfig][]
- [vllm.config.ParallelConfig][] - [vllm.config.ParallelConfig][]
- [vllm.config.SchedulerConfig][] - [vllm.config.SchedulerConfig][]
- [vllm.config.DeviceConfig][] - [vllm.config.DeviceConfig][]
- [vllm.config.SpeculativeConfig][] - [vllm.config.SpeculativeConfig][]
- [vllm.config.LoRAConfig][] - [vllm.config.LoRAConfig][]
- [vllm.config.PromptAdapterConfig][]
- [vllm.config.MultiModalConfig][] - [vllm.config.MultiModalConfig][]
- [vllm.config.PoolerConfig][] - [vllm.config.PoolerConfig][]
- [vllm.config.DecodingConfig][] - [vllm.config.DecodingConfig][]

Binary file not shown. Before: 57 KiB, After: 68 KiB

View File

@ -1,7 +1,3 @@
---
toc_depth: 4
---
# vLLM CLI Guide # vLLM CLI Guide
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
@ -46,10 +42,6 @@ Start the vLLM OpenAI Compatible API server.
vllm serve --help=page vllm serve --help=page
``` ```
### Options
--8<-- "docs/argparse/serve.md"
## chat ## chat
Generate chat completions via the running API server. Generate chat completions via the running API server.

View File

@ -14,7 +14,7 @@ For example:
```python ```python
from vllm import LLM from vllm import LLM
llm = LLM( model = LLM(
model="cerebras/Cerebras-GPT-1.3B", model="cerebras/Cerebras-GPT-1.3B",
hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2 hf_overrides={"architectures": ["GPT2LMHeadModel"]}, # GPT-2
) )

View File

@ -5,7 +5,7 @@ The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments ## CLI Arguments
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
To see the available options, take a look at the [CLI Reference](../cli/README.md#options)! To see the available CLI arguments, run `vllm serve --help`!
## Configuration file ## Configuration file

View File

@ -98,7 +98,7 @@ For additional features and advanced configurations, refer to the official [MkDo
??? console "Commands" ??? console "Commands"
```bash ```bash
pip install -r requirements/common.txt -r requirements/dev.txt pip install -r requirements/dev.txt
# Linting, formatting and static type checking # Linting, formatting and static type checking
pre-commit install --hook-type pre-commit --hook-type commit-msg pre-commit install --hook-type pre-commit --hook-type commit-msg

View File

@ -134,7 +134,7 @@ MAX_JOBS=16 uv pip install --system \
```bash ```bash
uv pip install --system \ uv pip install --system \
--no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.5" --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
``` ```
### causal-conv1d ### causal-conv1d

View File

@ -9,13 +9,10 @@ We support tracing vLLM workers using the `torch.profiler` module. You can enabl
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set. The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
When using `vllm bench serve`, you can enable profiling by passing the `--profile` flag. When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
Traces can be visualized using <https://ui.perfetto.dev/>. Traces can be visualized using <https://ui.perfetto.dev/>.
!!! tip
You can directly call the bench module without installing vLLM using `python -m vllm.entrypoints.cli.main bench`.
!!! tip !!! tip
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly. Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
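If you want to sanity-check what such a trace contains before wiring up the server, the following is a minimal, generic `torch.profiler` sketch (not vLLM-specific; the tensor sizes and output file name are illustrative). It produces the same Chrome/Perfetto trace format:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Generic torch.profiler sketch (not vLLM-specific): profile a small matmul
# and export a trace that can be opened at https://ui.perfetto.dev/.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
with profile(activities=activities) as prof:
    a = torch.randn(1024, 1024, device=device)
    (a @ a).sum().item()

prof.export_chrome_trace("trace.json")
```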
@ -38,10 +35,10 @@ VLLM_TORCH_PROFILER_DIR=./vllm_profile \
--model meta-llama/Meta-Llama-3-70B --model meta-llama/Meta-Llama-3-70B
``` ```
vllm bench command: benchmark_serving.py:
```bash ```bash
vllm bench serve \ python benchmarks/benchmark_serving.py \
--backend vllm \ --backend vllm \
--model meta-llama/Meta-Llama-3-70B \ --model meta-llama/Meta-Llama-3-70B \
--dataset-name sharegpt \ --dataset-name sharegpt \
@ -72,13 +69,13 @@ apt install nsight-systems-cli
For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference. For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference.
The following is an example using the `vllm bench latency` script: The following is an example using the `benchmarks/benchmark_latency.py` script:
```bash ```bash
nsys profile -o report.nsys-rep \ nsys profile -o report.nsys-rep \
--trace-fork-before-exec=true \ --trace-fork-before-exec=true \
--cuda-graph-trace=node \ --cuda-graph-trace=node \
vllm bench latency \ python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \ --model meta-llama/Llama-3.1-8B-Instruct \
--num-iters-warmup 5 \ --num-iters-warmup 5 \
--num-iters 1 \ --num-iters 1 \
@ -101,7 +98,7 @@ nsys profile -o report.nsys-rep \
vllm serve meta-llama/Llama-3.1-8B-Instruct vllm serve meta-llama/Llama-3.1-8B-Instruct
# client # client
vllm bench serve \ python benchmarks/benchmark_serving.py \
--backend vllm \ --backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \ --model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 1 \ --num-prompts 1 \
@ -135,7 +132,7 @@ You can view these profiles either as summaries in the CLI, using `nsys stats [p
... ...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum): ** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ---------------------------------------------------------------------------------------------------- -------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of… 46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of… 14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
@ -146,7 +143,7 @@ You can view these profiles either as summaries in the CLI, using `nsys stats [p
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern… 2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in… 1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0… 0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
... ...
``` ```
GUI example: GUI example:

View File

@ -3,15 +3,6 @@
[](){ #deployment-anyscale } [](){ #deployment-anyscale }
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray. [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
It hosts Ray clusters inside your own AWS, GCP, or Azure account, delivering the flexibility of open-source Ray without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like <gh-file:examples/online_serving/run_cluster.sh>.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, or managing observability stacks.
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm). When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).
## Production-ready vLLM on Anyscale quickstarts
- [Offline batch inference](https://console.anyscale.com/template-preview/llm_batch_inference?utm_source=vllm_docs)
- [Deploy vLLM services](https://console.anyscale.com/template-preview/llm_serving?utm_source=vllm_docs)
- [Curate a dataset](https://console.anyscale.com/template-preview/audio-dataset-curation-llm-judge?utm_source=vllm_docs)
- [Finetune an LLM](https://console.anyscale.com/template-preview/entity-recognition-with-llms?utm_source=vllm_docs)

View File

@ -1,42 +1,26 @@
# Open WebUI # Open WebUI
[Open WebUI](https://github.com/open-webui/open-webui) is an extensible, feature-rich,
and user-friendly self-hosted AI platform designed to operate entirely offline.
It supports various LLM runners like Ollama and OpenAI-compatible APIs,
with built-in RAG capabilities, making it a powerful AI deployment solution.

To get started with Open WebUI using vLLM, follow these steps:

1. Install the [Docker](https://docs.docker.com/engine/install/).

2. Start the vLLM server with a supported chat completion model:

    ```console
    vllm serve Qwen/Qwen3-0.6B-Chat
    ```

    !!! note
        When starting the vLLM server, be sure to specify the host and port using the `--host` and `--port` flags.
        For example:

        ```console
        python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000
        ```

3. Start the Open WebUI Docker container:

    ```console
    docker run -d \
    --name open-webui \
    -p 3000:8080 \
    -v open-webui:/app/backend/data \
    -e OPENAI_API_BASE_URL=http://0.0.0.0:8000/v1 \
    --restart always \
    ghcr.io/open-webui/open-webui:main
    ```

4. Open it in the browser: <http://open-webui-host:3000/>

    At the top of the page, you should see the model `Qwen/Qwen3-0.6B-Chat`.

    ![Web portal of model Qwen/Qwen3-0.6B-Chat](../../assets/deployment/open_webui.png)

1. Install the [Docker](https://docs.docker.com/engine/install/)

2. Start the vLLM server with the supported chat completion model, e.g.

    ```bash
    vllm serve qwen/Qwen1.5-0.5B-Chat
    ```

1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):

    ```bash
    docker run -d -p 3000:8080 \
    --name open-webui \
    -v open-webui:/app/backend/data \
    -e OPENAI_API_BASE_URL=http://<vllm serve host>:<vllm serve port>/v1 \
    --restart always \
    ghcr.io/open-webui/open-webui:main
    ```

1. Open it in the browser: <http://open-webui-host:3000/>

On the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`.

![](../../assets/deployment/open_webui.png)

View File

@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
## Objectives ## Objectives
- Achieve parity of metrics between v0 and v1. - Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments. - The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments.
- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases. - Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
## Background ## Background
Metrics in vLLM can be categorized as follows: Metrics in vLLM can be categorized as follows:
1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus. 1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking. 2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
The mental model is that server-level metrics help explain the values of request-level metrics. The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
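As a rough sketch of that split (metric names and values here are illustrative, not vLLM's actual definitions), server-level state maps naturally onto `prometheus_client` Gauges and Counters, while per-request characteristics map onto Histograms:

```python
from prometheus_client import Counter, Gauge, Histogram

# Server-level metrics: global engine state and throughput.
num_requests_running = Gauge("num_requests_running", "Requests currently running")
prompt_tokens = Counter("prompt_tokens", "Prompt tokens processed")

# Request-level metrics: per-request characteristics, exposed as histograms.
e2e_latency = Histogram("e2e_request_latency_seconds", "End-to-end request latency")

num_requests_running.set(3)
prompt_tokens.inc(128)
e2e_latency.observe(0.42)
```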
### v0 Metrics ### v0 Metrics
@ -61,24 +61,24 @@ These are documented under [Inferencing and Serving -> Production Metrics](../..
### Grafana Dashboard ### Grafana Dashboard
vLLM also provides [a reference example](../../examples/online_serving/prometheus_grafana.md) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard. vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.
The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important: The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
- `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds. - `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds
- `vllm:prompt_tokens_total` - Prompt tokens. - `vllm:prompt_tokens_total` - Prompt Tokens
- `vllm:generation_tokens_total` - Generation tokens. - `vllm:generation_tokens_total` - Generation Tokens
- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds. - `vllm:time_per_output_token_seconds` - Inter token latency (Time Per Output Token, TPOT) in second.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds. - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states. - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in RUNNING, WAITING, and SWAPPED state
- `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM. - `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
- `vllm:request_prompt_tokens` - Request prompt length. - `vllm:request_prompt_tokens` - Request prompt length
- `vllm:request_generation_tokens` - Request generation length. - `vllm:request_generation_tokens` - request generation length
- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached. - `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached
- `vllm:request_queue_time_seconds` - Queue time. - `vllm:request_queue_time_seconds` - Queue Time
- `vllm:request_prefill_time_seconds` - Requests prefill time. - `vllm:request_prefill_time_seconds` - Requests Prefill Time
- `vllm:request_decode_time_seconds` - Requests decode time. - `vllm:request_decode_time_seconds` - Requests Decode Time
- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group. - `vllm:request_max_num_generation_tokens` - Max Generation Token in Sequence Group
See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here. See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process
### Built in Python/Process Metrics ### Built in Python/Process Metrics
The following metrics are supported by default by `prometheus_client`, but they are not exposed when multi-process mode is used: The following metrics are supported by default by `prometheus_client`, but the are not exposed with multiprocess mode is used:
- `python_gc_objects_collected_total` - `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total` - `python_gc_objects_uncollectable_total`
@ -158,7 +158,6 @@ In v1, we wish to move computation and overhead out of the engine core
process to minimize the time between each forward pass. process to minimize the time between each forward pass.
The overall idea of V1 EngineCore design is: The overall idea of V1 EngineCore design is:
- EngineCore is the inner loop. Performance is most critical here - EngineCore is the inner loop. Performance is most critical here
- AsyncLLM is the outer loop. This is overlapped with GPU execution - AsyncLLM is the outer loop. This is overlapped with GPU execution
(ideally), so this is where any "overheads" should be if (ideally), so this is where any "overheads" should be if
@ -179,7 +178,7 @@ time" (`time.time()`) to calculate intervals as the former is
unaffected by system clock changes (e.g. from NTP). unaffected by system clock changes (e.g. from NTP).
It's also important to note that monotonic clocks differ between It's also important to note that monotonic clocks differ between
processes - each process has its own reference point. So it is processes - each process has its own reference. point. So it is
meaningless to compare monotonic timestamps from different processes. meaningless to compare monotonic timestamps from different processes.
Therefore, in order to calculate an interval, we must compare two Therefore, in order to calculate an interval, we must compare two
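A minimal sketch of that rule using only the standard library: intervals are taken from `time.monotonic()`, and only timestamps captured within the same process are ever compared.

```python
import time

# Measure an interval with the monotonic clock (unaffected by NTP or
# system clock changes). Never compare monotonic values across processes.
start = time.monotonic()
time.sleep(0.01)  # stand-in for a forward pass / engine step
interval = time.monotonic() - start
print(f"interval: {interval:.4f}s")
```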
@ -344,15 +343,14 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0 vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
``` ```
!!! note
    The choice of histogram buckets to be most useful to users across a broad set of use cases is not straightforward and will require refinement over time.
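For illustration only (this is not vLLM's actual bucket layout), explicit bucket boundaries are declared like this with `prometheus_client`:

```python
from prometheus_client import Histogram

# Illustrative bucket boundaries, in seconds; refining such a layout for a
# broad set of use cases is exactly the problem described above.
ttft = Histogram(
    "time_to_first_token_seconds",
    "Time to first token",
    buckets=[0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 1.0, 2.5],
)
ttft.observe(0.042)
```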
### Cache Config Info ### Cache Config Info
`prometheus_client` has support for [Info metrics](https://prometheus.github.io/client_python/instrumenting/info/)
which are equivalent to a `Gauge` whose value is permanently set to 1, which are equivalent to a `Gauge` whose value is permanently set to 1,
but exposes interesting key/value pair information via labels. This is but exposes interesting key/value pair information via labels. This is
used for information about an instance that does not change - so it used for information about an instance that does not change - so it
@ -365,11 +363,14 @@ We use this concept for the `vllm:cache_config_info` metric:
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge # TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
``` ```
However, `prometheus_client` has [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead. `multiprocess_mode="mostrecent"` instead.
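A minimal sketch of that workaround (the label names are a subset chosen for illustration, and `mostrecent` assumes a reasonably recent `prometheus_client` release):

```python
from prometheus_client import Gauge

# Emulate an Info-style metric: a Gauge pinned to 1 whose labels carry the
# key/value pairs, usable in multiprocess mode via multiprocess_mode.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(
    block_size="16", cache_dtype="auto", gpu_memory_utilization="0.9"
).set(1)
```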
@ -394,9 +395,11 @@ distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes. recent metric is used, but only from currently running processes.
This was added in <gh-pr:9477> and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in the need for a significant deprecation period by making the change in
v0 also and asking this project to move to the new metric. v0 also and asking this project to move to the new metric.
@ -439,20 +442,23 @@ suddenly (from their perspective) when it is removed, even if there is
an equivalent metric for them to use. an equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](gh-pr:2764) (with a comment in the code), [removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).
In general: In general:
1. We should be cautious about deprecating metrics, especially since 1) We should be cautious about deprecating metrics, especially since
it can be hard to predict the user impact. it can be hard to predict the user impact.
2. We should include a prominent deprecation notice in the help string 2) We should include a prominent deprecation notice in the help string
that is included in the `/metrics' output. that is included in the `/metrics' output.
3. We should list deprecated metrics in user-facing documentation and 3) We should list deprecated metrics in user-facing documentation and
release notes. release notes.
4. We should consider hiding deprecated metrics behind a CLI argument in order to give administrators [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) for some time before deleting them.
See the [deprecation policy](../../contributing/deprecation_policy.md) for See the [deprecation policy](../../contributing/deprecation_policy.md) for
@ -468,7 +474,7 @@ removed.
The `vllm:time_in_queue_requests` Histogram metric was added by The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is: <gh-pr:9659> and its calculation is:
```python ```
self.metrics.first_scheduled_time = now self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time self.metrics.time_in_queue = now - self.metrics.arrival_time
``` ```
@ -476,7 +482,7 @@ The `vllm:time_in_queue_requests` Histogram metric was added by
Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
us with: us with:
```python ```
if seq_group.is_finished(): if seq_group.is_finished():
if (seq_group.metrics.first_scheduled_time is not None and if (seq_group.metrics.first_scheduled_time is not None and
seq_group.metrics.first_token_time is not None): seq_group.metrics.first_token_time is not None):
@ -511,7 +517,8 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`. with `--swap-space` and `--preemption-mode`.
In v0, [vLLM has long supported beam search](gh-issue:6226). The
SequenceGroup encapsulated the idea of N Sequences which SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU sharing between requests, and copy-on-write to do branching. CPU
@ -523,8 +530,9 @@ option than CPU swapping since blocks can be evicted slowly on demand
and the part of the prompt that was evicted can be recomputed. and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`). required for "parallel sampling" (`n>1`). [Beam search was moved out of
[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a the core (in
V0)](gh-issue:8306). There was a
lot of complex code for a very uncommon feature. lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero over head) and therefore In V1, with prefix caching being better (zero over head) and therefore
@ -539,18 +547,18 @@ Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt. request multiple completions from the same prompt.
As part of adding parallel sampling support in <gh-pr:10980>, we should As part of adding parallel sampling support in <gh-pr:10980> we should
also add these metrics. also add these metrics.
- `vllm:request_params_n` (Histogram) - `vllm:request_params_n` (Histogram)
Observes the value of the 'n' parameter of every finished request. Observes the value of the 'n' parameter of every finished request.
- `vllm:request_max_num_generation_tokens` (Histogram) - `vllm:request_max_num_generation_tokens` (Histogram)
Observes the maximum output length of all sequences in every finished Observes the maximum output length of all sequences in every finished
sequence group. In the absence of parallel sampling, this is sequence group. In the absence of parallel sampling, this is
equivalent to `vllm:request_generation_tokens`. equivalent to `vllm:request_generation_tokens`.
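For context, this is what a parallel-sampling request looks like with the offline API (the model name and parameter values are illustrative); each such request would contribute one observation of `n=3` to `vllm:request_params_n`:

```python
from vllm import LLM, SamplingParams

# One prompt, three sampled completions (parallel sampling with n=3).
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(n=3, max_tokens=16),
)
for completion in outputs[0].outputs:
    print(completion.text)
```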
### Speculative Decoding ### Speculative Decoding
@ -568,23 +576,26 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
seculative decoding to v1. Other techniques will follow. We should seculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context. revisit the v0 metrics in this context.
!!! note
    We should probably expose acceptance rate as separate accepted and draft counters, like we do for prefix caching hit rate. Efficiency likely also needs similar treatment.
### Autoscaling and Load-balancing ### Autoscaling and Load-balancing
A common use case for our metrics is to support automated scaling of A common use case for our metrics is to support automated scaling of
vLLM instances. vLLM instances.
For related discussion from the [Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving), see:
- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>. - <gh-issue:5041> and <gh-pr:12726>.
This is a non-trivial topic. Consider this comment from Rob: This is a non-trivial topic. Consider this comment from Rob:
@ -608,16 +619,19 @@ should judge an instance as approaching saturation:
Our approach to naming metrics probably deserves to be revisited: Our approach to naming metrics probably deserves to be revisited:
1. The use of colons in metric names seems contrary to ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
2. Most of our metrics follow the convention of ending with units, but 2. Most of our metrics follow the convention of ending with units, but
not all do. not all do.
3. Some of our metric names end with `_total`: 3. Some of our metric names end with `_total`:
    If there is a suffix of `_total` on the metric name, it will be removed. When exposing the time series for counter, a `_total` suffix will be added. This is for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics requires the `_total` suffix.
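A small sketch of that behaviour with `prometheus_client` (the metric name is illustrative): the counter is declared without the suffix, and the exposition format adds `_total`.

```python
from prometheus_client import Counter, generate_latest

request_success = Counter("vllm:request_success", "Finished requests")
request_success.inc()

# The exposed time series is named "vllm:request_success_total".
print(generate_latest().decode())
```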
### Adding More Metrics ### Adding More Metrics
@ -628,7 +642,8 @@ There is no shortage of ideas for new metrics:
- Proposals arising from specific use cases, like the Kubernetes - Proposals arising from specific use cases, like the Kubernetes
auto-scaling topic above auto-scaling topic above
- Proposals that might arise out of standardisation efforts like - Proposals that might arise out of standardisation efforts like
  [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
We should be cautious in our approach to adding new metrics. While We should be cautious in our approach to adding new metrics. While
metrics are often relatively straightforward to add: metrics are often relatively straightforward to add:
@ -653,14 +668,19 @@ fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing: v0 has support for OpenTelemetry tracing:
- Added by <gh-pr:4687> - Added by <gh-pr:4687>
- Configured with `--oltp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../../examples/online_serving/opentelemetry.md)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
Since metrics is a big enough topic on its own, we are going to tackle Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately. the topic of tracing in v1 separately.
@ -679,7 +699,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled
and if `--collect-detailed-traces=all/model/worker` is used. The and if `--collect-detailed-traces=all/model/worker` is used. The
documentation for this option states: documentation for this option states:
> collect detailed traces for the specified modules. This involves > collect detailed traces for the specified "modules. This involves
> use of possibly costly and or blocking operations and hence might > use of possibly costly and or blocking operations and hence might
> have a performance impact. > have a performance impact.

View File

@ -3,14 +3,14 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica
# Detailed Design # Detailed Design
## Overall Process ## Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow: As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:
1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface. 1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
2. The Proxy/Router selects a **1P1D (1 Prefill instance + 1 Decode instance)** through either round-robin or random selection, generates a `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to **1**, and then forwards the request to the **P instance**. 2. The Proxy/Router selects a **1P1D (1 Prefill instance + 1 Decode instance)** through either round-robin or random selection, generates a `request_id` (rules to be introduced later), modifies the `max_tokens` in the HTTP request message to **1**, and then forwards the request to the **P instance**.
3. Immediately afterward, the Proxy/Router forwards the **original HTTP request** to the **D instance**. 3. Immediately afterward, the Proxy/Router forwards the **original HTTP request** to the **D instance**.
4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT_ASYNC** mode). The D instance's `zmq_addr` can be resolved through the `request_id`. 4. The **P instance** performs **Prefill** and then **actively sends the generated KV cache** to the D instance (using **PUT_ASYNC** mode). The D instance's `zmq_addr` can be resolved through the `request_id`.
5. The **D instance** has a **dedicated thread** for receiving the KV cache (to avoid blocking the main process). The received KV cache is saved into the **GPU memory buffer**, the size of which is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the **local Tensor memory pool**. 5. The **D instance** has a **dedicated thread** for receiving the KV cache (to avoid blocking the main process). The received KV cache is saved into the **GPU memory buffer**, the size of which is determined by the vLLM startup parameter `kv_buffer_size`. When the GPU buffer is full, the KV cache is stored in the **local Tensor memory pool**.
6. During the **Decode**, the D instance's main process retrieves the KV cache (transmitted by the P instance) from either the **GPU buffer** or the **memory pool**, thereby **skipping Prefill**. 6. During the **Decode**, the D instance's main process retrieves the KV cache (transmitted by the P instance) from either the **GPU buffer** or the **memory pool**, thereby **skipping Prefill**.
7. After completing **Decode**, the D instance returns the result to the **Proxy/Router**, which then forwards it to the **client**. 7. After completing **Decode**, the D instance returns the result to the **Proxy/Router**, which then forwards it to the **client**.
![image1](https://github.com/user-attachments/assets/fb01bde6-755b-49f7-ad45-48a94b1e10a7) ![image1](https://github.com/user-attachments/assets/fb01bde6-755b-49f7-ad45-48a94b1e10a7)
@ -31,7 +31,7 @@ Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (cur
## KV Cache Transfer Methods ## KV Cache Transfer Methods
There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache. There are three methods for KVcache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVcache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. PUT_ASYNC uses a dedicated thread for sending KVcache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVcache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVcache from the P instance once it has allocated space for the KVcache.
Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT. Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT.
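As a hedged sketch, the method is selected through the value passed to `--kv-transfer-config`; `kv_buffer_size`, `kv_connector_extra_config` and `send_type` are the fields described above, while the connector name and role strings are assumptions used only for illustration:

```python
import json

config = {
    "kv_connector": "P2pNcclConnector",  # assumed connector name
    "kv_role": "kv_producer",            # illustrative: a Prefill (P) instance
    "kv_buffer_size": 8e9,               # GPU memory buffer, in bytes
    "kv_connector_extra_config": {"send_type": "PUT_ASYNC"},
}
print(f"--kv-transfer-config '{json.dumps(config)}'")
```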
@ -39,13 +39,13 @@ Experimental results have shown that the performance of these methods, from high
As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size, in order to support dynamic scaling (expansion and contraction) of instances with PD disaggregation. This means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size, in order to support dynamic scaling (expansion and contraction) of instances with PD disaggregation. This means that adding or removing P/D instances does not require a full system restart.
Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVCache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVCache data itself. Each P/D instance only needs to create a single `P2pNcclEngine` instance. This instance maintains a ZMQ Server, which runs a dedicated thread to listen on the `zmq_addr` address and receive control flow requests from other instances. These requests include requests to establish an NCCL connection and requests to send KVcache metadata (such as tensor shapes and data types). However, it does not actually transmit the KVcache data itself.
When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size. When a P instance and a D instance transmit KVcache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVcache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVcache transmission can be performed, without being restricted by rank or world size.
## NCCL Group Topology ## NCCL Group Topology
Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance. Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVcache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance.
![image2](https://github.com/user-attachments/assets/837e61d6-365e-4cbf-8640-6dd7ab295b36)
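A quick way to sanity-check the group count described above (an illustrative helper, not part of vLLM):

```python
# Count the NCCL groups for a P2P-NCCL xPyD topology with symmetric TP
# (illustrative helper; vLLM does not expose such a function).
def count_nccl_groups(num_p: int, num_d: int, tp: int) -> int:
    tp_groups = num_p + num_d              # one internal TP group per instance
    transfer_groups = num_p * num_d * tp   # one 2-rank group per (P, D, GPU index)
    return tp_groups + transfer_groups


print(count_nccl_groups(num_p=1, num_d=2, tp=2))  # 3 + 4 = 7, as in Figure 2
```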
@ -53,17 +53,33 @@ Each NCCL group occupies a certain amount of GPU memory buffer for communication
## GPU Memory Buffer and Tensor Memory Pool
The trade-off in the size of the memory buffer is as follows: for P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes, but it should not be too large; the same applies to P instances in GET mode. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. If it is too large, it reduces the KVCache space available for normal inference on D instances, which lowers the inference batch size and ultimately reduces output throughput. The size of the memory buffer is configured by the `kv_buffer_size` parameter, measured in bytes, and is typically set to 5%–10% of the memory size.
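As a rough back-of-the-envelope example (assuming "memory size" here refers to the GPU memory of the instance):

```python
# Rule-of-thumb sizing for kv_buffer_size, following the 5%-10% guidance above
# (assumes "memory size" means the instance's GPU memory; values are examples).
def suggest_kv_buffer_size(gpu_mem_bytes: int, fraction: float) -> int:
    return int(gpu_mem_bytes * fraction)


gpu_mem = 80 * 1024**3                        # an 80 GB card such as an A800
print(suggest_kv_buffer_size(gpu_mem, 0.05))  # ~4.3e9 bytes
print(suggest_kv_buffer_size(gpu_mem, 0.10))  # ~8.6e9 bytes, close to the 8e9 used below
```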
If the `--max-num-seqs` parameter of P instances is set to a large value, the large batch size will cause P instances to generate a large amount of KVCache simultaneously. This may exceed the capacity of the memory buffer of D instances, resulting in KVCache loss. Once KVCache is lost, D instances need to recompute the Prefill, which is equivalent to performing Prefill twice; consequently, the time-to-first-token (TTFT) increases significantly and performance degrades.
To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in the Linux memory subsystem. Since host memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or block-based designs to reuse memory and save space. When the memory buffer is insufficient, KVCache can be stored directly in the Tensor memory pool, and D instances can subsequently retrieve it from there. The read and write speed is that of PCIe; PCIe 4.0 provides roughly 21 GB/s, which is usually faster than the Prefill speed, otherwise solutions like Mooncake and LMCache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst case, my solution performs no worse than the normal situation with a cache store.
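A heavily simplified stand-in for that idea is sketched below. The real pool uses a buddy allocator over a large host-memory region; this sketch only illustrates spilling overflow KVCache to pinned CPU memory keyed by request id and reading it back over PCIe.

```python
# Much-simplified stand-in for the host-side tensor memory pool described above
# (the actual design uses a buddy allocator; this only illustrates the idea of
# spilling KVCache to host memory and reading it back over PCIe).
import torch


class SimpleTensorPool:
    def __init__(self):
        self._store: dict[str, torch.Tensor] = {}

    def put(self, request_id: str, kv: torch.Tensor) -> None:
        # Spill to pinned host memory so the later device upload runs at PCIe speed.
        self._store[request_id] = kv.detach().to("cpu").pin_memory()

    def get(self, request_id: str, device: str = "cuda") -> torch.Tensor:
        return self._store.pop(request_id).to(device, non_blocking=True)
```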
# Install vLLM
??? console "Commands"

    ```shell
    # Enter the home directory or your working directory.
    cd /home
    # Download the installation package, and I will update the commit-id in time. You can directly copy the command.
    wget https://vllm-wheels.s3.us-west-2.amazonaws.com/9112b443a042d8d815880b8780633882ad32b183/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
    # Download the code repository.
    git clone -b xpyd-v1 https://github.com/Abatom/vllm.git
    cd vllm
    # Set the installation package path.
    export VLLM_PRECOMPILED_WHEEL_LOCATION=/home/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
    # Installation.
    pip install -e . -v
    ```
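A quick sanity check that the editable install picked up the precompiled wheel (nothing xPyD-specific, it only confirms the package imports):

```python
# Minimal post-install check.
import vllm
print(vllm.__version__)
```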
# Run xPyD
@ -74,7 +90,7 @@ pip install "vllm>=0.9.2"
- You may need to modify the `kv_buffer_size` and `port` in the following commands (if there is a conflict).
- `PUT_ASYNC` offers the best performance and should be prioritized.
- The `--port` must be consistent with the `http_port` in the `--kv-transfer-config` (a helper for composing this value is sketched after this list).
- The `disagg_prefill_proxy_xpyd.py` script will use port 10001 (for receiving client requests) and port 30001 (for receiving service discovery from P and D instances).
- The node running the proxy must have `quart` installed.
- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
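The `--kv-transfer-config` value is a JSON string. A small helper for composing it (a sketch, not part of vLLM; the field names are copied from the example commands below) might look like this:

```python
# Sketch of a helper that builds the --kv-transfer-config JSON string
# (not part of vLLM; field names are copied from the example commands below).
import json


def kv_transfer_config(role: str, kv_port: int, http_port: int,
                       kv_buffer_size: str,
                       proxy_ip: str = "10.0.1.1", proxy_port: int = 30001) -> str:
    return json.dumps({
        "kv_connector": "P2pNcclConnector",
        "kv_role": role,                   # "kv_producer" for P, "kv_consumer" for D
        "kv_buffer_size": kv_buffer_size,  # e.g. "1e1" for P in PUT/PUT_ASYNC, "8e9" for D
        "kv_port": str(kv_port),
        "kv_connector_extra_config": {
            "proxy_ip": proxy_ip,
            "proxy_port": str(proxy_port),
            "http_port": str(http_port),   # must match the server's --port
            "send_type": "PUT_ASYNC",
            "nccl_num_channels": "16",
        },
    })


print(kv_transfer_config("kv_producer", kv_port=21001, http_port=20005, kv_buffer_size="1e1"))
```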
@ -84,8 +100,8 @@ pip install "vllm>=0.9.2"
### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disagg_xpyd/
python3 disagg_prefill_proxy_xpyd.py &
```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
@ -95,7 +111,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20001 \ --port 20005 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -107,7 +123,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
@ -117,7 +133,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20009 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -129,7 +145,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
@ -151,7 +167,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
@ -161,7 +177,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20004 \ --port 20008 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -173,7 +189,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
## Run 3P1D
@ -181,8 +197,8 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
### Proxy (e.g. 10.0.1.1)
```shell
cd {your vllm directory}/examples/online_serving/disagg_xpyd/
python3 disagg_prefill_proxy_xpyd.py &
```
### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
@ -192,7 +208,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20001 \ --port 20005 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -204,7 +220,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
@ -214,7 +230,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20002 \ --port 20009 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -226,7 +242,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20009","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
@ -248,7 +264,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.9 \ --gpu-memory-utilization 0.9 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
@ -258,7 +274,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
```shell ```shell
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 20004 \ --port 20008 \
--tensor-parallel-size 1 \ --tensor-parallel-size 1 \
--seed 1024 \ --seed 1024 \
--served-model-name base_model \ --served-model-name base_model \
@ -270,7 +286,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
--gpu-memory-utilization 0.7 \ --gpu-memory-utilization 0.7 \
--disable-log-request \ --disable-log-request \
--kv-transfer-config \ --kv-transfer-config \
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20008","send_type":"PUT_ASYNC","nccl_num_channels":"16"}}' > /var/vllm.log 2>&1 &
``` ```
# Single request
@ -291,7 +307,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
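The full `curl` command is only partially visible in this hunk. A minimal Python equivalent (assuming the proxy from the notes above is listening on `10.0.1.1:10001` and the servers were started with `--served-model-name base_model`; the prompt text is just an example) might look like:

```python
# Minimal Python equivalent of the single curl request above (assumptions:
# proxy at 10.0.1.1:10001, servers started with --served-model-name base_model).
import requests

resp = requests.post(
    "http://10.0.1.1:10001/v1/completions",
    json={
        "model": "base_model",
        "prompt": "San Francisco is a",
        "max_tokens": 24,
        "temperature": 0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```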
??? console "Command" ??? console "Command"
```shell ```shell
vllm bench serve \ python3 benchmark_serving.py \
--backend vllm \ --backend vllm \
--model base_model \ --model base_model \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \ --tokenizer meta-llama/Llama-3.1-8B-Instruct \
@ -318,6 +334,24 @@ pgrep python | xargs kill -9 && pkill -f python
# Test data
## **Scenario 1**: 1K input & 1K output tokens, E2E P99 latency ~20s
- **1P5D (6×A800) vs vLLM (1×A800)**:
- Throughput ↑7.2% (1085 → 6979/6)
- ITL (P99) ↓81.3% (120ms → 22.9ms)
- TTFT (P99) ↑26.8% (175ms → 222ms)
- TPOT: No change
- **1P6D (7×A800) vs vLLM (1×A800)**:
- Throughput ↑9.6% (1085 → 8329/7)
- ITL (P99) ↓81.0% (120ms → 22.7ms)
- TTFT (P99) ↑210% (175ms → 543ms)
- TPOT: No change
## **Scenario 2**: 1K input & 200 output tokens, E2E P99 latency ~4s
- **1P1D (2×A800) vs vLLM (1×A800)**:
- Throughput ↑37.4% (537 → 1476/2)
- ITL (P99) ↓81.8% (127ms → 23.1ms)
- TTFT (P99) ↑41.8% (160ms → 227ms)
- TPOT: No change
![testdata](https://github.com/user-attachments/assets/f791bfc7-9f3d-4e5c-9171-a42f9f4da627)
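For reference, the per-GPU comparisons above divide the aggregate disaggregated throughput by the number of GPUs before comparing against the single-GPU baseline:

```python
# Reproduce the per-GPU throughput arithmetic quoted above.
def per_gpu_gain(disagg_total: float, num_gpus: int, single_gpu_baseline: float) -> float:
    return (disagg_total / num_gpus) / single_gpu_baseline - 1.0


print(f"1P5D: {per_gpu_gain(6979, 6, 1085):+.1%}")  # ~ +7.2%
print(f"1P6D: {per_gpu_gain(8329, 7, 1085):+.1%}")  # ~ +9.7% (quoted as 9.6%)
print(f"1P1D: {per_gpu_gain(1476, 2, 537):+.1%}")   # ~ +37.4%
```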
View File
@ -34,22 +34,23 @@ th:not(:first-child) {
} }
</style> </style>
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | | Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD](spec_decode.md) | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | | [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | | | [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | | [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| [SD](spec_decode.md) | ✅ | ✅ | | ✅ | | | | | | | | | | | | <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | | ✅ | ✅ | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | | ✅ | ✅ | | | | | | | | | | |
| <abbr title="Pooling Models">pooling</abbr> | | | | | | ✅ | | | | | | | | | | CUDA graph | | | | | | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | [](gh-issue:7366) | | | ✅ | | | | | | | | | <abbr title="Pooling Models">pooling</abbr> | ❌ | ❌ | ❌ | ❌ | | | ✅ | | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | | | | | | | ✅ | ✅ | | | | | | | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | | [](gh-issue:7366) | | | [](gh-issue:7366) | | ✅ | ✅ | | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | | ✅ | | ❌ | ✅ | ✅ | ✅ | | | | | | <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | | ✅ | | ❌ | ✅ | ✅ | ✅ | | | | | |
| multi-step | | ✅ | | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | | <abbr title="Async Output Processing">async output</abbr> | ✅ | | ✅ | | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
| <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | | ✅ | | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | | multi-step | ❌ | ✅ | ❌ | ✅ | | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
| best-of | ✅ | ✅ | ✅ | [](gh-issue:6137) | | | ✅ | ✅ | ✅ | | [](gh-issue:7968) | ✅ | ✅ | | | <abbr title="Multimodal Inputs">mm</abbr> | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | | | ✅ | ✅ | ✅ | | ✅ | ✅ | ❔ | ✅ | | |
| beam-search | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | | ✅ | | | best-of | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | | ✅ | |
| beam-search | ✅ | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ |
[](){ #feature-x-hardware } [](){ #feature-x-hardware }
@ -58,9 +59,10 @@ th:not(:first-child) {
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | | Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| |-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ❌ | | <abbr title="Prompt Adapter">prmpt adptr</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8475) | ✅ | ❌ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ | | <abbr title="Pooling Models">pooling</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
View File
@ -302,7 +302,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au
return tokenizer.apply_chat_template(chat, tokenize=False) return tokenizer.apply_chat_template(chat, tokenize=False)
llm = LLM( model = LLM(
model=model_id, model=model_id,
enable_lora=True, enable_lora=True,
max_lora_rank=64, max_lora_rank=64,
@ -329,7 +329,7 @@ To this end, we allow registration of default multimodal LoRAs to handle this au
} }
outputs = llm.generate( outputs = model.generate(
inputs, inputs,
sampling_params=SamplingParams( sampling_params=SamplingParams(
temperature=0.2, temperature=0.2,
View File
@ -98,7 +98,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py> Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings: If using the [LLM.chat](https://docs.vllm.ai/en/stable/models/generative_models.html#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
```python ```python
from vllm import LLM from vllm import LLM
@ -177,70 +177,6 @@ Multi-image input can be extended to perform video captioning. We show this with
You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input. instead of using multi-image input.
Instead of NumPy arrays, you can also pass `'torch.Tensor'` instances, as shown in this example using Qwen2.5-VL:
??? code
```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
model_path = "Qwen/Qwen2.5-VL-3B-Instruct/"
video_path = "https://content.pexels.com/videos/free-videos.mp4"
llm = LLM(
model=model_path,
gpu_memory_utilization=0.8,
enforce_eager=True,
limit_mm_per_prompt={"video": 1},
)
sampling_params = SamplingParams(
max_tokens=1024,
)
video_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "text", "text": "describe this video."},
{
"type": "video",
"video": video_path,
"total_pixels": 20480 * 28 * 28,
"min_pixels": 16 * 28 * 28
}
]
},
]
messages = video_messages
processor = AutoProcessor.from_pretrained(model_path)
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
mm_data = {}
if video_inputs is not None:
mm_data["video"] = video_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
```
!!! note
'process_vision_info' is only applicable to Qwen2.5-VL and similar models.
Full example: <gh-file:examples/offline_inference/vision_language.py> Full example: <gh-file:examples/offline_inference/vision_language.py>
### Audio Inputs ### Audio Inputs
View File
@ -6,12 +6,10 @@ Contents:
- [Supported Hardware](supported_hardware.md) - [Supported Hardware](supported_hardware.md)
- [AutoAWQ](auto_awq.md) - [AutoAWQ](auto_awq.md)
- [AutoRound](auto_round.md)
- [BitsAndBytes](bnb.md) - [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md) - [BitBLAS](bitblas.md)
- [GGUF](gguf.md) - [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md) - [GPTQModel](gptqmodel.md)
- [INC](inc.md)
- [INT4 W4A16](int4.md) - [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md) - [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md) - [FP8 W8A8](fp8.md)
View File
@ -1,103 +0,0 @@
# AutoRound
[AutoRound](https://github.com/intel/auto-round) is Intel's advanced quantization algorithm designed to produce highly efficient **INT2, INT3, INT4, and INT8**
quantized large language models, striking an optimal balance between accuracy and deployment performance.
AutoRound applies weight-only quantization to transformer-based models, enabling significant memory savings and faster
inference while maintaining near-original accuracy. It supports a wide range of hardware platforms, including **CPUs,
Intel GPUs, HPUs, and CUDA-enabled devices**.
Please refer to the [AutoRound guide](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md) for more details.
Key Features:
✅ **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** are supported
✅ **10+ vision-language models (VLMs)** are supported
✅ **Per-layer mixed-bit quantization** for fine-grained control
✅ **RTN (Round-To-Nearest) mode** for quick quantization with slight accuracy loss
✅ **Multiple quantization recipes**: best, base, and light
✅ Advanced utilities such as immediate packing and support for **10+ backends**
## Installation
```bash
uv pip install auto-round
```
## Quantizing a model
For VLMs, please change to `auto-round-mllm` in CLI usage and `AutoRoundMLLM` in API usage.
### CLI usage
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--bits 4 \
--group_size 128 \
--format "auto_round" \
--output_dir ./tmp_autoround
```
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--format "gguf:q4_k_m" \
--output_dir ./tmp_autoround
```
### API usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)
# the best accuracy, 4-5X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)
# 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=50, lr=5e-3, bits=bits, group_size=group_size, sym=sym )
output_dir = "./tmp_autoround"
# format= 'auto_round'(default), 'auto_gptq', 'auto_awq'
autoround.quantize_and_save(output_dir, format="auto_round")
```
## Running a quantized model with vLLM
Here is some example code to run auto-round format in vLLM:
```python
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
# Acknowledgement
Special thanks to open-source low precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and
ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
View File
@ -5,7 +5,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic
!!! note !!! note
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`). Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper. Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
For details see [supported hardware](supported_hardware.md). For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
Below are the steps to utilize BitBLAS with vLLM. Below are the steps to utilize BitBLAS with vLLM.
View File
@ -86,9 +86,8 @@ Load and run the model in `vllm`:
```python ```python
from vllm import LLM from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
llm = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic") result = model.generate("Hello my name is")
result = llm.generate("Hello my name is")
print(result[0].outputs[0].text) print(result[0].outputs[0].text)
``` ```
@ -126,10 +125,9 @@ In this mode, all Linear modules (except for the final `lm_head`) have their wei
```python ```python
from vllm import LLM from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
llm = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB # INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = llm.generate("Hello, my name is") result = model.generate("Hello, my name is")
print(result[0].outputs[0].text) print(result[0].outputs[0].text)
``` ```
View File
@ -1,56 +0,0 @@
---
title: FP8 INC
---
[](){ #inc }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators.
Currently, quantization is validated only in Llama models.
Intel Gaudi supports quantization of various modules and functions, including, but not limited to `Linear`, `KVCache`, `Matmul` and `Softmax`. For more information, please refer to:
[Supported Modules\\Supported Functions\\Custom Patched Modules](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-modules).
!!! note
Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the [vllm-hpu-extention](https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration/README.md) package.
!!! note
`QUANT_CONFIG` is an environment variable that points to the measurement or quantization [JSON config file](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options).
The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference.
## Run Online Inference Using FP8
Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
```bash
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
```
!!! tip
If you are just prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.
!!! tip
When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the below environment variables:
`VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
`VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
## Run Offline Inference Using FP8
To run offline inference (after completing the model calibration process):
* Set the "QUANT_CONFIG" environment variable to point to a JSON configuration file with QUANTIZE mode.
* Pass `quantization=inc` and `kv_cache_dtype=fp8_inc` as parameters to the `LLM` object.
* Call shutdown method of the model_executor at the end of the run.
```python
from vllm import LLM
llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc")
...
# Call llm.generate on the required prompts and sampling params.
...
llm.llm_engine.model_executor.shutdown()
```
## Device for the Model's Weights Uploading
The unquantized weights are first loaded onto the CPU, then quantized and transferred to the target device (HPU) for model execution.
This reduces the device memory footprint of model weights, as only quantized weights are stored in the device memory.
View File
@ -108,8 +108,7 @@ After quantization, you can load and run the model in vLLM:
```python ```python
from vllm import LLM from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
llm = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
``` ```
To evaluate accuracy, you can use `lm_eval`:
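For example, via the `lm_eval` Python API (a sketch; the task and few-shot settings below are illustrative choices, not taken from the original text):

```python
# Sketch: evaluating the quantized checkpoint with lm_eval's Python API
# (task and few-shot settings are illustrative assumptions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=./Meta-Llama-3-8B-Instruct-W4A16-G128,add_bos_token=true",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```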
View File
@ -114,8 +114,7 @@ After quantization, you can load and run the model in vLLM:
```python ```python
from vllm import LLM from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
llm = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
``` ```
To evaluate accuracy, you can use `lm_eval`: To evaluate accuracy, you can use `lm_eval`:
View File
@ -2,19 +2,18 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU | | Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Neuron | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|--------------| |-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | | ✅︎ | ❌ | ❌ | | AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | | GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | | INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ | | FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | | GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware. - ✅︎ indicates that the quantization method is supported on the specified hardware.
Some files were not shown because too many files have changed in this diff.