Add load pattern configuration guide to benchmarks (#26886)

Signed-off-by: Matvei Pashkovskii <mpashkov@amd.com>
Signed-off-by: Matvei Pashkovskii <matvei.pashkovskii@amd.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Commit 130aa8cbcf (parent e3d8186666), committed by GitHub.

docs/assets/contributing/load-pattern-examples.png: binary file, not shown (size: 577 KiB)

@@ -321,6 +321,73 @@ The following arguments can be used to control the ramp-up:

- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.

##### Load Pattern Configuration

vLLM's benchmark serving script shapes the load it generates through three parameters that control request timing and concurrency:

###### Load Pattern Control Parameters

- `--request-rate`: Sets the target request generation rate in requests per second. Use `inf` for maximum throughput testing or a finite value for controlled load simulation.
- `--burstiness`: Sets the variability of request inter-arrival times using a Gamma distribution (must be > 0). Lower values create bursty traffic; higher values create more uniform traffic.
- `--max-concurrency`: Limits the number of concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.

These parameters work together to create realistic load patterns. `--request-rate` defaults to `inf`, which sends all requests immediately for maximum throughput testing. With a finite value, request arrivals follow a Poisson process (the default, `--burstiness=1.0`) or, for other burstiness values, a Gamma distribution.

`--burstiness` only takes effect when `--request-rate` is finite: a value of 1.0 gives natural Poisson traffic, lower values (0.1-0.5) give bursty patterns, and higher values (2.0-5.0) give more uniform spacing.

`--max-concurrency` defaults to `None` (unlimited) but can be set to simulate real-world constraints, such as a load balancer or API gateway that limits concurrent connections. Combined, the three parameters cover everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.
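
One way to picture how the three parameters interact is a small `asyncio` simulation. This is a sketch under stated assumptions, not vLLM's implementation: `run_load`, `fake_request`, and `service_time` are hypothetical names, a sleep stands in for a real request, and the Gamma scale is chosen so that the mean gap equals `1 / request_rate`.

```python
import asyncio
import random


async def run_load(num_requests, request_rate, burstiness=1.0,
                   max_concurrency=None, service_time=0.01, seed=0):
    """Dispatch fake requests and return the peak number in flight."""
    rng = random.Random(seed)
    # None means unlimited concurrency, mirroring the flag's default.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None
    in_flight = 0
    peak = 0

    async def fake_request():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(service_time)  # stand-in for a real HTTP call
        in_flight -= 1

    async def limited_request():
        if semaphore is None:
            await fake_request()
        else:
            async with semaphore:  # waits here once the limit is reached
                await fake_request()

    tasks = []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(limited_request()))
        if request_rate != float("inf"):
            # Gamma-distributed gap with mean 1 / request_rate (assumed scale).
            gap = rng.gammavariate(burstiness, 1.0 / (request_rate * burstiness))
            await asyncio.sleep(gap)
        # With an infinite rate there is no gap, so burstiness is never used.
    await asyncio.gather(*tasks)
    return peak


print(asyncio.run(run_load(20, float("inf"), max_concurrency=4)))
```

With an infinite rate, the concurrency limit is the only thing holding requests back, which is why the peak number in flight is determined by `max_concurrency` alone.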

The `--burstiness` parameter controls request arrival patterns through a Gamma distribution where:

- Shape parameter: the `burstiness` value
- Coefficient of Variation (CV): $\frac{1}{\sqrt{\text{burstiness}}}$
- Traffic characteristics:
    - `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
    - `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation
    - `burstiness = 5.0`: Uniform traffic (CV ≈ 0.45) - controlled load testing
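
The CV values above can be checked empirically. The sketch below uses only the standard library and assumes that inter-arrival gaps are drawn from a Gamma distribution with shape `burstiness` and scale `1 / (request_rate * burstiness)`, so the mean gap stays at `1 / request_rate`. The helper names `sample_intervals` and `coefficient_of_variation` are hypothetical.

```python
import math
import random
import statistics


def sample_intervals(request_rate, burstiness, n, seed=0):
    """Draw n inter-arrival gaps (seconds) from a Gamma distribution."""
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)  # assumed: keeps mean at 1/rate
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(samples)
    variance = statistics.fmean((x - mean) ** 2 for x in samples)
    return math.sqrt(variance) / mean


for b in (0.1, 1.0, 5.0):
    gaps = sample_intervals(request_rate=10.0, burstiness=b, n=50_000)
    print(f"burstiness={b}: mean gap={statistics.fmean(gaps):.4f}s, "
          f"CV={coefficient_of_variation(gaps):.2f} "
          f"(theory {1 / math.sqrt(b):.2f})")
```

The mean gap is the same in all three cases; only the spread around it changes, which is what distinguishes bursty from uniform traffic at the same request rate.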

*Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*

Load Pattern Recommendations by Use Case:

| Use Case | Burstiness | Request Rate (req/s) | Max Concurrency | Description |
| --- | --- | --- | --- | --- |
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
| Realistic Testing | 1.0 | Moderate (5-20) | Infinite | Natural Poisson traffic patterns for baseline performance |
| Stress Testing | 0.1-0.5 | High (20-100) | Infinite | Challenging burst patterns to test resilience |
| Latency Profiling | 2.0-5.0 | Low (1-10) | Infinite | Uniform load for consistent timing analysis |
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |

These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
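
As an illustration, a few rows of the table can be written down as presets and turned into command-line flags. The preset names, the `to_flags` helper, and the specific numbers (including the concurrency limit of 64) are hypothetical values picked from within the ranges above, not recommendations. Only the three flag names come from this guide.

```python
# Hypothetical presets; values are examples picked from the table's ranges.
PRESETS = {
    "maximum_throughput": {"request_rate": "inf", "max_concurrency": 64},
    "realistic_testing": {"request_rate": 10, "burstiness": 1.0},
    "stress_testing": {"request_rate": 50, "burstiness": 0.2},
    "latency_profiling": {"request_rate": 5, "burstiness": 3.0},
}


def to_flags(preset):
    """Convert a preset dict into --flag=value strings."""
    return [f"--{name.replace('_', '-')}={value}"
            for name, value in preset.items()]


for use_case, preset in PRESETS.items():
    print(use_case, " ".join(to_flags(preset)))
```

Omitting `max_concurrency` from a preset leaves concurrency unlimited, matching the "Infinite" entries in the table.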

The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:

- Users send requests as fast as they can (infinite rate)
- A load balancer or API gateway controls the maximum concurrent connections
- The system operates at its concurrency limit, revealing true throughput capacity
- `--burstiness` has no effect since request timing is not controlled when rate is infinite

This pattern helps determine optimal concurrency settings for your production load balancer configuration.

To effectively configure load patterns, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:

```text
GPU KV cache size: 15,728,640 tokens
Maximum concurrency for 8,192 tokens per request: 1920
```

Where:

- GPU KV cache size: Total tokens that can be cached across all concurrent requests
- Maximum concurrency: Theoretical maximum concurrent requests for the given `max_model_len`
- Calculation: `max_concurrency = kv_cache_size / max_model_len`
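
Applied to the startup log above, the calculation works out as follows. This is a minimal sketch; the function name `max_concurrency` is just an illustrative helper, not part of vLLM.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """Theoretical number of full-length requests the KV cache can hold."""
    return kv_cache_tokens / max_model_len


# Values taken from the example log output.
print(max_concurrency(15_728_640, 8_192))  # 1920.0
```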

Using KV cache metrics for load pattern configuration:

- For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
- For SLA Validation: Use the reported maximum as your SLA limit to ensure compliance testing matches production capacity
- For Realistic Testing: Monitor memory usage when approaching theoretical limits to understand sustainable request rates
- Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths
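
For the Capacity Planning guidance, the 80-90% band can be computed directly from the reported maximum. The helper below is a hypothetical sketch that uses integer percentages to avoid floating-point rounding and rounds down to whole requests.

```python
def capacity_planning_range(reported_max: int, low_pct: int = 80,
                            high_pct: int = 90) -> tuple:
    """Return (low, high) --max-concurrency values, rounded down."""
    return reported_max * low_pct // 100, reported_max * high_pct // 100


# Using the maximum concurrency of 1920 from the example log output.
low, high = capacity_planning_range(1920)
print(f"--max-concurrency between {low} and {high}")  # between 1536 and 1728
```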

</details>

#### 📈 Offline Throughput Benchmark