Add load pattern configuration guide to benchmarks (#26886)

Signed-off-by: Matvei Pashkovskii <mpashkov@amd.com> Signed-off-by: Matvei Pashkovskii <matvei.pashkovskii@amd.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-28 19:49:15 +02:00
parent e3d8186666
commit 130aa8cbcf
2 changed files with 67 additions and 0 deletions
--- a/docs/assets/contributing/load-pattern-examples.png
+++ b/docs/assets/contributing/load-pattern-examples.png
--- a/docs/contributing/benchmarks.md
+++ b/docs/contributing/benchmarks.md
@@ -321,6 +321,73 @@ The following arguments can be used to control the ramp-up:
 - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
 - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
 ##### Load Pattern Configuration
 vLLM's benchmark serving script simulates different load patterns through three parameters that control request generation and concurrency:
 ###### Load Pattern Control Parameters
 - `--request-rate`: Controls the target request generation rate (requests per second). Set to `inf` for maximum throughput testing or finite values for controlled load simulation.
 - `--burstiness`: Controls traffic variability as the shape parameter of a Gamma distribution (must be > 0). Values below 1.0 create burstier traffic; values above 1.0 create more evenly spaced traffic.
 - `--max-concurrency`: Limits concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.
 These parameters work together to shape the load. The `--request-rate` parameter defaults to `inf` (infinite), which sends all requests immediately for maximum throughput testing. When it is set to a finite value, request inter-arrival times are drawn from a Gamma distribution, which reduces to a Poisson process at the default `--burstiness=1.0`. The `--burstiness` parameter only takes effect when `--request-rate` is finite: a value of 1.0 creates natural Poisson traffic, lower values (0.1-0.5) create bursty patterns, and higher values (2.0-5.0) create more uniform spacing. The `--max-concurrency` parameter defaults to `None` (unlimited) but can be set to simulate real-world constraints where a load balancer or API gateway limits concurrent connections. Combined, these parameters cover everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.
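The interaction described above can be sketched as a small helper that assembles the flag list. `load_pattern_args` is a hypothetical name used only for illustration and is not part of vLLM; the three flag names are the ones documented here, and dropping `--burstiness` for an infinite rate simply mirrors the statement that it has no effect in that case.

```python
import math


def load_pattern_args(request_rate=math.inf, burstiness=1.0, max_concurrency=None):
    """Build the load-pattern flags for a benchmark run (illustrative helper)."""
    if burstiness <= 0:
        raise ValueError("burstiness must be > 0")
    if request_rate <= 0:
        raise ValueError("request_rate must be > 0")

    rate = "inf" if math.isinf(request_rate) else str(request_rate)
    args = ["--request-rate", rate]

    # Burstiness only matters when request timing is actually controlled.
    if not math.isinf(request_rate):
        args += ["--burstiness", str(burstiness)]

    # Omitting the flag leaves concurrency unlimited.
    if max_concurrency is not None:
        args += ["--max-concurrency", str(max_concurrency)]
    return args


print(" ".join(load_pattern_args(max_concurrency=64)))
print(" ".join(load_pattern_args(request_rate=10.0, burstiness=0.5)))
```

The resulting list can be appended to whatever benchmark command you already use.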
 The `--burstiness` parameter mathematically controls request arrival patterns using a Gamma distribution where:
 - Shape parameter: `burstiness` value
 - Coefficient of Variation (CV): $\frac{1}{\sqrt{\text{burstiness}}}$
 - Traffic characteristics:
    - `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
    - `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation
    - `burstiness = 5.0`: More uniform traffic (CV ≈ 0.45) - controlled load testing
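A minimal sketch of the Gamma model above, using only the Python standard library. The shape parameter is the `burstiness` value as stated; the scale `1 / (request_rate * burstiness)` is an assumption made here so that the mean gap stays at `1 / request_rate` regardless of burstiness. The function names are hypothetical and this is not the benchmark script's own code.

```python
import math
import random
import statistics


def sample_inter_arrival_times(request_rate, burstiness, n, seed=0):
    """Draw n inter-arrival gaps (seconds) from a Gamma distribution.

    shape = burstiness; scale is chosen (assumption) so that the
    mean gap equals 1 / request_rate.
    """
    if request_rate <= 0 or burstiness <= 0:
        raise ValueError("request_rate and burstiness must be > 0")
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


def theoretical_cv(burstiness):
    """Coefficient of variation of a Gamma distribution with this shape."""
    return 1.0 / math.sqrt(burstiness)


for b in (0.1, 1.0, 5.0):
    gaps = sample_inter_arrival_times(request_rate=10.0, burstiness=b, n=50_000)
    mean_gap = statistics.fmean(gaps)
    sample_cv = statistics.pstdev(gaps) / mean_gap
    print(f"burstiness={b}: mean gap={mean_gap:.3f}s, "
          f"sample CV={sample_cv:.2f}, theoretical CV={theoretical_cv(b):.2f}")
```

The sample CV should land close to the theoretical value for each setting, while the mean gap stays near 0.1 s at 10 requests per second.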
 ![Load Pattern Examples](../assets/contributing/load-pattern-examples.png)
 *Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*
 Load Pattern Recommendations by Use Case:
 | Use Case           | Burstiness | Request Rate (req/s) | Max Concurrency | Description |
 | ---                | ---        | ---                  | ---             | --- |
 | Maximum Throughput | N/A        | Infinite             | Limited         | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
 | Realistic Testing  | 1.0        | Moderate (5-20)      | Unlimited       | Natural Poisson traffic patterns for baseline performance |
 | Stress Testing     | 0.1-0.5    | High (20-100)        | Unlimited       | Challenging burst patterns to test resilience |
 | Latency Profiling  | 2.0-5.0    | Low (1-10)           | Unlimited       | Uniform load for consistent timing analysis |
 | Capacity Planning  | 1.0        | Variable             | Limited         | Test resource limits with realistic constraints |
 | SLA Validation     | 1.0        | Target rate          | SLA limit       | Production-like constraints for compliance testing |
 These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
 The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:
 - Users send requests as fast as they can (infinite rate)
 - A load balancer or API gateway controls the maximum concurrent connections
 - The system operates at its concurrency limit, revealing true throughput capacity
 - `--burstiness` has no effect since request timing is not controlled when rate is infinite
 This pattern helps determine optimal concurrency settings for your production load balancer configuration.
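The effect of a concurrency cap under unlimited demand can be illustrated with a toy closed-loop client. This is a sketch under stated assumptions, not the benchmark script's implementation: `run_closed_loop` is a hypothetical name, and `asyncio.sleep` stands in for a real request to the server.

```python
import asyncio


async def run_closed_loop(num_requests, max_concurrency, service_time=0.01):
    """Submit all requests at once, but let at most max_concurrency run."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak_in_flight = 0

    async def one_request():
        nonlocal in_flight, peak_in_flight
        async with semaphore:  # waits here once the cap is reached
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            await asyncio.sleep(service_time)  # placeholder for a real call
            in_flight -= 1

    # "Infinite rate": every request is created immediately.
    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    return peak_in_flight


peak = asyncio.run(run_closed_loop(num_requests=50, max_concurrency=8))
print(f"peak in-flight requests: {peak}")
```

Because demand is unlimited, the number of in-flight requests sits at the cap for most of the run, which is why this pattern exposes throughput at a given concurrency level.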
 To effectively configure load patterns, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:
 ```text
 GPU KV cache size: 15,728,640 tokens
 Maximum concurrency for 8,192 tokens per request: 1920
 ```
 Where:
 - GPU KV cache size: Total tokens that can be cached across all concurrent requests
 - Maximum concurrency: Theoretical maximum number of concurrent requests, assuming every request uses the full `max_model_len`
 - Calculation: `max_concurrency = kv_cache_size / max_model_len`
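As a worked check of the calculation above, using the numbers from the sample log. The function name is hypothetical and chosen only for this example.

```python
def max_concurrent_requests(kv_cache_tokens: int, max_model_len: int) -> float:
    """Theoretical concurrency if every request fills max_model_len tokens."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_cache_tokens / max_model_len


# Values taken from the startup log shown above.
print(max_concurrent_requests(15_728_640, 8_192))  # 1920.0
```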
 Using KV cache metrics for load pattern configuration:
 - For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
 - For SLA Validation: Use the reported maximum as your SLA limit to ensure compliance testing matches production capacity
 - For Realistic Testing: Monitor memory usage when approaching theoretical limits to understand sustainable request rates
 - Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths
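The 80-90% guidance for Capacity Planning can be turned into concrete `--max-concurrency` candidates with a few lines of arithmetic. This is an illustrative sketch; `capacity_planning_range` is a hypothetical helper, and the percentages are simply the ones suggested above.

```python
def capacity_planning_range(reported_max: int, low_pct: int = 80, high_pct: int = 90):
    """Return (low, high) --max-concurrency values as a share of the reported maximum."""
    if not 0 < low_pct <= high_pct <= 100:
        raise ValueError("expected 0 < low_pct <= high_pct <= 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return reported_max * low_pct // 100, reported_max * high_pct // 100


# 1920 is the maximum concurrency from the sample log above.
print(capacity_planning_range(1920))  # (1536, 1728)
```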
 </details>
 #### 📈 Offline Throughput Benchmark