Request Pattern Utilization
How do steady, burst, uneven-size, and spike-to-zero traffic patterns affect GPU occupancy?
Project: GPU Inference Decision Lab
Focus: Traffic shape
Run shape: 4 cases / 1 profile
Evidence: Selected report: Pattern matrix
Why it matters
This experiment runs steady, burst, uneven-size, and spike-to-zero traffic against the same default profile.
Average load can hide the operating problem. The same serving profile behaves differently when traffic arrives steadily, in bursts, or as a mixed-size workload.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Traffic shape changes the result
The default profile stayed clean under steady traffic (100% delivered, peak active 7), but burst and spike-to-zero traffic drove peak active requests to 127 and 128 and dropped client-side work, delivering 87.5% and 79.8% respectively.
Steady traffic
1.29s p95
100% delivered; peak active 7
Burst traffic
87.5% delivered
8.56s p95; peak active 127
Uneven mix
99.7% delivered
7.87s p95; mixed tail widened
Default-profile traffic matrix
The same serving profile has different delivery and latency behavior depending on traffic shape.
| Pattern | Delivery | p95 latency | Peak active | Avg / max GPU |
| --- | --- | --- | --- | --- |
| steady-small | 100.0% | 1.29s | 7 | 69.0% / 84% |
| burst-small | 87.5% | 8.56s | 127 | 77.3% / 79% |
| uneven-size-mix | 99.7% | 7.87s | 25 | 65.2% / 84% |
| spike-to-zero | 79.8% | 8.45s | 128 | 75.5% / 76% |
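The matrix above can be read programmatically. The following is a minimal sketch, not part of the project tooling: it encodes the four rows as data and flags each pattern against two thresholds. The `PatternResult` type, the `flag` helper, and both threshold values (99% delivery, 2s p95) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternResult:
    pattern: str
    delivery_pct: float
    p95_s: float
    peak_active: int
    avg_gpu_pct: float
    max_gpu_pct: float


# Values copied from the default-profile traffic matrix.
MATRIX = [
    PatternResult("steady-small", 100.0, 1.29, 7, 69.0, 84.0),
    PatternResult("burst-small", 87.5, 8.56, 127, 77.3, 79.0),
    PatternResult("uneven-size-mix", 99.7, 7.87, 25, 65.2, 84.0),
    PatternResult("spike-to-zero", 79.8, 8.45, 128, 75.5, 76.0),
]


def flag(result, min_delivery_pct=99.0, max_p95_s=2.0):
    """Return which hypothetical thresholds a pattern misses (empty list if none)."""
    reasons = []
    if result.delivery_pct < min_delivery_pct:
        reasons.append("delivery")
    if result.p95_s > max_p95_s:
        reasons.append("p95")
    return reasons


for r in MATRIX:
    print(f"{r.pattern}: avg GPU {r.avg_gpu_pct}%, flags: {flag(r) or 'none'}")
```

Average GPU utilization stays between 65.2% and 77.3% across all four rows, while delivery and p95 latency vary widely. Under these illustrative thresholds only steady-small passes both checks, which is the sense in which average load can hide the operating problem.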
Evidence boundary
These reports use direct clients with no admission buffer. Dropped iterations are unserved client work, not successful backpressure handling.
- Steady 512/128 traffic delivered 100% of work with p95 near 1.3s.
- Burst and spike-to-zero traffic reached 127-128 active requests and dropped work.
- The uneven-size mix preserved delivery but widened the tail; the report does not include per-shape latency buckets.
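To illustrate the distinction between dropped client work and an admission buffer, here is a minimal sketch of a bounded buffer that holds requests arriving before the backend is ready. It is not the project's implementation: the `BoundedAdmission` class, its method names, and the capacity of 3 are all hypothetical.

```python
from collections import deque


class BoundedAdmission:
    """Hold requests that arrive before readiness, up to a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.rejected = 0
        self.ready = False

    def submit(self, request, send):
        """Forward when ready, queue when not ready, reject explicitly when full."""
        if self.ready:
            send(request)
            return "sent"
        if len(self.pending) < self.capacity:
            self.pending.append(request)
            return "queued"
        self.rejected += 1
        return "rejected"

    def mark_ready(self, send):
        """Flush queued requests in arrival order once the backend is ready."""
        self.ready = True
        while self.pending:
            send(self.pending.popleft())


delivered = []
buffer = BoundedAdmission(capacity=3)  # hypothetical capacity
outcomes = [buffer.submit(i, delivered.append) for i in range(5)]
buffer.mark_ready(delivered.append)
outcomes.append(buffer.submit(5, delivered.append))
print(outcomes, delivered, buffer.rejected)
```

In this toy run the first three requests are queued and later delivered, two are rejected with an explicit signal, and the request that arrives after readiness is sent directly. A direct client with no buffer has no equivalent of the "queued" or "rejected" outcomes, so unserved work simply shows up as dropped iterations.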
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Bounded admission
Use bounded admission when requests can arrive before model readiness.
100% queued delivery; direct clients dropped 237-787 iterations.
Decision record
Small-request scheduler
Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.
Dynamic default kept the best delivery and token throughput.
Decision record
Run shape
4 cases across 512-1,536 prompt tokens and 128-512 output tokens, paired with 1 run profile.
Cases
4 cases
steady-small
steady homogeneous traffic for utilization baseline
burst-small
short burst that stresses queueing and tail latency
uneven-size-mix
weighted mix of short, medium, and long requests
spike-to-zero
rapid spike followed by scale-to-zero pressure
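The four case descriptions can be made concrete with a small schedule generator. This is a sketch under stated assumptions, not the lab's load generator: the `schedule` function, the request count, the 60-second window, the burst and spike widths, the 6:3:1 size weighting, and the medium request size are all hypothetical. Only the case names and the 512-1,536 prompt and 128-512 output token bounds come from the run shape.

```python
import random

# (prompt_tokens, output_tokens). SMALL and LARGE use the bounds of the stated
# token ranges; MEDIUM is a hypothetical midpoint.
SMALL, MEDIUM, LARGE = (512, 128), (1024, 256), (1536, 512)


def schedule(pattern, n=60, window_s=60.0, seed=0):
    """Return n tuples of (start_offset_s, prompt_tokens, output_tokens)."""
    rng = random.Random(seed)
    even = [i * window_s / n for i in range(n)]
    if pattern == "steady-small":
        # Evenly spaced, identical requests.
        return [(t, *SMALL) for t in even]
    if pattern == "burst-small":
        # Hypothetical: all requests start within the first 5% of the window.
        starts = sorted(rng.uniform(0, 0.05 * window_s) for _ in range(n))
        return [(t, *SMALL) for t in starts]
    if pattern == "uneven-size-mix":
        # Hypothetical 6:3:1 weighting of short, medium, and long requests.
        sizes = rng.choices([SMALL, MEDIUM, LARGE], weights=[6, 3, 1], k=n)
        return [(t, *size) for t, size in zip(even, sizes)]
    if pattern == "spike-to-zero":
        # Hypothetical: a tighter spike (first 2%), then no traffic for the
        # rest of the window, so the idle tail is part of the observation.
        starts = sorted(rng.uniform(0, 0.02 * window_s) for _ in range(n))
        return [(t, *SMALL) for t in starts]
    raise ValueError(f"unknown pattern: {pattern}")


for name in ("steady-small", "burst-small", "uneven-size-mix", "spike-to-zero"):
    plan = schedule(name)
    print(name, "last start:", round(plan[-1][0], 2), "s")
```

All four schedules issue the same number of requests over the same window, so they differ only in arrival timing and request size.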
Run profiles
1 profile
default
default checked-in serving profile for traffic pattern comparison
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Serving
GPU and cost
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show request-patterns
Example live command
./scripts/experiment run --experiment request-patterns --case steady-small --profile default
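To cover the whole matrix rather than a single case, the live command can be repeated once per case. The sketch below builds those invocations from the case names listed in the run shape. It assumes that `--case` accepts each of those names in the same way it accepts `steady-small`; the `run_command` and `run_all` helpers are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

CASES = ["steady-small", "burst-small", "uneven-size-mix", "spike-to-zero"]


def run_command(case, profile="default"):
    """Build the argv for one live run, mirroring the example live command."""
    return [
        "./scripts/experiment", "run",
        "--experiment", "request-patterns",
        "--case", case,
        "--profile", profile,
    ]


def run_all(dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for case in CASES:
        cmd = run_command(case)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all()
```

Running the cases sequentially keeps each traffic shape isolated on the same profile, which matches how the matrix compares them.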