Batching Scheduler Tradeoffs
How do vLLM scheduler limits trade throughput for p95/p99 latency?
Project: GPU Inference Decision Lab
Focus: Scheduler behavior
Run shape: 2 cases / 3 profiles
Evidence: Scheduler matrix (selected report)
Why it matters
This experiment compares constrained, limited, and default vLLM scheduler settings under steady and burst small-request traffic.
Scheduler caps should earn their keep. In the current 512/128 matrix (512 prompt tokens, 128 output tokens), explicit caps shed useful work and raised p95 latency instead of improving the latency envelope.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Dynamic defaults beat explicit caps
For steady and burst 512/128 traffic, the default vLLM scheduler delivered the most useful work with the best tail-latency profile among the tested options.
Steady default
100% delivered
p95 1.66s; 948.98 generated tokens/sec
Burst default
97.6% delivered
p95 6.79s; peak waiting 1
Caps under-delivered
8.4-80.1%
delivery range for limited and constrained profiles
Steady scheduler profile comparison
All three reports use the same steady 512/128 workload, making the scheduler limit tradeoff easy to compare.
Profile: dynamic-default
Outcome: Full delivery, no waiting
p95 latency: 1.66s
Delivery: 100%
GPU max: 85%
Profile: limited-batching
Outcome: Queueing and dropped work
p95 latency: 10.55s
Delivery: 80.1%
GPU max: 85%
Profile: constrained-scheduler
Outcome: Severe underdelivery
p95 latency: 59.72s
Delivery: 15.6%
GPU max: 88%
Burst scheduler profile comparison
Under burst pressure, the default profile still keeps delivery near 98% while explicit caps shed much more work.
Profile: dynamic-default
Outcome: Best burst profile
p95 latency: 6.79s
Delivery: 97.6%
Peak waiting: 1
Profile: limited-batching
Outcome: Queue-limited
p95 latency: 20.53s
Delivery: 45.5%
Peak waiting: 120
Profile: constrained-scheduler
Outcome: Overloaded reference
p95 latency: 119.09s
Delivery: 8.4%
Peak waiting: 127
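The two comparisons above can be checked mechanically. The sketch below copies the reported p95 and delivery figures into a small table and ranks the profiles per case. The ranking rule (highest delivery first, lower p95 as the tie-breaker) is an assumption made for illustration, not a rule stated by the project, and the names ProfileResult, best_profile, and capped_delivery_range are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileResult:
    case: str            # "steady" or "burst"
    profile: str         # run profile name from the report
    p95_s: float         # p95 latency in seconds
    delivery_pct: float  # delivered share of offered work, in percent


# Figures copied from the steady and burst comparisons in this summary.
RESULTS = [
    ProfileResult("steady", "dynamic-default", 1.66, 100.0),
    ProfileResult("steady", "limited-batching", 10.55, 80.1),
    ProfileResult("steady", "constrained-scheduler", 59.72, 15.6),
    ProfileResult("burst", "dynamic-default", 6.79, 97.6),
    ProfileResult("burst", "limited-batching", 20.53, 45.5),
    ProfileResult("burst", "constrained-scheduler", 119.09, 8.4),
]


def best_profile(results, case):
    """Assumed ranking: highest delivery wins, lower p95 breaks ties."""
    candidates = [r for r in results if r.case == case]
    return max(candidates, key=lambda r: (r.delivery_pct, -r.p95_s))


def capped_delivery_range(results):
    """Min and max delivery across the two explicitly capped profiles."""
    capped = [r.delivery_pct for r in results if r.profile != "dynamic-default"]
    return min(capped), max(capped)


if __name__ == "__main__":
    for case in ("steady", "burst"):
        best = best_profile(RESULTS, case)
        print(f"{case}: {best.profile}, {best.delivery_pct}% delivered, p95 {best.p95_s}s")
    print("capped delivery range:", capped_delivery_range(RESULTS))
```

Under this rule dynamic-default ranks first in both cases, and the capped profiles span the 8.4-80.1% delivery range quoted in the summary cards.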
Evidence boundary
This supports dynamic defaults for the current small homogeneous workload. Mixed-size and fairness-oriented runs are still needed before making a general scheduler-cap policy.
- Dynamic-default means vLLM's default scheduler settings, not disabled batching.
- The matrix covers 512/128 steady and burst traffic; it does not prove behavior for long-context or mixed-size workloads.
- Cost denominators are covered by the separate cost experiment, not this scheduler matrix.
Selected reports
Generated reports behind this summary.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Long-context scheduler caps
Do not use sequence caps or larger batched-token caps as the first fix at 1.20 req/s.
seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.
View decision record →
Small-request scheduler
Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.
Dynamic default kept the best delivery and token throughput.
View decision record →
Useful-work cost
Use batching for small-request economics, but gate burst SLO claims.
$0.019752/1K steady optimized; burst optimized p95 still 10.91s.
View decision record →
Run shape
2 cases, each using 512 prompt tokens and 128 output tokens, paired with 3 run profiles.
Cases
2 cases
steady-512-output-128
steady homogeneous requests for scheduler profile comparison
burst-512-output-128
burst traffic to expose queueing and p99 latency
Run profiles
3 profiles
constrained-scheduler
one active sequence with a minimal batched-token budget
limited-batching
moderate sequence and batched-token limits
dynamic-default
vLLM default dynamic scheduler settings
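As a rough illustration of how the three profiles could map onto vLLM scheduler limits, the sketch below builds a `vllm serve` argument list per profile. The flags `--max-num-seqs` and `--max-num-batched-tokens` are real vLLM engine arguments, but every numeric value here is a placeholder: the source gives only "one active sequence" for the constrained profile and describes the other limits in words. The model name and the helper vllm_serve_args are also hypothetical.

```python
# Placeholder overrides. Only max_num_seqs=1 for the constrained profile is
# grounded in the profile description; all other numbers are assumptions.
PROFILE_OVERRIDES = {
    "constrained-scheduler": {"max_num_seqs": 1, "max_num_batched_tokens": 1024},
    "limited-batching": {"max_num_seqs": 8, "max_num_batched_tokens": 4096},
    # dynamic-default passes no overrides, so vLLM keeps its own defaults.
    "dynamic-default": {},
}


def vllm_serve_args(model, profile):
    """Build a `vllm serve` argument list for one run profile."""
    args = ["vllm", "serve", model]
    for key, value in PROFILE_OVERRIDES[profile].items():
        # max_num_seqs -> --max-num-seqs, and so on.
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


if __name__ == "__main__":
    for name in PROFILE_OVERRIDES:
        # "your-org/your-model" is a stand-in, not the model used in the runs.
        print(" ".join(vllm_serve_args("your-org/your-model", name)))
```

Depending on the vLLM version and chunked-prefill settings, vLLM may validate the batched-token budget against the model's context length, so real values would need checking against the deployed configuration.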
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Serving
Cost
Usage
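For the latency group, tail percentiles such as p95 and p99 can be computed from per-request latencies. The sketch below uses the nearest-rank definition, which is one common convention; the source does not say which percentile method the reports use, so treat this as an assumption. The function name percentile_nearest_rank is hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the nearest-rank percentile of a non-empty sequence of numbers.

    The nearest-rank value is the smallest sample such that at least
    pct percent of all samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative latencies in seconds, not measured data.
    latencies = [float(n) for n in range(1, 21)]
    print("p95:", percentile_nearest_rank(latencies, 95))
    print("p99:", percentile_nearest_rank(latencies, 99))
```

With few samples, p99 collapses onto the slowest request, which is one reason burst runs need enough requests before p99 claims are meaningful.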
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show batching
Example live command
./scripts/experiment run --experiment batching --case steady-512-output-128 --profile dynamic-default
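The live command above covers one cell of the 2 x 3 matrix. A minimal sketch for generating all six invocations follows, assuming the script accepts every case and profile name listed under Run shape with the same `--experiment`, `--case`, and `--profile` flags. The helper matrix_commands is hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex
from itertools import product

# Case and profile names as listed under Run shape.
CASES = ["steady-512-output-128", "burst-512-output-128"]
PROFILES = ["constrained-scheduler", "limited-batching", "dynamic-default"]


def matrix_commands(experiment="batching"):
    """Return one argument list per (case, profile) pair."""
    return [
        [
            "./scripts/experiment", "run",
            "--experiment", experiment,
            "--case", case,
            "--profile", profile,
        ]
        for case, profile in product(CASES, PROFILES)
    ]


if __name__ == "__main__":
    for cmd in matrix_commands():
        # shlex.join quotes arguments safely for copy-paste into a shell.
        print(shlex.join(cmd))
```

Printing instead of executing keeps the sketch safe to run anywhere; each line could be handed to subprocess.run on a machine with cluster access.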