Batching Scheduler Tradeoffs

How do vLLM scheduler limits trade throughput for p95/p99 latency?

Project

GPU Inference Decision Lab

Focus

Scheduler behavior

Run shape

2 cases / 3 profiles

Evidence

Selected report: Scheduler matrix

Why it matters

Compares constrained, limited, and default vLLM scheduler settings under steady and burst small-request traffic.

Scheduler caps should earn their keep. In the current 512/128 matrix, explicit caps shed useful work before they improve the latency envelope.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Scheduler matrix
Latest reports: 2026-05-14

Dynamic defaults beat explicit caps

For steady and burst 512/128 traffic, the default vLLM scheduler delivered the most useful work with the best tail-latency profile among the tested options.

Steady default

100% delivered

p95 1.66s; 948.98 generated tokens/sec

Burst default

97.6% delivered

p95 6.79s; peak waiting 1

Caps under-delivered

8.4-80.1%

delivery range for limited and constrained profiles

Steady scheduler profile comparison

All three reports use the same steady 512/128 workload, making the scheduler limit tradeoff easy to compare.

Profile: dynamic-default

Outcome: Full delivery, no waiting

p95 latency: 1.66s

Delivery: 100%

GPU max: 85%

Profile: limited-batching

Outcome: Queueing and dropped work

p95 latency: 10.55s

Delivery: 80.1%

GPU max: 85%

Profile: constrained-scheduler

Outcome: Severe underdelivery

p95 latency: 59.72s

Delivery: 15.6%

GPU max: 88%

Burst scheduler profile comparison

Under burst pressure, the default profile still keeps delivery near 98% while explicit caps shed much more work.

Profile: dynamic-default

Outcome: Best burst profile

p95 latency: 6.79s

Delivery: 97.6%

Peak waiting: 1

Profile: limited-batching

Outcome: Queue-limited

p95 latency: 20.53s

Delivery: 45.5%

Peak waiting: 120

Profile: constrained-scheduler

Outcome: Overloaded reference

p95 latency: 119.09s

Delivery: 8.4%

Peak waiting: 127
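
To make the two comparisons easier to read side by side, here is a minimal sketch that expresses each capped profile relative to dynamic-default. The p95 and delivery figures are copied from the steady and burst comparisons above. The `Run` class and the `relative_to_default` helper are illustrative names, not part of the project's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    profile: str
    p95_s: float         # p95 request latency in seconds
    delivery_pct: float  # delivery percentage reported for the run


# Figures copied from the steady and burst profile comparisons.
STEADY = [
    Run("dynamic-default", 1.66, 100.0),
    Run("limited-batching", 10.55, 80.1),
    Run("constrained-scheduler", 59.72, 15.6),
]
BURST = [
    Run("dynamic-default", 6.79, 97.6),
    Run("limited-batching", 20.53, 45.5),
    Run("constrained-scheduler", 119.09, 8.4),
]


def relative_to_default(runs):
    """Return {profile: (p95 multiple of default, delivery delta in points)}."""
    base = next(r for r in runs if r.profile == "dynamic-default")
    return {
        r.profile: (
            round(r.p95_s / base.p95_s, 1),
            round(r.delivery_pct - base.delivery_pct, 1),
        )
        for r in runs
    }


if __name__ == "__main__":
    for name, runs in (("steady", STEADY), ("burst", BURST)):
        for profile, (p95_x, delta) in relative_to_default(runs).items():
            print(f"{name:6} {profile:22} p95 x{p95_x:<5} delivery {delta:+.1f} pts")
```

In both cases the capped profiles show a higher p95 and a lower delivery percentage than the default, which is the pattern the headline summarizes.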

Evidence boundary

This supports dynamic defaults for the current small, homogeneous workload. Mixed-size and fairness-oriented runs are still needed before setting a general scheduler-cap policy.

  • Dynamic-default means vLLM's default scheduler settings, not disabled batching.
  • The matrix covers 512/128 steady and burst traffic; it does not prove behavior for long-context or mixed-size workloads.
  • Cost denominators are covered by the separate cost experiment, not this scheduler matrix.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Rejected / Long-context scheduling

Long-context scheduler caps

Do not use sequence caps or larger batched-token caps as the first fix at 1.20 req/s.

seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.

Supported / Long-context scheduling

Small-request scheduler

Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.

Dynamic default kept the best delivery and token throughput.

Supported / Cost + autoscaling

Useful-work cost

Use batching for small-request economics, but gate burst SLO claims.

$0.019752/1K steady optimized; burst optimized p95 still 10.91s.


Run shape

2 cases, both using 512 prompt tokens and 128 output tokens, paired with 3 run profiles.

Cases

2 cases

steady-512-output-128

steady homogeneous requests for scheduler profile comparison

512 prompt tokens / 128 output tokens

burst-512-output-128

burst traffic to expose queueing and p99 latency

512 prompt tokens / 128 output tokens

Run profiles

3 profiles

constrained-scheduler

one active sequence with a minimal batched-token budget

limited-batching

moderate sequence and batched-token limits

dynamic-default

vLLM default dynamic scheduler settings
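
The profiles are described only qualitatively here. As a rough sketch, they could map onto vLLM's `--max-num-seqs` and `--max-num-batched-tokens` serve flags as shown below. The numeric limits are hypothetical placeholders, since this page does not publish the actual values. Only `max_num_seqs = 1` follows from the "one active sequence" description. `MODEL_NAME` and the `serve_args` helper are also illustrative.

```python
# HYPOTHETICAL limits: the report does not state the real values per profile.
PROFILE_LIMITS = {
    # "one active sequence with a minimal batched-token budget"
    "constrained-scheduler": {"max_num_seqs": 1, "max_num_batched_tokens": 2048},
    # "moderate sequence and batched-token limits"
    "limited-batching": {"max_num_seqs": 8, "max_num_batched_tokens": 4096},
    # No overrides: vLLM applies its own default scheduler settings.
    "dynamic-default": {},
}


def serve_args(profile, model="MODEL_NAME"):
    """Build a `vllm serve` argument list for one scheduler profile."""
    args = ["vllm", "serve", model]
    for key, value in PROFILE_LIMITS[profile].items():
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


if __name__ == "__main__":
    for name in PROFILE_LIMITS:
        print(" ".join(serve_args(name)))
```

Leaving the dynamic-default entry empty reflects the note in the evidence boundary: batching is not disabled, the scheduler limits are simply not overridden.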

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 request latency, p95 request latency, p99 request latency, p50 TTFT, p95 TTFT

Serving

requests/sec, generation tokens/sec, peak waiting requests, peak running requests, peak active requests

Cost

cost per 1K successful requests, cost per 1M generated tokens
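
For readers who want to reproduce the latency and cost metrics from raw samples, the sketch below shows one common way to compute them. It assumes a nearest-rank percentile and a cost model of GPU hourly price times run duration. The project's own definitions may differ, and both helper names and the sample values are illustrative.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


def cost_per_1k_successful(gpu_hourly_usd, duration_s, successful_requests):
    """ASSUMED cost model: GPU price for the run, spread over successful requests."""
    if successful_requests <= 0:
        raise ValueError("need at least one successful request")
    run_cost = gpu_hourly_usd * duration_s / 3600
    return run_cost / successful_requests * 1000


if __name__ == "__main__":
    latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies_ms, p)} ms")
    print(cost_per_1k_successful(gpu_hourly_usd=3.6, duration_s=1000,
                                 successful_requests=500))
```

Dividing by successful requests rather than offered requests matters here. A profile that sheds work gets a higher cost per 1K even when GPU utilization looks similar.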

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show batching

Example live command

./scripts/experiment run --experiment batching --case steady-512-output-128 --profile dynamic-default