Tony Lee

Request Pattern Utilization

How do steady, burst, uneven-size, and spike-to-zero traffic patterns affect GPU occupancy?

Project

GPU Inference Decision Lab

Focus

Traffic shape

Run shape

4 cases / 1 profile

Evidence

Selected report: Pattern matrix

Why it matters

Runs steady, burst, uneven-size, and spike-to-zero traffic against the same default profile.

Average load can hide the operating problem. The same serving profile behaves differently when traffic arrives steadily, in bursts, or as a mixed-size workload.
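To make the point about average load concrete, here is a minimal sketch (not the project's load generator) that places the same number of requests in the same window, once evenly and once as a burst, and then counts the peak number of overlapping requests. The request count, window length, burst length, and service time are hypothetical values chosen only for illustration.

```python
from typing import List


def steady_arrivals(n: int, window_s: float) -> List[float]:
    """Spread n requests evenly across the window."""
    return [i * window_s / n for i in range(n)]


def burst_arrivals(n: int, burst_s: float) -> List[float]:
    """Pack the same n requests into the first burst_s seconds."""
    return [i * burst_s / n for i in range(n)]


def peak_in_flight(arrivals: List[float], service_s: float) -> int:
    """Peak number of overlapping requests if each takes service_s seconds."""
    events = []
    for t in arrivals:
        events.append((t, 1))               # request starts
        events.append((t + service_s, -1))  # request finishes
    events.sort()  # at equal times, finishes (-1) sort before starts (+1)
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Hypothetical workload: 120 requests per 60 s window in both cases,
# so the average request rate is identical.
steady = steady_arrivals(120, window_s=60.0)
burst = burst_arrivals(120, burst_s=2.0)

print("steady peak in flight:", peak_in_flight(steady, service_s=0.99))
print("burst peak in flight:", peak_in_flight(burst, service_s=0.99))
```

Both schedules average two requests per second, yet the burst schedule stacks far more requests on the server at once, which is the effect the traffic matrix measures on real hardware.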

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Pattern matrix

Latest reports: 2026-05-15

Traffic shape changes the result

The default profile stayed clean under steady traffic, but burst and spike-to-zero traffic pushed peak active requests to 127-128 and dropped client-side work.

Steady traffic

1.29s p95

100% delivered; peak active 7

Burst traffic

87.5% delivered

8.56s p95; peak active 127

Uneven mix

99.7% delivered

7.87s p95; mixed tail widened

Default-profile traffic matrix

The same serving profile has different delivery and latency behavior depending on traffic shape.

Pattern: steady-small

Delivery: 100.0%

p95 latency: 1.29s

Peak active: 7

Avg / max GPU: 69.0% / 84%

Pattern: burst-small

Delivery: 87.5%

p95 latency: 8.56s

Peak active: 127

Avg / max GPU: 77.3% / 79%

Pattern: uneven-size-mix

Delivery: 99.7%

p95 latency: 7.87s

Peak active: 25

Avg / max GPU: 65.2% / 84%

Pattern: spike-to-zero

Delivery: 79.8%

p95 latency: 8.45s

Peak active: 128

Avg / max GPU: 75.5% / 76%
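One way to read the matrix is relative to the steady baseline. The sketch below copies the delivery, p95, and peak-active figures from the matrix above and derives two comparison values per pattern: the p95 ratio against steady-small and the share of work that was not delivered. The `PatternResult` type and `compare_to_baseline` helper are hypothetical names used for illustration; they are not part of the project's tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PatternResult:
    pattern: str
    delivery_pct: float
    p95_s: float
    peak_active: int


# Values copied from the default-profile traffic matrix.
MATRIX: List[PatternResult] = [
    PatternResult("steady-small", 100.0, 1.29, 7),
    PatternResult("burst-small", 87.5, 8.56, 127),
    PatternResult("uneven-size-mix", 99.7, 7.87, 25),
    PatternResult("spike-to-zero", 79.8, 8.45, 128),
]


def compare_to_baseline(
    rows: List[PatternResult], baseline: str = "steady-small"
) -> Dict[str, Dict[str, float]]:
    """Return p95 ratio vs. the baseline and undelivered share per pattern."""
    base = next(r for r in rows if r.pattern == baseline)
    return {
        r.pattern: {
            "p95_ratio": round(r.p95_s / base.p95_s, 2),
            "undelivered_pct": round(100.0 - r.delivery_pct, 1),
        }
        for r in rows
    }


for name, stats in compare_to_baseline(MATRIX).items():
    print(f"{name:16s} p95 x{stats['p95_ratio']:<5} undelivered {stats['undelivered_pct']}%")
```

Read this way, the uneven-size mix shows a tail penalty similar to the burst cases while losing almost no work, which matches the summary that it "preserved delivery but widened the tail".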

Evidence boundary

These reports use direct clients with no admission buffer. Dropped iterations are unserved client work, not successful backpressure handling.

  • Steady 512/128 traffic (512 prompt tokens, 128 output tokens) delivered 100% of work with p95 near 1.3s.
  • Burst and spike-to-zero traffic reached 127-128 active requests and dropped work.
  • The uneven-size mix preserved delivery but widened the tail; the report does not include per-shape latency buckets.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported

Admission + readiness

Bounded admission

Use bounded admission when requests can arrive before model readiness.

100% queued delivery; direct clients dropped 237-787 iterations.

Decision record
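The bounded-admission idea can be sketched with a fixed-size queue that holds requests until the model reports ready. This is a minimal illustration under stated assumptions, not the project's admission layer: the queue size, request count, readiness delay, and the `run_demo` helper are all hypothetical, and clients here wait when the buffer is full, whereas a real admission layer might reject instead.

```python
import asyncio
from typing import List


async def run_demo(
    n_requests: int = 20, queue_size: int = 8, ready_after_s: float = 0.05
) -> List[int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)  # bounded buffer
    ready = asyncio.Event()
    served: List[int] = []

    async def client(i: int) -> None:
        # put() waits while the buffer is full, so the client is held back
        # instead of having its request dropped.
        await queue.put(i)

    async def worker() -> None:
        await ready.wait()  # forward nothing until the model is ready
        while True:
            item = await queue.get()
            served.append(item)  # stand-in for the real inference call
            queue.task_done()

    async def become_ready() -> None:
        await asyncio.sleep(ready_after_s)  # simulated model load time
        ready.set()

    worker_task = asyncio.create_task(worker())
    ready_task = asyncio.create_task(become_ready())

    # All requests arrive before readiness; none are lost.
    await asyncio.gather(*(client(i) for i in range(n_requests)))
    await queue.join()

    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    await ready_task
    return served


if __name__ == "__main__":
    done = asyncio.run(run_demo())
    print(f"served {len(done)} requests")
```

The design choice is that the buffer bound caps how much work can pile up ahead of the server, while the readiness gate keeps early arrivals from hitting a model that cannot serve them yet.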

Supported

Long-context scheduling

Small-request scheduler

Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.

Dynamic default kept the best delivery and token throughput.

Decision record

Run shape

4 cases across 512-1,536 prompt tokens and 128-512 output tokens, paired with 1 run profile.

Cases

4 cases

steady-small

steady homogeneous traffic for utilization baseline

512 prompt tokens, 128 output tokens

burst-small

short burst that stresses queueing and tail latency

512 prompt tokens, 128 output tokens

uneven-size-mix

weighted mix of short, medium, and long requests

1,536 prompt tokens, 512 output tokens

spike-to-zero

rapid spike followed by scale-to-zero pressure

512 prompt tokens, 128 output tokens
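For the uneven-size-mix case, a weighted mix of request shapes could be drawn as in the sketch below. The page does not list the actual shapes or weights, so everything here is an assumption: the short shape reuses 512/128 from the small cases, the long shape reuses the 1,536/512 figures listed for the mix, and the medium shape and all three weights are invented for illustration.

```python
import random
from collections import Counter
from typing import List, Tuple

# (label, prompt_tokens, output_tokens)
SHAPES: List[Tuple[str, int, int]] = [
    ("short", 512, 128),    # assumption: same shape as the small cases
    ("medium", 1024, 256),  # hypothetical midpoint, not from the report
    ("long", 1536, 512),    # assumption: the shape listed for the mix
]
WEIGHTS = [0.6, 0.3, 0.1]   # hypothetical weights


def sample_requests(n: int, seed: int = 0) -> List[Tuple[str, int, int]]:
    """Draw n request shapes according to WEIGHTS, reproducibly."""
    rng = random.Random(seed)
    return rng.choices(SHAPES, weights=WEIGHTS, k=n)


if __name__ == "__main__":
    counts = Counter(label for label, _, _ in sample_requests(1000))
    print(dict(counts))
```

Seeding the generator keeps the mix identical between runs, which matters when comparing profiles against the same traffic.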

Run profiles

1 profile

default

default checked-in serving profile for traffic pattern comparison

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

  • p50 request latency
  • p95 request latency
  • p99 request latency

Serving

  • requests/sec
  • generation tokens/sec
  • peak waiting requests
  • peak running requests
  • peak active requests

GPU and cost

  • average GPU utilization
  • max GPU utilization
  • cost per 1K successful requests
  • cost per 1M generated tokens
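The latency and cost metrics above can be derived from raw run data roughly as follows. This is a sketch under assumptions: it uses the nearest-rank percentile definition (the project's tooling may interpolate differently), and it prices a run as GPU hourly cost times wall-clock duration. The function names and the example price, duration, and counts are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def run_cost_usd(gpu_hourly_usd: float, run_seconds: float) -> float:
    """Cost of keeping the GPU allocated for the whole run."""
    return gpu_hourly_usd * run_seconds / 3600


def cost_per_1k_successful(gpu_hourly_usd: float, run_seconds: float,
                           successful_requests: int) -> float:
    # Dropped requests still consume GPU time but do not count as output,
    # so lower delivery raises the cost per successful request.
    return run_cost_usd(gpu_hourly_usd, run_seconds) / successful_requests * 1_000


def cost_per_1m_tokens(gpu_hourly_usd: float, run_seconds: float,
                       generated_tokens: int) -> float:
    return run_cost_usd(gpu_hourly_usd, run_seconds) / generated_tokens * 1_000_000


if __name__ == "__main__":
    latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 2.5, 4.0]  # hypothetical seconds
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies, p)}s")
    # Hypothetical run: $2.00/hour GPU, 30 minutes, 500 successes, 64,000 tokens.
    print(cost_per_1k_successful(2.0, 1800, 500))
    print(cost_per_1m_tokens(2.0, 1800, 64_000))
```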

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show request-patterns

Example live command

./scripts/experiment run --experiment request-patterns --case steady-small --profile default
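To cover the whole matrix rather than one case, the live command could be repeated for each case name listed above. The sketch below only builds and prints the commands. It assumes the other three cases accept the same flags as the steady-small example, which this page does not confirm, and `build_run_commands` is a hypothetical helper rather than part of the repository.

```python
import shlex
from typing import List

# Case names as listed under "Cases".
CASES = ["steady-small", "burst-small", "uneven-size-mix", "spike-to-zero"]


def build_run_commands(profile: str = "default") -> List[List[str]]:
    """Build one live-run command per case, mirroring the example command."""
    return [
        [
            "./scripts/experiment", "run",
            "--experiment", "request-patterns",
            "--case", case,
            "--profile", profile,
        ]
        for case in CASES
    ]


if __name__ == "__main__":
    for cmd in build_run_commands():
        # Printed rather than executed; to run for real, one option is
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```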