Tony Lee

Prefill vs Decode Timing

How do prompt-heavy and decode-heavy requests change TTFT and inter-token timing?

Project

GPU Inference Decision Lab

Focus

Streaming latency

Run shape

3 cases / 4 profiles

Evidence

Selected report: Streaming split

Why it matters

Uses a streaming client to separate time to first token from inter-token latency across prompt-heavy and decode-heavy shapes.

Similar total durations can stress different serving phases. Splitting prefill from decode makes streaming UX easier to reason about.
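The split described above can be sketched from per-token arrival timestamps. This is a minimal illustration, not the lab's streaming client: the names `StreamTiming` and `split_stream_timing` are hypothetical, and it assumes the client records a send time plus one wall-clock timestamp per streamed token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    ttft_s: float               # send time -> first token (prefill-dominated)
    inter_token_s: List[float]  # gaps between consecutive tokens (decode-dominated)
    total_s: float              # send time -> last token


def split_stream_timing(sent_at: float, token_times: List[float]) -> StreamTiming:
    """Separate time to first token from inter-token gaps for one request.

    Hypothetical helper: assumes token_times is ordered and in the same
    clock as sent_at.
    """
    if not token_times:
        raise ValueError("stream produced no tokens")
    ttft = token_times[0] - sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return StreamTiming(ttft_s=ttft, inter_token_s=gaps, total_s=token_times[-1] - sent_at)
```

Two requests with the same `total_s` can differ sharply in `ttft_s` and in the gap list, which is the distinction this experiment measures.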

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Streaming split

Latest reports: 2026-05-06

Mixed streaming shape split

The mixed streaming run completed 640 requests at concurrency 24, split by prefill-heavy versus decode-heavy shapes.

Mixed run

12.52s p95

640 streamed requests at concurrency 24

Prefill shape

1.52s p95

303 mixed-run requests; p95 TTFT 495 ms

Decode shape

12.79s p95

337 mixed-run requests; p95 TTFT 443 ms

Mixed run shape split

The mixed result hides two latency profiles: prefill-heavy requests finish quickly, while decode-heavy requests dominate total latency.

Shape: prefill-heavy

Completed: 303

p95 total: 1.52s

p95 TTFT: 495 ms

Tokens/sec: 69.03

Shape: decode-heavy

Completed: 337

p95 total: 12.79s

p95 TTFT: 443 ms

Tokens/sec: 66.97
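A per-shape split like the one above can be computed by grouping request records before taking the percentile. The sketch below is an assumption about method, not the report's code: `percentile` uses the nearest-rank definition, and the report may use a different interpolation. The function names and the `(shape, total_s)` record layout are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (assumed definition, no interpolation)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_shape(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (shape, total_latency_s) records and return p95 per shape."""
    by_shape: Dict[str, List[float]] = defaultdict(list)
    for shape, total_s in records:
        by_shape[shape].append(total_s)
    return {shape: percentile(vals, 95) for shape, vals in by_shape.items()}
```

Computing one p95 over all 640 requests yields a figure close to the decode-heavy tail, which is why the blended number alone hides the fast prefill-heavy profile.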

Isolated shape baseline

The isolated concurrency-16 runs show the single-shape TTFT and total-latency contrast before mixed-workload interference.

Case: prefill-heavy

Timing split: Longer TTFT, short total

p95 total: 2.24s

p95 TTFT: 1.37s

GPU max: 95%

Case: decode-heavy

Timing split: Fast TTFT, longer total

p95 total: 8.41s

p95 TTFT: 149 ms

GPU max: 84%

Mixed profile follow-up

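The profile variants capped sequence count or batched tokens, and none improved on the default mixed-shape envelope. The variants ran at concurrency 32 while the default ran at 24, so the comparison is not like-for-like.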

Profile: default

Run shape: 640 samples / 24 concurrency

p95 total: 12.52s

p99 total: 12.93s

GPU max: 93%

Profile: max-seqs-16

Run shape: 640 samples / 32 concurrency

p95 total: 17.56s

p99 total: 18.57s

GPU max: 97%

Profile: max-seqs-8

Run shape: 640 samples / 32 concurrency

p95 total: 29.88s

p99 total: 133.64s

GPU max: 90%

Profile: batched-tokens-4096

Run shape: 640 samples / 32 concurrency

p95 total: 20.89s

p99 total: 125.51s

GPU max: 91%

Evidence boundary

The mixed run is the best page-level evidence because it exercises both shapes together. Isolated runs are baselines; max-seqs and batched-token variants still need a curated conclusion.

  • The mixed default run peaked at 1 waiting request with 24 active streams and reached 93% max GPU utilization.
  • The max-seqs-8 and batched-tokens-4096 variants produced much worse p99 latency in the 640-sample comparison: 133.64s and 125.51s, against 12.93s for the default profile.
  • Cost and SLO fields were not collected for these streaming reports.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported

Long-context scheduling

Long-context boundary

Set a concurrency or admission boundary for 8192/300 traffic.

1.20 req/s still delivers 100%, but repeats 36.8s p95 queue delay.

Decision record

Run shape

3 cases across 128-1,536 prompt tokens and 64-768 output tokens, paired with 4 run profiles.

Cases

3 cases

prefill-heavy

long prompt short output

1,536 prompt tokens / 64 output tokens

decode-heavy

short prompt long output

128 prompt tokens / 768 output tokens

mixed-prefill-decode

50/50 streamed mix of prompt-heavy and decode-heavy requests

1,536 prompt tokens / 768 output tokens
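The mixed case can be pictured as drawing each request's shape at random from the two single-shape cases. The sketch below is an assumption about how such a mix could be built, not the lab's sampler. The reported 303/337 split is consistent with independent draws rather than strict alternation, but the page does not state the method. `SHAPES` and `sample_mixed_shapes` are hypothetical names.

```python
import random
from typing import List

# (prompt tokens, output tokens) for the two single-shape cases listed above.
SHAPES = {
    "prefill-heavy": (1536, 64),
    "decode-heavy": (128, 768),
}


def sample_mixed_shapes(n: int, seed: int = 0) -> List[str]:
    """Draw n request shapes with equal probability (assumed sampling scheme)."""
    rng = random.Random(seed)  # seeded so a run plan is reproducible
    names = list(SHAPES)
    return [rng.choice(names) for _ in range(n)]


plan = sample_mixed_shapes(640)
counts = {name: plan.count(name) for name in SHAPES}
```

With independent draws the per-shape counts vary around 320 each from seed to seed, so per-shape percentiles rest on slightly different sample sizes.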

Run profiles

4 profiles

default

default checked-in serving profile

max-seqs-16

sequence cap for mixed streaming comparison

max-seqs-8

smaller active-set cap for mixed streaming comparison

batched-tokens-4096

batched-token cap without a sequence-count cap

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Streaming latency

p50 TTFT / p95 TTFT / p50 inter-token latency / p95 inter-token latency

Request latency

p95 request latency / p99 request latency / generation tokens/sec

GPU

average GPU utilization / max GPU utilization

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show prefill-decode

Example live command

./scripts/experiment run-stream --experiment prefill-decode --case mixed-prefill-decode --profile default --samples 640 --concurrency 24