Prefill vs Decode Timing

How do prompt-heavy and decode-heavy requests change TTFT and inter-token timing?


Project: GPU Inference Decision Lab

Focus: Streaming latency

Run shape: 3 cases / 4 profiles

Evidence: Selected report: Streaming split

Why it matters

This experiment uses a streaming client to separate time to first token (TTFT) from inter-token latency across prompt-heavy and decode-heavy request shapes.

Similar total durations can stress different serving phases. Splitting prefill from decode makes streaming UX easier to reason about.
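As a minimal sketch of the split described above, the snippet below derives TTFT and inter-token gaps from per-token arrival timestamps for a single streamed request. The `StreamTiming` type, the `split_stream_timing` helper, and the sample timestamps are hypothetical illustrations, not part of the project's streaming client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    ttft_s: float               # request sent -> first token received
    inter_token_s: List[float]  # gaps between consecutive tokens
    total_s: float              # request sent -> last token received


def split_stream_timing(sent_at: float, token_times: List[float]) -> StreamTiming:
    """Split one streamed response into TTFT and inter-token gaps.

    sent_at and token_times are clock readings in seconds (hypothetical
    inputs; a real client would record one reading per received chunk).
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return StreamTiming(ttft, gaps, token_times[-1] - sent_at)


# Illustrative timestamps only; these are not values from the reports.
prefill_like = split_stream_timing(0.0, [1.20, 1.21, 1.22])
decode_like = split_stream_timing(0.0, [0.15, 0.16, 0.17, 0.18, 0.19, 0.20])
```

In this toy input the prefill-like request spends most of its time before the first token, while the decode-like request spends most of its time between tokens, which is the contrast the experiment is designed to expose.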

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Streaming split

Latest reports: 2026-05-06

Mixed streaming shape split

The mixed streaming run completed 640 requests at concurrency 24, split by prefill-heavy versus decode-heavy shapes.

Mixed run: 12.52s p95 (640 streamed requests at concurrency 24)

Prefill shape: 1.52s p95 (303 mixed-run requests; p95 TTFT 495 ms)

Decode shape: 12.79s p95 (337 mixed-run requests; p95 TTFT 443 ms)

Mixed run shape split

The mixed result hides two latency profiles: prefill-heavy requests finish quickly, while decode-heavy requests dominate total latency.

| Shape | Completed | p95 total | p95 TTFT | Tokens/sec |
| --- | --- | --- | --- | --- |
| prefill-heavy | 303 | 1.52s | 495 ms | 69.03 |
| decode-heavy | 337 | 12.79s | 443 ms | 66.97 |
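The sketch below illustrates why a single mixed p95 can hide two latency profiles: it groups per-request totals by shape and computes a percentile per group. The `percentile` and `p95_by_shape` helpers and the synthetic records are assumptions for illustration; the reports' own percentile method is not stated in the source, so nearest-rank is used here as one common definition.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_shape(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (shape, total_seconds) records and return the p95 per shape."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for shape, total_s in records:
        groups[shape].append(total_s)
    return {shape: percentile(vals, 95) for shape, vals in groups.items()}


# Synthetic records for illustration only; not data from the mixed run.
records = [("prefill-heavy", 1.0 + 0.01 * i) for i in range(50)]
records += [("decode-heavy", 10.0 + 0.05 * i) for i in range(50)]

overall_p95 = percentile([total for _, total in records], 95)
per_shape = p95_by_shape(records)
```

With this synthetic input the overall p95 lands inside the decode-heavy group, so it says almost nothing about how the prefill-heavy requests behaved.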

Isolated shape baseline

The isolated concurrency-16 runs show the single-shape TTFT and total-latency contrast before mixed-workload interference.

| Case | Timing split | p95 total | p95 TTFT | GPU max |
| --- | --- | --- | --- | --- |
| prefill-heavy | Longer TTFT, short total | 2.24s | 1.37s | 95% |
| decode-heavy | Fast TTFT, longer total | 8.41s | 149 ms | 84% |
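The isolated baseline figures above can be turned into a rough per-token decode estimate. The sketch below divides (p95 total minus p95 TTFT) by the number of inter-token gaps. It rests on two loud assumptions: that every request produced exactly the case's configured output length (64 or 768 tokens), and that subtracting one percentile from another is a usable approximation, which it is not in general because percentiles are not additive. The `rough_gap_ms` helper is hypothetical, and the result is not a substitute for the measured p50/p95 inter-token latency.

```python
def rough_gap_ms(p95_total_s: float, p95_ttft_s: float, output_tokens: int) -> float:
    """Rough mean inter-token gap in milliseconds.

    Approximates decode time as (total - TTFT) and spreads it over the
    (output_tokens - 1) gaps that follow the first token.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to have a gap")
    return (p95_total_s - p95_ttft_s) / (output_tokens - 1) * 1000.0


# Inputs are the isolated concurrency-16 figures and the case token counts.
isolated = {
    "prefill-heavy": rough_gap_ms(2.24, 1.37, 64),
    "decode-heavy": rough_gap_ms(8.41, 0.149, 768),
}
```

Under these assumptions both shapes come out on the order of 10 to 15 ms per token, which suggests the difference in total latency is driven mainly by output length and TTFT rather than by a large difference in per-token decode speed.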

Mixed profile follow-up

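The profile variants show that capping sequence count or batched tokens did not improve the mixed-shape envelope. The default profile ran at concurrency 24 while the three capped variants ran at concurrency 32, so the comparison is not like-for-like.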

| Profile | Run shape | p95 total | p99 total | GPU max |
| --- | --- | --- | --- | --- |
| default | 640 samples / 24 concurrency | 12.52s | 12.93s | 93% |
| max-seqs-16 | 640 samples / 32 concurrency | 17.56s | 18.57s | 97% |
| max-seqs-8 | 640 samples / 32 concurrency | 29.88s | 133.64s | 90% |
| batched-tokens-4096 | 640 samples / 32 concurrency | 20.89s | 125.51s | 91% |
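One way to read the profile figures above is the ratio of p99 to p95 total latency, which indicates how sharply the tail departs from the bulk of requests. The sketch below computes that ratio from the reported values; the `tail_ratio` name is an illustrative choice, not a metric defined by the project.

```python
# (p95 total, p99 total) in seconds, copied from the profile follow-up figures.
profiles = {
    "default": (12.52, 12.93),
    "max-seqs-16": (17.56, 18.57),
    "max-seqs-8": (29.88, 133.64),
    "batched-tokens-4096": (20.89, 125.51),
}

# A ratio near 1 means the slowest 1% look like the slowest 5%;
# a large ratio means a small set of requests waited far longer.
tail_ratio = {name: p99 / p95 for name, (p95, p99) in profiles.items()}

for name, ratio in tail_ratio.items():
    print(f"{name}: p99/p95 = {ratio:.2f}")
```

The default and max-seqs-16 profiles stay close to 1, while max-seqs-8 and batched-tokens-4096 exceed 4, which matches the note that those two variants produced much worse p99 latency.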

Evidence boundary

The mixed run is the best page-level evidence because it exercises both shapes together. Isolated runs are baselines; max-seqs and batched-token variants still need a curated conclusion.

The mixed default run peaked at 1 waiting request with 24 active streams and reached 93% max GPU utilization.
The max-seqs-8 and batched-tokens-4096 variants produced much worse p99 latency in the 640-sample comparison.
Cost and SLO fields were not collected for these streaming reports.


Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Long-context scheduling

Long-context boundary

Set a concurrency or admission boundary for 8192/300 traffic.

1.20 req/s still delivers 100%, but repeats 36.8s p95 queue delay.

Decision record

Run shape

3 cases across 128-1,536 prompt tokens and 64-768 output tokens, paired with 4 run profiles.

Cases

3 cases

prefill-heavy: long prompt, short output (1,536 prompt tokens / 64 output tokens)

decode-heavy: short prompt, long output (128 prompt tokens / 768 output tokens)

mixed-prefill-decode: 50/50 streamed mix of prompt-heavy and decode-heavy requests (1,536 prompt tokens / 768 output tokens)

Run profiles

4 profiles

default: default checked-in serving profile

max-seqs-16: sequence cap for mixed streaming comparison

max-seqs-8: smaller active-set cap for mixed streaming comparison

batched-tokens-4096: batched-token cap without a sequence-count cap

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Streaming latency: p50 TTFT, p95 TTFT, p50 inter-token latency, p95 inter-token latency

Request latency: p95 request latency, p99 request latency, generation tokens/sec

GPU: average GPU utilization, max GPU utilization

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show prefill-decode

Example live command

./scripts/experiment run-stream --experiment prefill-decode --case mixed-prefill-decode --profile default --samples 640 --concurrency 24
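To sweep the four run profiles with the same load, the live command above can be generated per profile. The sketch below only builds the command strings and does not execute them. It reuses the flags shown in the example command and assumes, without confirmation from the source, that `--profile` accepts the other three profile names as listed. The `run_stream_argv` helper is hypothetical.

```python
import shlex
from typing import List

# Profile names as listed under "Run profiles".
PROFILES = ["default", "max-seqs-16", "max-seqs-8", "batched-tokens-4096"]


def run_stream_argv(profile: str, samples: int = 640, concurrency: int = 24) -> List[str]:
    """Build the argument list for one mixed-case streaming run."""
    return [
        "./scripts/experiment", "run-stream",
        "--experiment", "prefill-decode",
        "--case", "mixed-prefill-decode",
        "--profile", profile,
        "--samples", str(samples),
        "--concurrency", str(concurrency),
    ]


# One shell-safe command line per profile, all at the same concurrency.
commands = [shlex.join(run_stream_argv(profile)) for profile in PROFILES]

for command in commands:
    print(command)
```

Holding samples and concurrency fixed across profiles keeps the comparison like-for-like; the reported variant runs used concurrency 32 while the default run used 24.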
