Prefill vs Decode Timing
How do prompt-heavy and decode-heavy requests change TTFT and inter-token timing?
Project
GPU Inference Decision Lab
Focus
Streaming latency
Run shape
3 cases / 4 profiles
Evidence
Selected report: Streaming split
Why it matters
This experiment uses a streaming client to separate time to first token (TTFT) from inter-token latency across prompt-heavy and decode-heavy shapes.
Similar total durations can stress different serving phases. Splitting prefill from decode makes streaming UX easier to reason about.
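As a rough illustration of that split, the sketch below derives TTFT, inter-token gaps, and total latency from the arrival times of streamed tokens. It is not the lab's client: `StreamTiming`, `split_stream_timing`, and the timestamp values are hypothetical names and numbers chosen for the example, and the timestamps are assumed to come from a monotonic clock in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    ttft_s: float               # request sent -> first token received
    inter_token_s: List[float]  # gaps between consecutive tokens
    total_s: float              # request sent -> last token received


def split_stream_timing(sent_at: float, token_times: List[float]) -> StreamTiming:
    """Split one streamed response into TTFT and inter-token gaps.

    sent_at and token_times are assumed to be monotonic-clock readings
    in seconds, with token_times in arrival order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return StreamTiming(ttft, gaps, token_times[-1] - sent_at)


# Illustrative timestamps only, not measured data.
timing = split_stream_timing(0.0, [0.5, 0.75, 1.0, 1.25])
print(timing.ttft_s, timing.inter_token_s, timing.total_s)
```

TTFT mostly reflects queueing plus prefill, while the gaps reflect decode, which is why two requests with similar totals can stress different phases.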
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Mixed streaming shape split
The mixed streaming run completed 640 requests at concurrency 24, split by prefill-heavy versus decode-heavy shapes.
Mixed run
12.52s p95
640 streamed requests at concurrency 24
Prefill shape
1.52s p95
303 mixed-run requests; p95 TTFT 495 ms
Decode shape
12.79s p95
337 mixed-run requests; p95 TTFT 443 ms
Mixed run shape split
The mixed result hides two latency profiles: prefill-heavy requests finish quickly, while decode-heavy requests dominate total latency.
Shape: prefill-heavy
Completed: 303
p95 total: 1.52s
p95 TTFT: 495 ms
Tokens/sec: 69.03
Shape: decode-heavy
Completed: 337
p95 total: 12.79s
p95 TTFT: 443 ms
Tokens/sec: 66.97
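The effect described above, where one aggregate p95 hides two profiles, can be reproduced with a small grouping sketch. This is an assumption-laden illustration, not the report's code: the `(shape, total_seconds)` record layout, the function names, and the sample values are hypothetical, and the nearest-rank percentile used here may differ from the method behind the reported figures.

```python
from collections import defaultdict


def p95_nearest_rank(values):
    """Nearest-rank p95: smallest value with at least 95% of samples at or below it."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_shape(records):
    """Group (shape, total_seconds) records by shape and compute p95 per group."""
    groups = defaultdict(list)
    for shape, total_s in records:
        groups[shape].append(total_s)
    return {shape: p95_nearest_rank(vals) for shape, vals in groups.items()}


# Illustrative totals only: 20 short requests and 20 long ones.
records = [("prefill-heavy", t) for t in range(1, 21)]
records += [("decode-heavy", t) for t in range(101, 121)]

per_shape = p95_by_shape(records)
overall = p95_nearest_rank([total for _, total in records])
print(per_shape, overall)
```

In the sample data the overall p95 lands inside the slow group, which mirrors how the reported 12.52s mixed p95 sits next to the 12.79s decode-heavy p95 rather than the 1.52s prefill-heavy one.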
Isolated shape baseline
The isolated concurrency-16 runs show the single-shape TTFT and total-latency contrast before mixed-workload interference.
Case: prefill-heavy
Timing split: Longer TTFT, short total
p95 total: 2.24s
p95 TTFT: 1.37s
GPU max: 95%
Case: decode-heavy
Timing split: Fast TTFT, longer total
p95 total: 8.41s
p95 TTFT: 149 ms
GPU max: 84%
Mixed profile follow-up
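In these runs, capping sequence count or batched tokens did not improve mixed-shape p95 or p99 latency. The capped profiles ran at concurrency 32 rather than the default profile's 24, so the comparison is not like-for-like.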
Profile: default
Run shape: 640 samples / 24 concurrency
p95 total: 12.52s
p99 total: 12.93s
GPU max: 93%
Profile: max-seqs-16
Run shape: 640 samples / 32 concurrency
p95 total: 17.56s
p99 total: 18.57s
GPU max: 97%
Profile: max-seqs-8
Run shape: 640 samples / 32 concurrency
p95 total: 29.88s
p99 total: 133.64s
GPU max: 90%
Profile: batched-tokens-4096
Run shape: 640 samples / 32 concurrency
p95 total: 20.89s
p99 total: 125.51s
GPU max: 91%
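One way to read the four profiles side by side is the ratio of p99 to p95, which shows how far the tail stretches beyond the bulk of requests. The sketch below uses the p95 and p99 totals listed above; the `tail_ratio` helper and the choice of this ratio as a summary are assumptions made for illustration, not a metric the report defines. The capped profiles ran at concurrency 32 and the default at 24, so the ratios describe each run rather than a controlled comparison.

```python
# (p95 total, p99 total) in seconds, copied from the profile results above.
PROFILES = {
    "default": (12.52, 12.93),
    "max-seqs-16": (17.56, 18.57),
    "max-seqs-8": (29.88, 133.64),
    "batched-tokens-4096": (20.89, 125.51),
}


def tail_ratio(p95_s: float, p99_s: float) -> float:
    """How many times larger p99 is than p95 (1.0 means no extra tail)."""
    return p99_s / p95_s


ratios = {name: tail_ratio(p95, p99) for name, (p95, p99) in PROFILES.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:22s} p99/p95 = {ratio:.2f}")
```

A ratio close to 1 means p95 and p99 move together; the max-seqs-8 and batched-tokens-4096 runs stand out because their p99 is several times their p95.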
Evidence boundary
The mixed run is the best page-level evidence because it exercises both shapes together. Isolated runs are baselines; max-seqs and batched-token variants still need a curated conclusion.
- The mixed default run peaked at 1 waiting request with 24 active streams and reached 93% max GPU utilization.
- The max-seqs-8 and batched-tokens-4096 variants produced much worse p99 latency in the 640-sample comparison: 133.64s and 125.51s, against 12.93s for the default profile.
- Cost and SLO fields were not collected for these streaming reports.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Long-context boundary
Set a concurrency or admission boundary for 8192/300 traffic.
1.20 req/s still delivers 100%, but repeats 36.8s p95 queue delay.
Decision record
Run shape
3 cases across 128-1,536 prompt tokens and 64-768 output tokens, paired with 4 run profiles.
Cases
3 cases
prefill-heavy
long prompt, short output
decode-heavy
short prompt, long output
mixed-prefill-decode
50/50 streamed mix of prompt-heavy and decode-heavy requests
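A 50/50 mix like the one in the last case could be built by drawing a shape independently for each request. The sketch below is hypothetical throughout: whether the harness samples this way is not stated on this page, `build_mix` is an invented name, and the per-shape token sizes are assumed from the endpoints of the 128-1,536 prompt and 64-768 output ranges rather than taken from the case definitions.

```python
import random
from collections import Counter

# Assumed token sizes per shape (hypothetical; only the overall ranges are documented).
SHAPES = {
    "prefill-heavy": {"prompt_tokens": 1536, "max_output_tokens": 64},
    "decode-heavy": {"prompt_tokens": 128, "max_output_tokens": 768},
}


def build_mix(n_requests: int, seed: int = 0) -> list:
    """Pick a shape name for each request with equal probability.

    A fixed seed keeps the sequence reproducible between runs.
    """
    rng = random.Random(seed)
    names = sorted(SHAPES)
    return [rng.choice(names) for _ in range(n_requests)]


mix = build_mix(640)
counts = Counter(mix)
print(dict(counts))
```

Independent draws rarely give an exact 320/320 split, so an uneven count such as the 303/337 seen in the mixed run is consistent with this kind of sampling.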
Run profiles
4 profiles
default
default checked-in serving profile
max-seqs-16
sequence cap for mixed streaming comparison
max-seqs-8
smaller active-set cap for mixed streaming comparison
batched-tokens-4096
batched-token cap without a sequence-count cap
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Streaming latency
Request latency
GPU
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show prefill-decode
Example live command
./scripts/experiment run-stream --experiment prefill-decode --case mixed-prefill-decode --profile default --samples 640 --concurrency 24