KV Cache vs Concurrency
How does longer prompt context change concurrency and throughput?
Project: GPU Inference Decision Lab
Focus: Memory pressure
Run shape: 3 cases / 5 profiles
Evidence: Long-context knee supported; FP8 KV on g4dn rejected
Why it matters
This experiment compares prompt lengths, sweeps arrival rate across the knee of the 8192-token prompt / 300-token output case, then tests whether scheduler caps fix it.
Long prompts can look healthy until arrival rate crosses a narrow boundary. The May 18 server-timing follow-up attributes the tail to queue and TTFT inflation, not decode.
Supported
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Queueing sets the boundary
With 8192-token prompts and 300 generated tokens, the profile is usable through 1.10 req/s, queues repeatably at 1.15 req/s, and is queue-dominated by 1.20 req/s. Server timing shows admission removes queue and TTFT inflation while decode stays roughly unchanged.
Stable repeat
1.10 req/s
0 waiting; p95 25.23s; p95 queue 0.285s
Queue starts
1.15 req/s
16 waiting; p95 42.49s; p95 queue 14.02s
Queue dominated
1.20 req/s
100% delivered; p95 63.40s; p95 queue 36.93s
Cap variants
0 wins
seqs-16, seqs-24, and batched-16384 did not beat baseline
Admission result
27.98s p95
59 unserved at 1.25 req/s; p95 queue 0.285s; 32 active
8192/300 rate sweep
The default long-context profile still completes every request at 1.20 req/s, but the repeated queue delay (36.93s p95) and request latency (63.40s p95) make that rate an operational edge rather than a comfortable target.
Target: 0.75 req/s
Outcome: stable
p95 latency: 6.61s
Peak waiting: 0 waiting / 5 active
GPU max: 96%
Target: 1.00 req/s
Outcome: stable but slower
p95 latency: 11.92s
Peak waiting: 0 waiting / 12 active
GPU max: 100%
Target: 1.05 req/s
Outcome: clean but slower
p95 latency: 14.31s
Peak waiting: 0 waiting / 16 active
GPU max: 100%
Target: 1.10 req/s
Outcome: stable repeat
p95 latency: 25.23s
Peak waiting: 0 waiting / 28 active
GPU max: 100%
Target: 1.15 req/s
Outcome: queueing repeats
p95 latency: 42.49s
Peak waiting: 16 waiting / 48 active
GPU max: 100%
Target: 1.20 req/s
Outcome: queue-dominated
p95 latency: 63.40s
Peak waiting: 39 waiting / 71 active
GPU max: 100%
Target: 1.25 req/s
Outcome: saturation is obvious
p95 latency: 77.51s
Peak waiting: 57 waiting / 89 active
GPU max: 100%
Target: 1.50 req/s
Outcome: overloaded
p95 latency: 180.27s
Peak waiting: 181 waiting / 213 active
GPU max: 100%
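One way to read the knee off this sweep is to look for the last rate with no peak waiting and for the step where p95 starts climbing fastest. The sketch below does that using the rows above; the function names and the "seconds of p95 per 0.01 req/s" slope are illustrative choices, not part of the lab's tooling.

```python
# Rows copied from the 8192/300 rate sweep above:
# (target req/s, p95 latency in s, peak waiting, peak active)
SWEEP = [
    (0.75, 6.61, 0, 5),
    (1.00, 11.92, 0, 12),
    (1.05, 14.31, 0, 16),
    (1.10, 25.23, 0, 28),
    (1.15, 42.49, 16, 48),
    (1.20, 63.40, 39, 71),
    (1.25, 77.51, 57, 89),
    (1.50, 180.27, 181, 213),
]


def last_rate_without_waiting(rows):
    """Highest swept rate before any peak waiting appears.

    Assumes rows are sorted by rate, ascending.
    """
    best = None
    for rate, _p95, waiting, _active in rows:
        if waiting > 0:
            break
        best = rate
    return best


def p95_slope(rows):
    """Seconds of p95 added per 0.01 req/s between consecutive sweep points."""
    slopes = []
    for (r0, p0, *_), (r1, p1, *_) in zip(rows, rows[1:]):
        slopes.append((r0, r1, (p1 - p0) / ((r1 - r0) * 100)))
    return slopes


if __name__ == "__main__":
    print("last rate with 0 waiting:", last_rate_without_waiting(SWEEP))
    for r0, r1, slope in p95_slope(SWEEP):
        print(f"{r0:.2f} -> {r1:.2f} req/s: {slope:.2f}s p95 per 0.01 req/s")
```

On these rows the slope is well under 1s per 0.01 req/s below 1.05 req/s and several seconds per 0.01 req/s between 1.10 and 1.20 req/s, which is consistent with the narrow boundary described above.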
Server timing attribution
Queue delay and TTFT inflate together as load crosses the boundary; decode remains around 29.4s, and admission removes queue inflation.
Run: 1.10 req/s r3
Outcome: stable repeat
p95 request: 25.23s
p95 queue / TTFT: 0.285s / 0.733s
p95 decode: 29.32s
Run: 1.15 req/s r3
Outcome: queueing repeats
p95 request: 42.49s
p95 queue / TTFT: 14.02s / 18.12s
p95 decode: 29.41s
Run: 1.20 req/s r3
Outcome: queue-dominated
p95 request: 63.40s
p95 queue / TTFT: 36.93s / 37.61s
p95 decode: 29.43s
Run: 1.25 req/s direct
Outcome: saturation begins
p95 request: 77.51s
p95 queue / TTFT: 48.11s / 71.24s
p95 decode: 29.44s
Run: 1.25 admission-032
Outcome: bounded admission
p95 request: 27.98s
p95 queue / TTFT: 0.285s / 0.718s
p95 decode: 29.39s
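The attribution can be checked with simple arithmetic: if queueing drives the tail, the spread of p95 queue and TTFT across runs should be large while the spread of p95 decode stays small. A minimal sketch over the rows above, with illustrative names:

```python
# p95 values in seconds, copied from the server timing rows above.
RUNS = {
    "1.10 req/s r3": {"request": 25.23, "queue": 0.285, "ttft": 0.733, "decode": 29.32},
    "1.15 req/s r3": {"request": 42.49, "queue": 14.02, "ttft": 18.12, "decode": 29.41},
    "1.20 req/s r3": {"request": 63.40, "queue": 36.93, "ttft": 37.61, "decode": 29.43},
    "1.25 req/s direct": {"request": 77.51, "queue": 48.11, "ttft": 71.24, "decode": 29.44},
    "1.25 admission-032": {"request": 27.98, "queue": 0.285, "ttft": 0.718, "decode": 29.39},
}


def spread(runs, phase):
    """Max minus min of one p95 phase across all runs, in seconds."""
    values = [row[phase] for row in runs.values()]
    return max(values) - min(values)


if __name__ == "__main__":
    for phase in ("queue", "ttft", "decode"):
        print(f"{phase}: p95 spread {spread(RUNS, phase):.3f}s")
```

Across these five runs p95 decode moves by about a tenth of a second, while p95 queue and TTFT move by tens of seconds.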
Long-context fix attempts
The 1.20 req/s scheduler variants did not beat the baseline; bounded admission at 1.25 req/s made overload explicit and lowered p95.
Profile / run: long-context @ 1.20
Outcome: baseline practical edge
p95 latency: 54.35s
Waiting / active: 30 waiting / 62 active
GPU max: 100%
Profile / run: seqs-16 @ 1.20
Outcome: worse tail and waiting
p95 latency: 76.24s
Waiting / active: 66 waiting / 82 active
GPU max: 97%
Profile / run: seqs-24 @ 1.20
Outcome: worse tail and waiting
p95 latency: 61.36s
Waiting / active: 45 waiting / 69 active
GPU max: 100%
Profile / run: batched-16384 @ 1.20
Outcome: no improvement
p95 latency: 55.58s
Waiting / active: 31 waiting / 63 active
GPU max: 100%
Profile / run: admission-032 @ 1.25
Outcome: bounded admission
p95 latency: 27.98s
Waiting / active: 0 waiting / 32 active
GPU max: 100%
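The "0 wins" call for the cap variants is a direct comparison of p95 latency against the 1.20 req/s baseline row in this table. A small sketch of that comparison, using only the figures above (the helper name is illustrative):

```python
# p95 latency in seconds at 1.20 req/s, copied from the rows above.
BASELINE_P95 = 54.35  # long-context @ 1.20
VARIANTS = {
    "seqs-16": 76.24,
    "seqs-24": 61.36,
    "batched-16384": 55.58,
}


def variants_beating_baseline(baseline_p95, variants):
    """Names of variants whose p95 latency is strictly below the baseline."""
    return [name for name, p95 in variants.items() if p95 < baseline_p95]


if __name__ == "__main__":
    wins = variants_beating_baseline(BASELINE_P95, VARIANTS)
    print(f"{len(wins)} wins")
    for name, p95 in VARIANTS.items():
        print(f"{name}: {p95 - BASELINE_P95:+.2f}s vs baseline")
```

The admission-032 row is left out on purpose: it ran at 1.25 req/s and sheds load, so its p95 is not comparable to the 1.20 req/s rows on latency alone.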
Evidence boundary
Treat this as an operating band for one model, GPU class, vLLM image, and 8192/300 workload. Use admission/backpressure before deeper scheduler-cap tuning on the current g4dn/vLLM path, and compare queue delay, TTFT, dropped demand, and p95 request latency together.
- At 1.20 req/s, the latest repeats delivered 100% of offered work with zero failures, drops, or interruptions, but p95 request latency reached 62.66-63.40s.
- The 1.20 req/s r3 report shows p95 server queue delay at 36.93s and p95 TTFT at 37.61s while p95 decode stays near 29.43s.
- At the same rate, seqs-16 hit 76.24s p95, seqs-24 hit 61.36s, and batched-16384 hit 55.58s.
- The latest 1.25 req/s admission-capped run exposed 59 unserved iterations, reduced p95 to 27.98s, and kept p95 server queue delay at 0.285s.
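How the lab implements admission is not shown on this page. As a rough illustration of the idea, the sketch below caps in-flight requests and rejects the excess immediately instead of queueing it. The cap of 32 is an assumption read from the profile name admission-032 and the 32 active requests reported above; the class and method names are hypothetical.

```python
import threading


class AdmissionGate:
    """Admit at most max_active requests; reject the rest immediately.

    Hypothetical sketch: rejected requests are counted so that unserved
    demand stays visible instead of turning into queue delay.
    """

    def __init__(self, max_active=32):  # 32 is assumed from "admission-032"
        self._max_active = max_active
        self._active = 0
        self._rejected = 0
        self._lock = threading.Lock()

    def try_admit(self):
        """Return True and take a slot if one is free, else count a rejection."""
        with self._lock:
            if self._active >= self._max_active:
                self._rejected += 1
                return False
            self._active += 1
            return True

    def release(self):
        """Free a slot when an admitted request finishes."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self):
        with self._lock:
            return self._active

    @property
    def rejected(self):
        with self._lock:
            return self._rejected
```

A caller would wrap each request in `try_admit()` and `release()`, and report `rejected` alongside p95 latency so that a lower tail is never read without the dropped demand that paid for it.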
Selected reports
Generated reports behind this summary.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Long-context boundary
Set a concurrency or admission boundary for 8192/300 traffic.
1.20 req/s still delivers 100% of requests, but repeats show 36.8s p95 queue delay.
View decision record →
Long-context scheduler caps
Do not use seq caps or larger batched-token caps as the first 1.20 req/s fix.
seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.
View decision record →
FP8 KV on g4dn
Do not select FP8 KV for this current long-context path.
47.58-69.12% delivery versus 100% baseline.
View decision record →
Run shape
3 cases across 512-8,192 prompt tokens and 100-300 output tokens, paired with 5 run profiles.
Cases
3 cases
prompt-512-output-100
short prompt baseline
prompt-2048-output-200
medium prompt pressure
prompt-8192-output-300
long prompt KV-cache pressure
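To see why the 8192-token case is the one that applies KV-cache pressure, it helps to estimate cache size per sequence. The standard per-token estimate is 2 (key and value) × layers × KV heads × head dimension × bytes per element. The model dimensions below are hypothetical placeholders, since this page does not name the model, and the estimate ignores block-level allocation overhead in the serving engine.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """KV-cache bytes for one token: a key and a value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_per_sequence(prompt_tokens, output_tokens, **model):
    """KV-cache bytes for one fully generated sequence."""
    return (prompt_tokens + output_tokens) * kv_bytes_per_token(**model)


# HYPOTHETICAL model shape, for illustration only (not taken from this page).
HYPOTHETICAL_MODEL = dict(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2  # 2 bytes = FP16
)

# Prompt and output token counts from the three cases above.
CASES = {
    "prompt-512-output-100": (512, 100),
    "prompt-2048-output-200": (2048, 200),
    "prompt-8192-output-300": (8192, 300),
}

if __name__ == "__main__":
    for name, (prompt, output) in CASES.items():
        size = kv_bytes_per_sequence(prompt, output, **HYPOTHETICAL_MODEL)
        print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumed dimensions the long case needs roughly 14 times the cache of the short case per sequence, so far fewer sequences fit in GPU memory at once.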
Run profiles
5 profiles
default
default checked-in serving profile
long-context
8k context profile for KV-cache pressure tests
long-context-seqs-24
reduced sequence cap that worsened the 1.20 req/s tail
long-context-seqs-16
more conservative sequence cap that increased waiting pressure
long-context-batched-16384
larger batched-token budget that did not improve the knee
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Concurrency
Latency and throughput
GPU memory
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show kv-cache
Example live command
./scripts/experiment run --experiment kv-cache --case prompt-8192-output-300-rate-120 --profile long-context