KV Cache vs Concurrency

How does longer prompt context change concurrency and throughput?

Project

GPU Inference Decision Lab

Focus

Memory pressure

Run shape

3 cases / 5 profiles

Evidence

Supported: Long-context knee / Rejected: FP8 KV on g4dn

Why it matters

Compares prompt lengths, sweeps the 8192/300 knee, then tests whether scheduler caps fix it.

Long prompts can look healthy until arrival rate crosses a narrow boundary. The May 18 server-timing follow-up attributes the tail to queue and TTFT inflation, not decode.

Supported

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Supported: Long-context knee / Latest reports: 2026-05-18 UTC / Rejected: FP8 KV on g4dn

Queueing sets the boundary

With 8192-token prompts and 300 generated tokens, the profile is usable through 1.10 req/s, queues repeatably at 1.15 req/s, and is queue-dominated by 1.20 req/s. Server timing shows that bounded admission removes the queue and TTFT inflation while p95 decode stays near 29.4s.

Stable repeat

1.10 req/s

0 waiting; p95 25.23s; p95 queue 0.285s

Queue starts

1.15 req/s

16 waiting; p95 42.49s; p95 queue 14.02s

Queue dominated

1.20 req/s

100% delivered; p95 63.40s; p95 queue 36.93s

Cap variants

0 wins

seqs-16, seqs-24, and batched-16384 did not beat baseline

Admission result

27.98s p95

59 unserved; p95 queue 0.285s; 32 active

8192/300 rate sweep

The default long-context profile still completes every request at 1.20 req/s, but repeated queue delay and p95 make that point an operational edge.

| Target | Outcome | p95 latency | Peak waiting / active | GPU max |
|---|---|---|---|---|
| 0.75 req/s | stable | 6.61s | 0 waiting / 5 active | 96% |
| 1.00 req/s | stable but slower | 11.92s | 0 waiting / 12 active | 100% |
| 1.05 req/s | clean but slower | 14.31s | 0 waiting / 16 active | 100% |
| 1.10 req/s | stable repeat | 25.23s | 0 waiting / 28 active | 100% |
| 1.15 req/s | queueing repeats | 42.49s | 16 waiting / 48 active | 100% |
| 1.20 req/s | queue-dominated | 63.40s | 39 waiting / 71 active | 100% |
| 1.25 req/s | saturation is obvious | 77.51s | 57 waiting / 89 active | 100% |
| 1.50 req/s | overloaded | 180.27s | 181 waiting / 213 active | 100% |
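One way to read the boundary off the sweep above is to look for the first rate at which requests start waiting. The sketch below is illustrative only: the `SweepPoint` type and both helper functions are hypothetical names, not part of the lab's tooling, and the data is copied from the sweep. The second helper applies Little's law (in-flight requests ≈ arrival rate × time in system) with p95 latency standing in for mean latency, so it is a rough upper-leaning estimate rather than an exact prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepPoint:
    rate: float         # target arrival rate in req/s
    p95_s: float        # p95 request latency in seconds
    peak_waiting: int   # peak number of queued requests
    peak_active: int    # peak number of in-flight requests


# Values copied from the 8192/300 rate sweep above.
SWEEP = [
    SweepPoint(0.75, 6.61, 0, 5),
    SweepPoint(1.00, 11.92, 0, 12),
    SweepPoint(1.05, 14.31, 0, 16),
    SweepPoint(1.10, 25.23, 0, 28),
    SweepPoint(1.15, 42.49, 16, 48),
    SweepPoint(1.20, 63.40, 39, 71),
    SweepPoint(1.25, 77.51, 57, 89),
    SweepPoint(1.50, 180.27, 181, 213),
]


def queue_boundary(points):
    """Return (last queue-free rate, first queueing rate) for a sweep."""
    last_clean, first_queued = None, None
    for p in sorted(points, key=lambda p: p.rate):
        if p.peak_waiting > 0:
            first_queued = p.rate
            break
        last_clean = p.rate
    return last_clean, first_queued


def in_flight_estimate(point):
    """Little's law, L = rate * W, using p95 latency as a stand-in for W."""
    return point.rate * point.p95_s


if __name__ == "__main__":
    print("boundary:", queue_boundary(SWEEP))
    for p in SWEEP:
        print(f"{p.rate:.2f} req/s: estimated in flight "
              f"{in_flight_estimate(p):.1f}, observed peak active {p.peak_active}")
```

Below the boundary the estimate tracks the observed peak active count closely (about 27.8 against 28 at 1.10 req/s), which is consistent with concurrency, and therefore KV-cache demand, growing with latency as well as with arrival rate.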

Server timing attribution

Queue delay and TTFT inflate together as load crosses the boundary; decode remains around 29.4s, and admission removes queue inflation.

| Run | Outcome | p95 request | p95 queue / TTFT | p95 decode |
|---|---|---|---|---|
| 1.10 req/s r3 | stable repeat | 25.23s | 0.285s / 0.733s | 29.32s |
| 1.15 req/s r3 | queueing repeats | 42.49s | 14.02s / 18.12s | 29.41s |
| 1.20 req/s r3 | queue-dominated | 63.40s | 36.93s / 37.61s | 29.43s |
| 1.25 req/s direct | saturation begins | 77.51s | 48.11s / 71.24s | 29.44s |
| 1.25 admission-032 | bounded admission | 27.98s | 0.285s / 0.718s | 29.39s |
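The attribution can be checked with simple arithmetic: subtract the stable 1.10 req/s run from each other run, stage by stage. The sketch below does that with the p95 values from the runs above; the dictionary layout and function names are hypothetical, chosen only for illustration. Because p95 values of different stages are not additive, the deltas indicate where the tail grows rather than giving an exact decomposition.

```python
# p95 server timings in seconds, copied from the attribution runs above.
RUNS = {
    "1.10 req/s r3": {"request": 25.23, "queue": 0.285, "ttft": 0.733, "decode": 29.32},
    "1.15 req/s r3": {"request": 42.49, "queue": 14.02, "ttft": 18.12, "decode": 29.41},
    "1.20 req/s r3": {"request": 63.40, "queue": 36.93, "ttft": 37.61, "decode": 29.43},
    "1.25 req/s direct": {"request": 77.51, "queue": 48.11, "ttft": 71.24, "decode": 29.44},
    "1.25 admission-032": {"request": 27.98, "queue": 0.285, "ttft": 0.718, "decode": 29.39},
}
STABLE_RUN = "1.10 req/s r3"


def inflation(run_name, baseline_name=STABLE_RUN):
    """Per-stage p95 increase of a run over the stable baseline, in seconds."""
    run, base = RUNS[run_name], RUNS[baseline_name]
    return {stage: round(run[stage] - base[stage], 3)
            for stage in ("queue", "ttft", "decode")}


def decode_spread():
    """Gap between the largest and smallest p95 decode across all runs."""
    decodes = [timings["decode"] for timings in RUNS.values()]
    return round(max(decodes) - min(decodes), 3)


if __name__ == "__main__":
    for name in RUNS:
        print(name, inflation(name))
    print("decode spread (s):", decode_spread())
```

For the 1.20 req/s run, queue and TTFT each grow by more than 36 seconds over the stable run while decode moves by about a tenth of a second, which matches the reading that the tail comes from waiting rather than from token generation.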

Long-context fix attempts

The 1.20 req/s scheduler variants did not beat the baseline; bounded admission at 1.25 req/s made overload explicit and lowered p95.

| Profile / run | Outcome | p95 latency | Waiting / active | GPU max |
|---|---|---|---|---|
| long-context @ 1.20 | baseline practical edge | 54.35s | 30 waiting / 62 active | 100% |
| seqs-16 @ 1.20 | worse tail and waiting | 76.24s | 66 waiting / 82 active | 97% |
| seqs-24 @ 1.20 | worse tail and waiting | 61.36s | 45 waiting / 69 active | 100% |
| batched-16384 @ 1.20 | no improvement | 55.58s | 31 waiting / 63 active | 100% |
| admission-032 @ 1.25 | bounded admission | 27.98s | 0 waiting / 32 active | 100% |
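The "0 wins" verdict for the cap variants follows from a direct comparison at the same 1.20 req/s rate. A minimal sketch, assuming that "beating the baseline" means a lower p95 latency (the page does not spell out its exact criterion) and using hypothetical names throughout:

```python
# (profile, p95 latency in seconds, peak waiting) at 1.20 req/s, from the runs above.
BASELINE = ("long-context", 54.35, 30)
VARIANTS = [
    ("seqs-16", 76.24, 66),
    ("seqs-24", 61.36, 45),
    ("batched-16384", 55.58, 31),
]


def p95_deltas(baseline, variants):
    """p95 latency of each variant minus the baseline p95, in seconds."""
    _, base_p95, _ = baseline
    return {name: round(p95 - base_p95, 2) for name, p95, _ in variants}


def wins(baseline, variants):
    """Variants whose p95 latency is strictly lower than the baseline's."""
    return [name for name, delta in p95_deltas(baseline, variants).items()
            if delta < 0]


if __name__ == "__main__":
    print(p95_deltas(BASELINE, VARIANTS))
    print("wins:", wins(BASELINE, VARIANTS))
```

The admission-032 run is left out of the comparison on purpose: it ran at 1.25 req/s and sheds load, so its p95 is not directly comparable with runs that serve every request at 1.20 req/s.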

Evidence boundary

Treat this as an operating band for one model, GPU class, vLLM image, and 8192/300 workload. Use admission/backpressure before deeper scheduler-cap tuning on the current g4dn/vLLM path, and compare queue delay, TTFT, dropped demand, and p95 request latency together.

  • At 1.20 req/s, the latest repeats delivered 100% of offered work with zero failures, drops, or interruptions, but p95 request latency reached 62.66-63.40s.
  • The 1.20 req/s r3 report shows p95 server queue delay at 36.93s and p95 TTFT at 37.61s while p95 decode stays near 29.43s.
  • At the same rate, seqs-16 hit 76.24s p95, seqs-24 hit 61.36s, and batched-16384 hit 55.58s.
  • The latest 1.25 req/s admission-capped run exposed 59 unserved iterations, reduced p95 to 27.98s, and kept p95 server queue delay at 0.285s.
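The admission/backpressure idea can be sketched as a counter that admits requests up to a fixed in-flight cap and rejects the rest immediately instead of letting them queue. This is an illustration under assumptions, not the lab's implementation: `AdmissionGate` is a hypothetical class, and the cap of 32 is taken from the "32 active" figure in the admission result. How the admission-032 profile is actually enforced is not shown on this page.

```python
import threading


class AdmissionGate:
    """Admit at most max_active concurrent requests; reject the rest at once."""

    def __init__(self, max_active: int):
        if max_active < 1:
            raise ValueError("max_active must be at least 1")
        self._max_active = max_active
        self._active = 0
        self._rejected = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and take a slot if one is free, else count a rejection."""
        with self._lock:
            if self._active >= self._max_active:
                self._rejected += 1
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Give a slot back once an admitted request has finished."""
        with self._lock:
            if self._active == 0:
                raise RuntimeError("release() called with no admitted request")
            self._active -= 1

    @property
    def active(self) -> int:
        with self._lock:
            return self._active

    @property
    def rejected(self) -> int:
        with self._lock:
            return self._rejected


if __name__ == "__main__":
    gate = AdmissionGate(max_active=32)   # assumed cap, matching "32 active"
    admitted = sum(gate.try_admit() for _ in range(40))
    print(admitted, "admitted,", gate.rejected, "rejected")
    gate.release()                        # one request finishes
    print("next request admitted:", gate.try_admit())
```

Rejecting at the door is what turns hidden queue delay into visible unserved demand, which is why the bullets above ask for dropped demand to be read alongside queue delay, TTFT, and p95 request latency.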

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Long-context scheduling

Long-context boundary

Set a concurrency or admission boundary for 8192/300 traffic.

1.20 req/s still delivers 100% of requests, but repeat runs show about 36.8s of p95 queue delay.

Rejected: Long-context scheduling

Long-context scheduler caps

Do not use seq caps or larger batched-token caps as the first 1.20 req/s fix.

seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.

Rejected: Quantization + hardware

FP8 KV on g4dn

Do not select FP8 KV for this current long-context path.

47.58-69.12% delivery versus 100% baseline.


Run shape

3 cases across 512-8,192 prompt tokens and 100-300 output tokens, paired with 5 run profiles.

Cases

3 cases

prompt-512-output-100

short prompt baseline

512 prompt tokens / 100 output tokens

prompt-2048-output-200

medium prompt pressure

2,048 prompt tokens / 200 output tokens

prompt-8192-output-300

long prompt KV-cache pressure

8,192 prompt tokens / 300 output tokens
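To see why the long-prompt case is labelled KV-cache pressure, it helps to estimate cache size per sequence. The sketch below uses the standard estimate for transformer attention caches: one key and one value tensor per layer, each of size KV heads × head dimension per token. The model dimensions are purely hypothetical placeholders, since this page does not name the model, and the estimate ignores allocator block rounding and any other runtime overhead.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV-cache size in bytes for one sequence.

    The leading 2 accounts for storing both a key and a value tensor per layer.
    dtype_bytes is 2 for FP16 and 1 for an 8-bit cache format.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens


# HYPOTHETICAL model dimensions, for illustration only (not from this page).
HYPOTHETICAL_MODEL = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}

# (prompt tokens, output tokens) for the three cases listed above.
CASES = {
    "prompt-512-output-100": (512, 100),
    "prompt-2048-output-200": (2048, 200),
    "prompt-8192-output-300": (8192, 300),
}


def per_sequence_gib(case_name, dtype_bytes=2):
    """KV-cache size in GiB once a sequence has produced all of its output."""
    prompt, output = CASES[case_name]
    size = kv_cache_bytes(prompt + output, dtype_bytes=dtype_bytes,
                          **HYPOTHETICAL_MODEL)
    return size / 2**30


if __name__ == "__main__":
    for name in CASES:
        print(f"{name}: {per_sequence_gib(name):.3f} GiB per sequence (FP16)")
```

Under these assumed dimensions the long case needs roughly 14 times the cache of the short case per sequence, so the number of sequences that fit in GPU memory at once falls sharply as prompts grow. Real figures depend on the actual model and serving configuration.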

Run profiles

5 profiles

default

default checked-in serving profile

long-context

8k context profile for KV-cache pressure tests

long-context-seqs-24

reduced sequence cap that worsened the 1.20 req/s tail

long-context-seqs-16

more conservative sequence cap that increased waiting pressure

long-context-batched-16384

larger batched-token budget that did not improve the knee

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Concurrency

max stable concurrency, request failures, OOM events

Latency and throughput

p95 request latency, p99 request latency, requests/sec, generation tokens/sec

GPU memory

average GPU utilization, max GPU utilization, GPU memory used, GPU memory free

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show kv-cache

Example live command

./scripts/experiment run --experiment kv-cache --case prompt-8192-output-300-rate-120 --profile long-context