
GPU Inference Decisions

Architecture calls derived from EKS/vLLM serving measurements, with each decision tied back to the experiment evidence that supports, rejects, or bounds it.

Supported: 4
Partial: 2
Rejected: 2
Blocked: 1

Decision matrix

Architecture Decisions

Each decision is grouped by domain and links back to the experiments that produced the evidence.

Domain: Admission + readiness

Bounded admission (Supported)
Decision: Use bounded admission when requests can arrive before model readiness.
Evidence: 100% queued delivery; direct clients dropped 237-787 iterations.

Cold-start readiness (Partial)
Decision: Optimize readiness before treating node launch as the cold-start bottleneck.
Evidence: NodeClaim and GPU node arrival were fast; image, container, and model readiness drove the 425-439s wait.
Next evidence: Capture first-successful-completion timing across the selected cold-start reports.
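The bounded admission decision above can be illustrated with a minimal sketch. It assumes a single-process asyncio gateway sitting in front of the model server; the names `BoundedAdmission`, `AdmissionRejected`, `max_active`, and `max_waiting` are hypothetical and do not come from the lab reports. The idea is only that a fixed number of requests run at once, a bounded number wait, and everything beyond that is refused explicitly instead of queueing without limit.

```python
import asyncio


class AdmissionRejected(Exception):
    """Raised when both the active slots and the waiting room are full."""


class BoundedAdmission:
    """Hypothetical gate: at most max_active running, at most max_waiting queued."""

    def __init__(self, max_active: int, max_waiting: int) -> None:
        self._max_active = max_active
        self._max_waiting = max_waiting
        self._active = 0
        self._waiting = 0
        self._cond = asyncio.Condition()

    async def run(self, handler):
        """Run handler() under the gate, or raise AdmissionRejected."""
        async with self._cond:
            if self._active >= self._max_active:
                if self._waiting >= self._max_waiting:
                    # Overload is made explicit instead of growing the queue.
                    raise AdmissionRejected("admission queue full")
                self._waiting += 1
                try:
                    await self._cond.wait_for(
                        lambda: self._active < self._max_active
                    )
                finally:
                    self._waiting -= 1
            self._active += 1
        try:
            return await handler()
        finally:
            async with self._cond:
                self._active -= 1
                # Wake waiters so one of them can claim the freed slot.
                self._cond.notify_all()
```

A caller would wrap each upstream request in `gate.run(...)` and translate `AdmissionRejected` into whatever the client protocol uses for "retry later". The same gate could hold requests until a readiness probe passes, which is the case this decision targets, but that wiring is left out here.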

Domain: Long-context scheduling

Long-context boundary (Supported)
Decision: Set a concurrency or admission boundary for 8192/300 traffic.
Evidence: At 1.20 req/s delivery is still 100%, but runs repeatedly show a 36.8s p95 queue delay.

Long-context scheduler caps (Rejected)
Decision: Do not use seq caps or larger batched-token caps as the first 1.20 req/s fix.
Evidence: p95 latency was 76.24s for seqs-16, 61.36s for seqs-24, and 55.58s for batched-16384.

Small-request scheduler (Supported)
Decision: Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.
Evidence: The dynamic default kept the best delivery and token throughput.

Domain: Cost + autoscaling

Useful-work cost (Supported)
Decision: Use batching for small-request economics, but gate burst SLO claims.
Evidence: $0.019752 per 1K successful requests for steady optimized; burst optimized p95 is still 10.91s.

Active-pressure target (Partial)
Decision: Keep active-pressure HPA testing, but do not treat target 8 as optimal.
Evidence: Targets 2/4/6/8 were all underutilized.
Next evidence: Run a higher-pressure HPA sweep that reaches clear GPU utilization separation.
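To see why an underutilized sweep cannot separate the targets, it helps to look at the scaling rule the Kubernetes HorizontalPodAutoscaler documents: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below implements only that core rule; it leaves out min/max replicas, stabilization windows, and pod readiness handling. The observed pressure of 1.5 active requests per pod in the usage note is a made-up illustration, not a measurement from the reports.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Core HPA rule only; min/max bounds and stabilization are omitted."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the HPA leaves the replica count alone.
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical low-pressure observation, evaluated against the tested targets.
HYPOTHETICAL_ACTIVE_PER_POD = 1.5
low_pressure = {
    target: desired_replicas(1, HYPOTHETICAL_ACTIVE_PER_POD, target)
    for target in (2, 4, 6, 8)
}
```

Under that assumed load every target resolves to the same single replica, so the sweep says nothing about which target is better. A sweep only becomes informative once observed pressure exceeds at least the lower targets, which is what the next-evidence item asks for.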

Domain: Quantization + hardware

FP8 KV on g4dn (Rejected)
Decision: Do not select FP8 KV for the current long-context path.
Evidence: 47.58-69.12% delivery versus 100% baseline.

Blackwell FP4 (Blocked)
Decision: Hold the FP4 architecture decision until B200 results exist.
Evidence: EC2 UnfulfillableCapacity; no quantized artifact produced.
Next evidence: Rerun once B200 capacity can produce a comparable quantized artifact.

Evidence visuals

Decision evidence visuals

Local visuals recreate the key proof points without depending on the lab repository at runtime.

Long-context knee

The 8192/300 workload reaches a practical edge before failures appear; repeated server queue delay is the warning sign.

1.10 req/s (stable repeat)
p95 latency: 25.23s
Peak waiting: 0

1.15 req/s (queueing repeats)
p95 latency: 42.49s
Peak waiting: 16

1.20 req/s (queue-dominated repeat)
p95 latency: 63.40s
Peak waiting: 39

1.25 req/s (saturation begins)
p95 latency: 77.51s
Peak waiting: 57

Local static readout from May 18 8192/300 queue-timing reports
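The knee reading above can be made mechanical with a small sketch: treat the highest tested rate that shows no waiting requests as the practical edge. The data points are the four readings listed above; the `RatePoint` type and the `last_rate_without_queueing` helper are hypothetical names, and "peak waiting greater than zero" is an assumed warning threshold rather than a rule stated in the reports.

```python
from typing import Iterable, NamedTuple, Optional


class RatePoint(NamedTuple):
    rate_rps: float
    p95_latency_s: float
    peak_waiting: int


# Readings from the 8192/300 long-context knee panel.
POINTS = [
    RatePoint(1.10, 25.23, 0),
    RatePoint(1.15, 42.49, 16),
    RatePoint(1.20, 63.40, 39),
    RatePoint(1.25, 77.51, 57),
]


def last_rate_without_queueing(points: Iterable[RatePoint]) -> Optional[float]:
    """Return the highest rate reached before any request had to wait."""
    stable = None
    for point in sorted(points, key=lambda p: p.rate_rps):
        if point.peak_waiting > 0:
            break
        stable = point.rate_rps
    return stable


practical_edge = last_rate_without_queueing(POINTS)
```

With these readings the helper lands on 1.10 req/s, well before delivery drops below 100%, which is the point of using queue depth rather than failures as the signal.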

Queue attribution

vLLM timing separates queue, TTFT, and decode; admission removes queue inflation while decode stays roughly unchanged.

1.10 req/s (stable repeat)
p95 queue: 0.285s
p95 TTFT: 0.733s
p95 decode: 29.32s

1.15 req/s (queueing repeat)
p95 queue: 14.02s
p95 TTFT: 18.12s
p95 decode: 29.41s

1.20 req/s (queue-dominated)
p95 queue: 36.93s
p95 TTFT: 37.61s
p95 decode: 29.43s

1.25 admission (59 unserved)
p95 queue: 0.285s
p95 TTFT: 0.718s
p95 decode: 29.39s

Server timing from May 17-18 direct and admission-capped long-context reports
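The attribution claim can be checked with simple arithmetic on the readings above: compare how much p95 decode moves across all four runs with how much p95 queue moves across the direct runs. This is a rough sketch; percentiles are not additive, so it deliberately compares spreads instead of subtracting queue from TTFT. The variable names are illustrative only.

```python
# (p95 queue s, p95 TTFT s, p95 decode s) per direct run, from the panel above.
DIRECT = {
    "1.10 req/s": (0.285, 0.733, 29.32),
    "1.15 req/s": (14.02, 18.12, 29.41),
    "1.20 req/s": (36.93, 37.61, 29.43),
}
# Admission-capped run at 1.25 req/s (59 requests unserved).
ADMISSION_1_25 = (0.285, 0.718, 29.39)


def spread(values):
    """Difference between the largest and smallest reading."""
    return max(values) - min(values)


decode_values = [row[2] for row in DIRECT.values()] + [ADMISSION_1_25[2]]
queue_values = [row[0] for row in DIRECT.values()]

decode_spread_s = spread(decode_values)
queue_spread_s = spread(queue_values)
```

Decode varies by about 0.11s across every run, while queue delay grows by about 36.6s between 1.10 and 1.20 req/s. The admission-capped run returns queue delay to the 1.10 req/s level at the cost of 59 unserved requests.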

Long-context fix attempt

Scheduler-cap variants did not move the 1.20 req/s knee; bounded admission made overload explicit and lowered p95.

baseline @ 1.20 (edge: 30 waiting / 62 active)
p95 latency: 54.35s
Peak waiting: 30

seqs-16 @ 1.20 (worse: 66 waiting / 82 active)
p95 latency: 76.24s
Peak waiting: 66

seqs-24 @ 1.20 (worse: 45 waiting / 69 active)
p95 latency: 61.36s
Peak waiting: 45

batched-16384 @ 1.20 (no improvement: 31 waiting / 63 active)
p95 latency: 55.58s
Peak waiting: 31

admission-032 @ 1.25 (bounded: 59 unserved / 32 active)
p95 latency: 27.98s
Peak waiting: 0

Local static readout from May 15-18 KV-cache reports
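A short sketch makes the comparison above explicit by expressing each variant's p95 latency as a delta against the 1.20 req/s baseline. The numbers are the readings listed above; the dictionary and function names are illustrative. Note that the admission-capped run is at 1.25 req/s and leaves 59 requests unserved, so its delta is not a like-for-like comparison.

```python
BASELINE_P95_S = 54.35  # baseline @ 1.20 req/s

# p95 latency per variant, from the fix-attempt panel.
VARIANT_P95_S = {
    "seqs-16 @ 1.20": 76.24,
    "seqs-24 @ 1.20": 61.36,
    "batched-16384 @ 1.20": 55.58,
    "admission-032 @ 1.25": 27.98,
}


def p95_deltas(baseline_s, variants):
    """Positive delta means the variant is slower than the baseline."""
    return {name: round(p95 - baseline_s, 2) for name, p95 in variants.items()}


DELTAS = p95_deltas(BASELINE_P95_S, VARIANT_P95_S)
improved = [name for name, delta in DELTAS.items() if delta < 0]
```

All three scheduler-cap variants come out slower than the baseline (by 21.89s, 7.01s, and 1.23s), and only the admission-capped run lowers p95, by 26.37s.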

Cost per useful work

Batching makes small-request serving cheaper, but the burst result still needs admission or more capacity before it is latency-safe.

steady naive (low useful work)
Cost / 1K successful: $0.137976
p95 latency: 60.31s

steady optimized (SLO pass)
Cost / 1K successful: $0.019752
p95 latency: 1.61s

burst naive (failed profile)
Cost / 1K successful: $0.164137
p95 latency: 120.00s

burst optimized (cheap, SLO miss)
Cost / 1K successful: $0.012768
p95 latency: 10.91s

Local static readout from curated cost reports
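The "cost per 1K successful" metric above is presumably run cost divided by successful requests, scaled to a thousand. The reports do not state the instance price, run duration, or request counts, so the sketch below is an assumed formula with hypothetical parameter names (`hourly_price_usd`, `duration_s`, `successful`, `instances`) and does not attempt to reproduce the published figures.

```python
def cost_per_1k_successful(
    hourly_price_usd: float,
    duration_s: float,
    successful: int,
    instances: int = 1,
) -> float:
    """Assumed formula: run cost spread over successful requests only."""
    if successful <= 0:
        # Failed requests still cost money; with none delivered the metric
        # is undefined rather than zero.
        raise ValueError("no successful requests; cost per useful work is undefined")
    run_cost_usd = hourly_price_usd * instances * duration_s / 3600.0
    return run_cost_usd / successful * 1000.0


# Ratio between two published readings (steady naive vs. steady optimized).
steady_ratio = 0.137976 / 0.019752
```

Because only successful requests sit in the denominator, a profile that drops or times out most requests looks expensive even on identical hardware, which matches the naive rows above. Steady naive comes out at roughly seven times the steady optimized cost per 1K. The metric says nothing about latency, so a cheap result such as burst optimized still has to be checked against the p95 SLO separately.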