GPU Inference Decisions
Architecture calls derived from EKS/vLLM serving measurements, with each decision tied back to the experiment evidence that supports, rejects, or bounds it.
Supported: 4
Partial: 2
Rejected: 2
Blocked: 1
Decision matrix
Architecture Decisions
Each decision is grouped by domain and links back to the experiments that produced the evidence.
Domain
Admission + readiness
Bounded admission
Use bounded admission when requests can arrive before model readiness.
Queued requests reached 100% delivery; direct clients dropped 237-787 iterations.
Cold-start readiness
Optimize readiness before treating node launch as the cold-start bottleneck.
NodeClaim and GPU node arrival were fast; image, container, and model readiness drove the 425-439s wait.
Next evidence
Capture first-successful-completion timing across the selected cold-start reports.
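One way to capture first-successful-completion timing is to poll the endpoint from the moment the cold start is triggered and record when the first request succeeds. The sketch below is a minimal illustration, not the lab's tooling: `probe` is a hypothetical callable that returns True once a completion request succeeds, and the clock and sleep functions are injectable so the helper can be checked without waiting.

```python
import time
from typing import Callable, Optional


def time_to_first_success(
    probe: Callable[[], bool],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from the first attempt until probe() first succeeds.

    Returns None if no attempt succeeds within timeout_s.
    probe is a hypothetical stand-in for "send one completion request".
    """
    start = clock()
    while True:
        try:
            ok = probe()
        except Exception:
            # Connection refused, 503, and similar errors count as "not ready yet".
            ok = False
        elapsed = clock() - start
        if ok:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)
```

Recording this value next to the NodeClaim and node-arrival timestamps would show how much of the wait falls after the node is already present.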
Domain
Long-context scheduling
Long-context boundary
Set a concurrency or admission boundary for 8192/300 traffic.
At 1.20 req/s delivery is still 100%, but the 36.8s p95 queue delay repeats across runs.
Long-context scheduler caps
Do not use seq caps or larger batched-token caps as the first 1.20 req/s fix.
seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.
Small-request scheduler
Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.
Dynamic default kept the best delivery and token throughput.
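The variant names in the scheduler-cap evidence (seqs-16, seqs-24, batched-16384) read like settings of vLLM's `--max-num-seqs` and `--max-num-batched-tokens` engine arguments. The page does not state the exact flags, so the mapping below is an assumption, and `MODEL_ID` is a placeholder. The sketch only builds the `vllm serve` command line for each variant so that a sweep can be scripted consistently.

```python
from typing import Dict, List

# Assumed mapping from report variant names to vLLM engine flags.
VARIANTS: Dict[str, List[str]] = {
    "baseline": [],  # vLLM dynamic defaults, no explicit caps
    "seqs-16": ["--max-num-seqs", "16"],
    "seqs-24": ["--max-num-seqs", "24"],
    "batched-16384": ["--max-num-batched-tokens", "16384"],
}


def serve_command(model: str, variant: str) -> List[str]:
    """Build a `vllm serve` argv for one scheduler variant."""
    return ["vllm", "serve", model, *VARIANTS[variant]]


if __name__ == "__main__":
    for name in VARIANTS:
        print(" ".join(serve_command("MODEL_ID", name)))
```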
Domain
Cost + autoscaling
Useful-work cost
Use batching for small-request economics, but gate burst SLO claims.
Steady optimized costs $0.019752/1K; burst optimized p95 is still 10.91s.
Active-pressure target
Keep active-pressure HPA testing, but do not treat target 8 as optimal.
Targets 2/4/6/8 were all underutilized.
Next evidence
Run a higher-pressure HPA sweep that reaches clear GPU utilization separation.
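Why a low-pressure sweep cannot separate targets follows from the documented Kubernetes HPA rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a default 10% tolerance band. The sketch below applies that rule to a hypothetical per-pod "active pressure" value. The metric definition and the numbers are assumptions for illustration, not measurements from the reports.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Kubernetes HPA scaling rule for a single per-pod average metric."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    observed = 1.5  # hypothetical per-pod active pressure, below every target
    for target in (2, 4, 6, 8):
        print(target, desired_replicas(1, observed, target))
```

With the observed value below every target, each target yields the same replica count, so the sweep cannot rank them. A higher-pressure run pushes the ratio above 1 for the lower targets first, which is where separation would appear.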
Domain
Quantization + hardware
FP8 KV on g4dn
Do not select FP8 KV for this current long-context path.
47.58-69.12% delivery versus 100% baseline.
Blackwell FP4
Hold the FP4 architecture decision until B200 results exist.
EC2 UnfulfillableCapacity; no quantized artifact produced.
Next evidence
Rerun once B200 capacity can produce a comparable quantized artifact.
Evidence visuals
Decision evidence visuals
Local visuals recreate the key proof points without depending on the lab repository at runtime.
Long-context knee
The 8192/300 workload reaches a practical edge before failures appear; repeated server queue delay is the warning sign.
1.10 req/s: stable repeat
1.15 req/s: queueing repeats
1.20 req/s: queue-dominated repeat
1.25 req/s: saturation begins
Local static readout from May 18 8192/300 queue-timing reports
Queue attribution
vLLM timing separates queue, TTFT, and decode; admission removes queue inflation while decode stays roughly unchanged.
1.10 req/s: stable repeat; p95 decode 29.32s
1.15 req/s: queueing repeat; p95 decode 29.41s
1.20 req/s: queue-dominated; p95 decode 29.43s
1.25 admission: 59 unserved; p95 decode 29.39s
Server timing from May 17-18 direct and admission-capped long-context reports
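Because decode time barely moves across rates, Little's law (L = λW) gives a rough sense of how many requests are in decode at once. The sketch below multiplies each arrival rate by the p95 decode time from the readout. Little's law strictly uses the mean time in the system, so using p95 here overstates the result and should be read as an upper-leaning estimate, not a measured concurrency.

```python
# (arrival rate in req/s, p95 decode seconds) from the queue-attribution readout
READOUT = [(1.10, 29.32), (1.15, 29.41), (1.20, 29.43), (1.25, 29.39)]


def decode_concurrency(rate_rps: float, decode_s: float) -> float:
    """Little's law estimate: requests in decode = arrival rate * time in decode."""
    return rate_rps * decode_s


for rate, decode in READOUT:
    est = decode_concurrency(rate, decode)
    print(f"{rate:.2f} req/s -> ~{est:.1f} requests in decode")
```

Under these assumptions the estimate rises from about 32 at 1.10 req/s to about 35 at 1.20 req/s, which is consistent with queue delay growing while decode time stays flat.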
Long-context fix attempt
Scheduler-cap variants did not move the 1.20 req/s knee; bounded admission made overload explicit and lowered p95.
baseline @ 1.20: edge, 30 waiting / 62 active
seqs-16 @ 1.20: worse, 66 waiting / 82 active
seqs-24 @ 1.20: worse, 45 waiting / 69 active
batched-16384 @ 1.20: no improvement, 31 waiting / 63 active
admission-032 @ 1.25: bounded, 59 unserved / 32 active
Local static readout from May 15-18 KV-cache reports
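Bounded admission, as used in the admission-capped run, caps the number of requests in flight and turns the excess into explicit rejections instead of unbounded queueing. The sketch below is one minimal way to express that idea with asyncio, not the implementation behind the reports. The class name, `max_waiting`, and the rejection exception are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class AdmissionRejected(Exception):
    """Raised when both the active slots and the wait queue are full."""


class BoundedAdmission:
    """Caps in-flight calls and the number of callers allowed to wait."""

    def __init__(self, max_active: int, max_waiting: int) -> None:
        self._slots = asyncio.Semaphore(max_active)
        self._max_waiting = max_waiting
        self._waiting = 0
        self.rejected = 0

    async def run(self, call: Callable[[], Awaitable[T]]) -> T:
        # Reject up front when no slot is free and the wait queue is full.
        if self._slots.locked() and self._waiting >= self._max_waiting:
            self.rejected += 1
            raise AdmissionRejected()
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await call()
        finally:
            self._slots.release()
```

A gate built as `BoundedAdmission(max_active=32, max_waiting=...)` would mirror the 32 active requests in the readout. The wait-queue bound is a separate choice that trades queue delay against the number of unserved requests.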
Cost per useful work
Batching makes small-request serving cheaper, but the burst result still needs admission or more capacity before it is latency-safe.
steady naive: low useful work
steady optimized: SLO pass
burst naive: failed profile
burst optimized: cheap, SLO miss
Local static readout from curated cost reports
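A cost-per-useful-work figure can be derived from an hourly instance price and the rate of successfully delivered work. The sketch below shows that arithmetic under stated assumptions: the page does not say whether "/1K" counts tokens or requests, so the unit is left generic, and the example price and throughput are hypothetical, not values from the cost reports.

```python
def cost_per_1k(hourly_price_usd: float, useful_units_per_s: float) -> float:
    """Cost in USD per 1,000 useful units (tokens or requests, as you define them).

    useful_units_per_s should count only work that was actually delivered,
    so dropped or failed requests raise the effective cost.
    """
    if useful_units_per_s <= 0:
        raise ValueError("useful throughput must be positive")
    units_per_hour = useful_units_per_s * 3600
    return hourly_price_usd / units_per_hour * 1000


if __name__ == "__main__":
    # Hypothetical inputs: a $1.00/hour instance delivering 10 useful units/s.
    print(f"${cost_per_1k(1.00, 10.0):.6f} per 1K units")
```

A cost figure alone says nothing about latency, which is why a cheap burst profile can still miss its SLO.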