GPU Inference Decisions

Architecture calls derived from EKS/vLLM serving measurements, with each decision tied back to the experiment evidence that supports, rejects, or bounds it.

Status legend: Supported, Partial, Rejected, Blocked

Decision matrix

Architecture Decisions

Each decision is grouped by domain and links back to the experiments that produced the evidence.

Domain: Admission + readiness

Bounded admission

Supported

Use bounded admission when requests can arrive before model readiness.

Queued requests reached 100% delivery; direct clients dropped 237-787 iterations.

Experiments: Request Pattern Utilization; Autoscaling and Queueing Behavior
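
The bounded-admission decision can be illustrated with a minimal sketch. It assumes an asyncio front end; the class name `BoundedAdmission`, the `max_active` and `max_waiting` parameters, and the `fake_backend` handler are hypothetical and are not taken from the lab's admission layer. Requests that arrive before readiness wait in a bounded queue, and anything beyond the bound is counted as unserved instead of being dropped silently.

```python
import asyncio


class BoundedAdmission:
    """Hold requests until the model is ready, cap active work, bound the queue."""

    def __init__(self, max_active: int, max_waiting: int):
        self._slots = asyncio.Semaphore(max_active)  # concurrent requests allowed
        self._ready = asyncio.Event()                # set once the model can serve
        self._max_waiting = max_waiting
        self._waiting = 0
        self.unserved = 0                            # explicit overload counter

    def mark_ready(self) -> None:
        self._ready.set()

    async def submit(self, handler, payload):
        must_wait = not self._ready.is_set() or self._slots.locked()
        if must_wait and self._waiting >= self._max_waiting:
            self.unserved += 1                       # reject instead of queueing forever
            return None
        self._waiting += 1
        try:
            await self._ready.wait()
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await handler(payload)
        finally:
            self._slots.release()


async def demo():
    gate = BoundedAdmission(max_active=2, max_waiting=3)

    async def fake_backend(payload):                 # stand-in for a model call
        await asyncio.sleep(0.01)
        return f"done:{payload}"

    tasks = [asyncio.create_task(gate.submit(fake_backend, i)) for i in range(8)]
    await asyncio.sleep(0)                           # requests arrive before readiness
    gate.mark_ready()
    results = await asyncio.gather(*tasks)
    return results, gate.unserved


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the first three requests queue until `mark_ready()` is called and then complete, while the remaining five are counted in `unserved`. The design choice is that overload becomes a visible number rather than hidden queue delay.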

Cold-start readiness

Partial

Optimize readiness before treating node launch as the cold-start bottleneck.

NodeClaim and GPU node arrival were fast; image, container, and model readiness drove the 425-439s wait.

Next evidence: Capture first-successful-completion timing across the selected cold-start reports.

Experiments: Autoscaling and Queueing Behavior

Domain: Long-context scheduling

Long-context boundary

Supported

Set a concurrency or admission boundary for 8192/300 traffic.

At 1.20 req/s delivery is still 100%, but the 36.8s p95 queue delay repeats across runs.

Experiments: KV Cache vs Concurrency; Prefill vs Decode Timing

Long-context scheduler caps

Rejected

Do not use seq caps or larger batched-token caps as the first 1.20 req/s fix.

seqs-16 hit 76.24s p95; seqs-24 hit 61.36s; batched-16384 hit 55.58s.

Experiments: KV Cache vs Concurrency; Batching Scheduler Tradeoffs

Small-request scheduler

Supported

Keep vLLM dynamic defaults for current 512/128 steady and burst traffic.

Dynamic default kept the best delivery and token throughput.

Experiments: Batching Scheduler Tradeoffs; Request Pattern Utilization

Domain: Cost + autoscaling

Useful-work cost

Supported

Use batching for small-request economics, but gate burst SLO claims.

$0.019752/1K steady optimized; burst optimized p95 still 10.91s.

Experiments: Cost per Useful Work; Batching Scheduler Tradeoffs

Active-pressure target

Partial

Keep active-pressure HPA testing, but do not treat target 8 as optimal.

Targets 2/4/6/8 were all underutilized.

Next evidence: Run a higher-pressure HPA sweep that reaches clear GPU utilization separation.

Experiments: Autoscaling and Queueing Behavior
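
One way to see why a low-pressure sweep cannot separate the targets is to apply the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, desired = ceil(current replicas x current metric / target metric), with its default 10% tolerance. The sketch below assumes the active-pressure metric is a per-pod average; the observed pressure of 3 is a hypothetical placeholder, not a measured value from these experiments.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA ratio rule."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # within tolerance: no scaling action
    return math.ceil(current_replicas * ratio)


def sweep(current_replicas: int, observed_pressure: float, targets):
    """Evaluate the same observed pressure against several candidate targets."""
    return {t: desired_replicas(current_replicas, observed_pressure, t)
            for t in targets}


if __name__ == "__main__":
    # Hypothetical observation: 1 replica averaging 3 active requests per pod.
    print(sweep(1, 3.0, [2, 4, 6, 8]))
```

With that placeholder load, only target 2 asks for a second replica; targets 4, 6, and 8 all stay at one replica, so they cannot be ranked against each other until the offered pressure is higher.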

Domain: Quantization + hardware

FP8 KV on g4dn

Rejected

Do not select FP8 KV for the current long-context path.

47.58-69.12% delivery versus 100% baseline.

Experiments: KV Cache vs Concurrency

Blackwell FP4

Blocked

Hold the FP4 architecture decision until B200 results exist.

EC2 UnfulfillableCapacity; no quantized artifact produced.

Next evidence: Rerun once B200 capacity can produce a comparable quantized artifact.

Experiments: FP4 Quantization Optimization

Evidence visuals

Decision evidence visuals

Local visuals recreate the key proof points without depending on the lab repository at runtime.

Long-context knee

The 8192/300 workload reaches a practical edge before failures appear; repeated server queue delay is the warning sign.

1.10 req/s (stable repeat): p95 latency 25.23s, peak waiting 0

1.15 req/s (queueing repeats): p95 latency 42.49s, peak waiting 16

1.20 req/s (queue-dominated repeat): p95 latency 63.40s, peak waiting 39

1.25 req/s (saturation begins): p95 latency 77.51s, peak waiting 57

Local static readout from May 18 8192/300 queue-timing reports
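
The knee reading can be reproduced from the four readouts above with a small check. This is a sketch under a stated assumption: the "practical edge" is taken to be the last request rate whose peak waiting count stays at or below a threshold (0 here). The function name and the threshold are illustrative, not part of the lab tooling.

```python
# (request rate in req/s, p95 latency in seconds, peak waiting requests)
KNEE_RUNS = [
    (1.10, 25.23, 0),
    (1.15, 42.49, 16),
    (1.20, 63.40, 39),
    (1.25, 77.51, 57),
]


def last_rate_before_queueing(runs, max_waiting=0):
    """Return the highest rate reached before peak waiting first exceeds the bound."""
    last_ok = None
    for rate, _p95, waiting in sorted(runs):
        if waiting > max_waiting:
            break                    # queueing has started; stop at the first violation
        last_ok = rate
    return last_ok


if __name__ == "__main__":
    print(last_rate_before_queueing(KNEE_RUNS))
```

On these readouts the function returns 1.10, matching the "stable repeat" row; every higher rate already shows waiting requests even though delivery has not yet failed.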

Queue attribution

vLLM timing separates queue, TTFT, and decode; admission removes queue inflation while decode stays roughly unchanged.

1.10 req/s (stable repeat): p95 queue 0.285s, p95 TTFT 0.733s, p95 decode 29.32s

1.15 req/s (queueing repeat): p95 queue 14.02s, p95 TTFT 18.12s, p95 decode 29.41s

1.20 req/s (queue-dominated): p95 queue 36.93s, p95 TTFT 37.61s, p95 decode 29.43s

1.25 req/s with admission (59 unserved): p95 queue 0.285s, p95 TTFT 0.718s, p95 decode 29.39s

Server timing from May 17-18 direct and admission-capped long-context reports
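
A rough way to quantify "queue-dominated" is the ratio of p95 queue time to p95 TTFT, alongside the spread of p95 decode across runs. The sketch below uses only the values reported above. Percentiles are not additive, so the ratio is an indicator rather than an exact decomposition, and the function names are illustrative.

```python
# p95 values in seconds, copied from the queue-attribution readout
TIMING = {
    "1.10 req/s": {"queue": 0.285, "ttft": 0.733, "decode": 29.32},
    "1.15 req/s": {"queue": 14.02, "ttft": 18.12, "decode": 29.41},
    "1.20 req/s": {"queue": 36.93, "ttft": 37.61, "decode": 29.43},
    "1.25 admission": {"queue": 0.285, "ttft": 0.718, "decode": 29.39},
}


def queue_share_of_ttft(row):
    """Fraction of p95 TTFT that p95 queue time would account for (rough indicator)."""
    return row["queue"] / row["ttft"]


def decode_spread(rows):
    """Difference between the largest and smallest p95 decode time across runs."""
    decodes = [r["decode"] for r in rows.values()]
    return max(decodes) - min(decodes)


if __name__ == "__main__":
    for name, row in TIMING.items():
        print(f"{name}: queue share of TTFT = {queue_share_of_ttft(row):.2f}")
    print(f"decode spread = {decode_spread(TIMING):.2f}s")
```

The queue share climbs sharply between 1.10 and 1.20 req/s and falls back under admission, while p95 decode varies by roughly a tenth of a second across all four runs.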

Long-context fix attempt

Scheduler-cap variants did not move the 1.20 req/s knee; bounded admission made overload explicit and lowered p95.

baseline @ 1.20 (edge): 30 waiting / 62 active, p95 latency 54.35s, peak waiting 30

seqs-16 @ 1.20 (worse): 66 waiting / 82 active, p95 latency 76.24s, peak waiting 66

seqs-24 @ 1.20 (worse): 45 waiting / 69 active, p95 latency 61.36s, peak waiting 45

batched-16384 @ 1.20 (no improvement): 31 waiting / 63 active, p95 latency 55.58s, peak waiting 31

admission-032 @ 1.25 (bounded): 59 unserved / 32 active, p95 latency 27.98s, peak waiting 0

Local static readout from May 15-18 KV-cache reports

Cost per useful work

Batching makes small-request serving cheaper, but the burst result still needs admission or more capacity before it is latency-safe.

steady naive (low useful work): cost / 1K successful $0.137976, p95 latency 60.31s

steady optimized (SLO pass): cost / 1K successful $0.019752, p95 latency 1.61s

burst naive (failed profile): cost / 1K successful $0.164137, p95 latency 120.00s

burst optimized (cheap, SLO miss): cost / 1K successful $0.012768, p95 latency 10.91s

Local static readout from curated cost reports
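
The cost figures above can be read as run cost divided by successful work. The sketch below assumes cost per 1K successful is the instance cost for the run, scaled to 1,000 successful units; the source does not state the formula or its inputs, so `cost_per_1k_successful` and its parameters are hypothetical. The `REPORTED` values are copied from the readout, and the savings factor is simple division.

```python
def cost_per_1k_successful(hourly_rate_usd: float, duration_s: float,
                           successful: int) -> float:
    """Assumed formula: run cost scaled to 1,000 successful units of work."""
    if successful <= 0:
        raise ValueError("no successful work; cost per useful work is undefined")
    run_cost = hourly_rate_usd * duration_s / 3600.0
    return 1000.0 * run_cost / successful


# Cost per 1K successful (USD), copied from the cost readout
REPORTED = {
    "steady naive": 0.137976,
    "steady optimized": 0.019752,
    "burst naive": 0.164137,
    "burst optimized": 0.012768,
}


def savings_factor(profile: str) -> float:
    """How many times cheaper the optimized run is than the naive run."""
    return REPORTED[f"{profile} naive"] / REPORTED[f"{profile} optimized"]


if __name__ == "__main__":
    for profile in ("steady", "burst"):
        print(f"{profile}: optimized is {savings_factor(profile):.1f}x cheaper")
```

The burst profile shows the larger savings factor, yet its optimized p95 of 10.91s is still marked as an SLO miss, which is why cost alone does not settle the burst decision.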