GPU Inference Decision Lab
An EKS/vLLM lab that turns serving measurements into architecture decisions for admission, autoscaling, context limits, scheduling, and quantization.
The measurements so far support calls on admission, long-context boundaries, scheduler defaults, and useful-work cost, and they reject FP8 KV on the current path.
At a glance
One public request path, one observable scale path, and an evidence gate that separates supported decisions from open work.
Platform
AWS EKS + vLLM
OpenAI-compatible serving on Karpenter-managed GPU nodes.
Scale signal
Prometheus -> HPA
Serving pressure drives replica targets and pending GPU pods.
Baseline
0 serving GPUs
Serving capacity starts from zero for cold-start proof.
Workflows
Verify / Evaluate / Experiment
Pick the command path that matches the question.
Catalog
7 experiments
Definitions for memory, latency, batching, traffic, autoscaling, cost, and quantization.
Results
Decided + open
Evidence marks supported calls, rejected options, and open gaps.
Pick the right workflow
Verify the path, evaluate platform behavior, or run catalog experiments; each workflow feeds the decision record.
Workflow
Verify
Proves GPU node launch, vLLM readiness, public /v1 success, and cleanup back to zero.
Run after ./scripts/up for a fast path check.
./scripts/verify
Workflow
Evaluate
Compares cold start, warm capacity, HPA policy, active-pressure targets, and resilience drills.
Use for scale signals, profile choices, and target tuning.
./scripts/evaluate --profile zero-idle
Workflow
Experiment
Runs catalog studies for workload cases, serving profiles, metrics, and generated reports.
Use validate/render locally; use run after ./scripts/up for live measurements.
./scripts/experiment validate
Default path
Bring up the platform, prove one public response, then clean it up.
Start the lab
./scripts/up
Provision platform pieces before serving starts.
Prove the path
./scripts/verify
Confirm one successful public /v1 inference response.
Tear it down
./scripts/down
Remove the workload and return serving GPUs to zero.
What the quick start proves
GPU capacity appears
A serving GPU node launches for the workload instead of assuming capacity is already present.
Public inference works
The deployment reaches Ready and returns one successful public /v1 response.
Cleanup returns to zero
Cleanup removes the workload and confirms serving GPU node count returns to zero.
Decision contract
Catalog experiments ask the question; evidence gates decide whether the answer is supported, rejected, or still open.
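As a rough illustration of the contract above, an evidence gate can be thought of as a function from run evidence to one of three verdicts. The sketch below is hypothetical: the `Evidence` fields, the `gate` function, and the rule "partial evidence stays open" are assumptions made for illustration, not the lab's actual schema or logic.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    REJECTED = "rejected"
    OPEN = "open"


@dataclass
class Evidence:
    # Hypothetical fields; the lab's real evidence schema is not shown here.
    runs_completed: int   # comparable runs that actually finished
    runs_required: int    # runs needed before any call is made
    beats_baseline: bool  # candidate met or beat the baseline on the gated metric


def gate(evidence: Evidence) -> Verdict:
    """Return OPEN while evidence is partial, else SUPPORTED or REJECTED."""
    if evidence.runs_completed < evidence.runs_required:
        return Verdict.OPEN
    return Verdict.SUPPORTED if evidence.beats_baseline else Verdict.REJECTED
```

Under these assumptions, a study that was blocked before producing comparable runs stays open, while one that completed its runs and fell short of the baseline is rejected.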
How serving scales
Requests stay on one public path while metrics drive capacity in a separate control path.
Foundation
After ./scripts/up, ingress, observability, and GPU admission are ready while serving GPU nodes stay at zero.
Serve path
Public requests follow the same edge-to-ready-replica path.
Scale path
Serving pressure becomes custom metrics, HPA targets, pending pods, and Karpenter GPU capacity.
Rejoin point
New replicas join the same Service, so the public path stays stable as capacity grows.
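To make the scale path concrete, here is a minimal sketch of the replica calculation that the Kubernetes HPA documents: desired replicas are the ceiling of current replicas times the ratio of the observed metric to the target, with no change when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the example metric values are assumptions for illustration; the lab's actual metric names and HPA settings are not shown here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int = 1,
                     max_replicas: int = 4,
                     tolerance: float = 0.1) -> int:
    """Sketch of the documented HPA rule: ceil(current * observed / target)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values by default).
    return max(min_replicas, min(max_replicas, desired))


# One replica seeing pressure 12 against a target of 4 asks for 3 replicas.
print(desired_replicas(1, 12.0, 4.0))
```

Any replicas requested beyond the current GPU capacity would sit Pending, which is the signal Karpenter acts on to add GPU nodes.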
Architecture Readout
Workload measurements map to direct serving calls, with the boundary called out when the evidence is partial or blocked.
Admission
Use bounded admission when traffic can arrive before the model is ready.
Queued burst and spike runs delivered 100%; direct clients dropped work.
Long context
Set a long-context boundary before failures appear.
At 1.15 req/s, queueing repeats across runs; at 1.20 req/s, the p95 server queue delay repeats at 36.8 s.
LC caps
Do not use scheduler caps as the first fix for the 1.20 req/s long-context knee.
seqs-16, seqs-24, and batched-16384 were worse or unchanged versus baseline p95.
Scheduler
Keep vLLM dynamic defaults for small steady and burst traffic.
Explicit sequence and batched-token caps under-delivered on the 512/128 matrix.
Cost
Treat cheap burst runs as incomplete unless latency passes.
Optimized batching lowers cost per useful request; burst p95 still misses the SLO.
Autoscaling
Keep active-pressure HPA in the matrix, but do not call target 8 production-optimal.
Every sweep point (targets 2, 4, 6, and 8) stayed underutilized.
KV dtype
Reject FP8 KV on the current g4dn/vLLM path.
FP8 KV reduced delivery and tokens/sec versus the stable baseline.
FP4
Hold Blackwell FP4 until B200 capacity produces comparable runs.
The p6-b200 live attempt was blocked before a quantized artifact existed.
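The Cost call in the readout above pairs a price figure with a latency check. A minimal sketch of that pairing follows, assuming that "useful" means requests that completed successfully and that the SLO is a p95 latency bound. The function names, the nearest-rank percentile, and all inputs are hypothetical and do not come from the lab's reports.

```python
def cost_per_useful_request(gpu_hourly_usd: float, gpu_count: int,
                            run_seconds: float, useful_requests: int) -> float:
    """GPU spend for the run divided by requests that counted as useful."""
    if useful_requests <= 0:
        raise ValueError("no useful requests; cost per request is undefined")
    return gpu_hourly_usd * gpu_count * (run_seconds / 3600) / useful_requests


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile using integer arithmetic for the rank."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def latency_passes(latencies_s: list[float], slo_p95_s: float) -> bool:
    """A cheap run only counts as complete when p95 meets the SLO."""
    return p95(latencies_s) <= slo_p95_s
```

In this framing, a low cost figure is reported only alongside the result of `latency_passes`, so a cheap burst run that misses its p95 bound is recorded as incomplete.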