GPU Inference Decision Lab

An EKS/vLLM lab that turns serving measurements into architecture decisions for admission, autoscaling, context limits, scheduling, and quantization.

Serving infrastructure / Measured decision record / 7 experiments

EKS/vLLM measurements support admission, long-context boundaries, scheduler defaults, useful-work cost, and FP8 KV rejection.

At a glance

One public request path, one observable scale path, and an evidence gate that separates supported decisions from open work.

Platform

AWS EKS + vLLM

OpenAI-compatible serving on Karpenter-managed GPU nodes.

Scale signal

Prometheus -> HPA

Serving pressure drives replica targets and pending GPU pods.

Baseline

0 serving GPUs

Serving capacity starts from zero for cold-start proof.

Workflows

Verify / Evaluate / Experiment

Pick the command path that matches the question.

Catalog

7 experiments

Definitions for memory, latency, batching, traffic, autoscaling, cost, and quantization.

Results

Decided + open

Evidence marks supported calls, rejected options, and open gaps.

Pick the right workflow

Verify the path, evaluate platform behavior, or run catalog experiments; each workflow feeds the decision record.

Workflow

Verify

Proves GPU node launch, vLLM readiness, public /v1 success, and cleanup back to zero.

Run after ./scripts/up for a fast path check.

GPU node appears / Public /v1 response / Cleanup to zero
./scripts/verify

Workflow

Evaluate

Compares cold start, warm capacity, HPA policy, active-pressure targets, and resilience drills.

Use for scale signals, profile choices, and target tuning.

Markdown report / JSON report / Decision evidence
./scripts/evaluate --profile zero-idle

Workflow

Experiment

Runs catalog studies for workload cases, serving profiles, metrics, and generated reports.

Use validate/render locally; use run after ./scripts/up for live measurements.

Catalog validation / Rendered manifests / Generated reports
./scripts/experiment validate

Default path

Bring up the platform, prove one public response, then clean it up.

01

Start the lab

./scripts/up

Provision platform pieces before serving starts.

02

Prove the path

./scripts/verify

Confirm one successful public /v1 inference response.

03

Tear it down

./scripts/down

Remove the workload and return serving GPUs to zero.

What the quick start proves

GPU capacity appears

A serving GPU node launches for the workload instead of assuming capacity is already present.

Public inference works

The deployment reaches Ready and returns one successful public /v1 response.

Cleanup returns to zero

Cleanup removes the workload and confirms serving GPU node count returns to zero.
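The "public inference works" check can be pictured as one request against the OpenAI-compatible API. The sketch below is an illustration only, not the contents of ./scripts/verify: the base URL, model name, and helper names are hypothetical, and it assumes the deployment exposes the /v1/completions route that vLLM's OpenAI-compatible server provides.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 8}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_successful(status: int, payload: dict) -> bool:
    """Count HTTP 200 with at least one choice as one successful inference response."""
    return status == 200 and len(payload.get("choices", [])) > 0


def check_once(base_url: str, model: str, timeout_s: float = 30.0) -> bool:
    """Send one request. HTTP error statuses raise urllib.error.HTTPError."""
    req = build_completion_request(base_url, model, "ping")
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return is_successful(resp.status, json.loads(resp.read()))
```

Calling `check_once` with the public ingress hostname and the served model name would return True for one good response; both values depend on the deployment and are not given here.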

Decision contract

Catalog experiments ask the question; evidence gates decide whether the answer is supported, rejected, or still open.

Workload -> Profile -> Run -> Evidence gate -> Decision
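One way to read the evidence gate is as a small function from run evidence to a verdict. The sketch below is a guess at the shape, not the lab's implementation: the `RunEvidence` fields, the `min_repeats` threshold, and the three-way verdict are assumptions, and the finer labels used in the readout (Caveated, Partial, Blocked) are not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    REJECTED = "rejected"
    OPEN = "open"


@dataclass(frozen=True)
class RunEvidence:
    completed: bool        # the run finished and produced measurements
    repeats: int           # number of comparable runs behind the result
    meets_criterion: bool  # the candidate met the experiment's pass criterion


def evidence_gate(evidence: RunEvidence, min_repeats: int = 2) -> Verdict:
    """Decide only when the run completed and was repeated; otherwise stay open."""
    if not evidence.completed or evidence.repeats < min_repeats:
        return Verdict.OPEN
    return Verdict.SUPPORTED if evidence.meets_criterion else Verdict.REJECTED
```

Keeping "open" as the default outcome means a missing or single run can never be mistaken for a rejection.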

How serving scales

Requests stay on one public path while metrics drive capacity in a separate control path.

Foundation

After ./scripts/up, ingress, observability, and GPU admission are ready while serving GPU nodes stay at zero.

Setup via scripts/up
System nodes
Public ingress hostname
Prometheus
Adapter / custom metrics API
GPU NodePools
GPU nodes: 0

Serve path

Public requests follow the same edge-to-ready-replica path.

Internet -> ALB -> Ingress -> Service -> Ready vLLM pod

Scale path

Serving pressure becomes custom metrics, HPA targets, pending pods, and Karpenter GPU capacity.

vLLM metrics -> Prometheus -> Adapter -> HPA desired replicas -> Pending GPU pod -> Karpenter / NodeClaim -> GPU node -> Second Ready vLLM pod
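The "HPA desired replicas" step follows the documented Kubernetes rule: desired = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured bounds. The sketch below reproduces that arithmetic only. The replica bounds and the sample values are assumptions, and it ignores stabilization windows and scaling policies.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 4,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation for one metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With a target of 8 and one replica reporting a pressure of 16, the function asks for 2 replicas. The extra pod is what becomes the pending GPU pod that Karpenter answers with a node.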

Rejoin point

New replicas join the same Service, so the public path stays stable as capacity grows.

Second Ready vLLM pod -> Same Service

Architecture Readout

Workload measurements map to direct serving calls, with the boundary called out when the evidence is partial or blocked.

Admission

Use bounded admission when traffic can arrive before the model is ready.

Queued burst and spike runs delivered 100%; direct clients dropped work.

Supported
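Bounded admission can be sketched as a fixed-size buffer that holds requests until a replica is ready and sheds only when the buffer is full. This is a minimal illustration under assumed names (`BoundedAdmission`, `admit`, `drain`); the lab's actual admission component and its limits are not described here.

```python
from collections import deque
from typing import Deque, List


class BoundedAdmission:
    """Hold requests while the backend is not ready; shed only when the buffer is full."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self._capacity = capacity
        self._queue: Deque[str] = deque()

    def admit(self, request_id: str) -> bool:
        """Return True if the request was queued, False if it was shed."""
        if len(self._queue) >= self._capacity:
            return False
        self._queue.append(request_id)
        return True

    def drain(self, backend_ready: bool, max_batch: int) -> List[str]:
        """Release up to max_batch queued requests in arrival order once the backend is ready."""
        if not backend_ready:
            return []
        released: List[str] = []
        while self._queue and len(released) < max_batch:
            released.append(self._queue.popleft())
        return released
```

The bound is what makes the behavior predictable: requests that arrive during a cold start wait instead of failing, and overload shows up as explicit shedding rather than unbounded queue growth.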

Long context

Set a long-context boundary before failures appear.

At 1.15 req/s, queueing reappears on repeat runs; at 1.20 req/s, repeat runs again show a 36.8 s p95 server queue delay.

Supported

LC caps

Do not use scheduler caps as the first fix for the 1.20 req/s long-context knee.

seqs-16, seqs-24, and batched-16384 were worse or unchanged versus baseline p95.

Rejected

Scheduler

Keep vLLM dynamic defaults for small steady and burst traffic.

Explicit sequence and batched-token caps under-delivered on the 512/128 matrix.

Supported

Cost

Treat cheap burst runs as incomplete unless latency passes.

Optimized batching lowers cost per useful request; burst p95 still misses the SLO.

Caveated
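The caveat can be made concrete by reporting cost and latency together, so a cheap run never reads as a pass on its own. The sketch below assumes that "useful" means delivered requests, that cost is node-hours times an hourly price, and that the SLO is a p95 bound; the function names and all sample numbers are illustrative, not lab results.

```python
from typing import Dict, List, Union


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def cost_readout(
    node_hourly_usd: float,
    duration_s: float,
    delivered_latencies_s: List[float],
    slo_p95_s: float,
) -> Dict[str, Union[float, bool]]:
    """Pair cost per delivered request with a latency pass/fail flag."""
    if not delivered_latencies_s:
        raise ValueError("no delivered requests to price")
    run_cost = node_hourly_usd * duration_s / 3600.0
    observed_p95 = p95(delivered_latencies_s)
    return {
        "cost_per_delivered_request_usd": run_cost / len(delivered_latencies_s),
        "p95_s": observed_p95,
        "latency_passes": observed_p95 <= slo_p95_s,
    }
```

A run whose `latency_passes` flag is False stays incomplete no matter how low its cost per delivered request is.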

Autoscaling

Keep active-pressure HPA in the matrix, but do not call target 8 production-optimal.

All target 2/4/6/8 sweep points stayed underutilized.

Partial

KV dtype

Reject FP8 KV on the current g4dn/vLLM path.

FP8 KV reduced delivery and tokens/sec versus the stable baseline.

Rejected

FP4

Hold Blackwell FP4 until B200 capacity produces comparable runs.

The p6-b200 live attempt was blocked before a quantized artifact existed.

Blocked