Autoscaling and Queueing Behavior
How much traffic must be buffered while GPU capacity and model readiness catch up?
Project: GPU Inference Decision Lab
Focus: Capacity response
Run shape: 4 cases / 1 profile
Evidence: Supported (Admission behavior)
Why it matters
This experiment compares direct and bounded-queue client policies during burst and spike-to-zero traffic while GPU capacity and model readiness catch up.
Autoscaling is more than replica count. The May 7 spike reruns show fast GPU node launch (node ready at 35s) but slow container and model readiness (model ready at 425-439s).
Supported
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Scale-from-zero queue behavior
Autoscaling reports compare direct and bounded-queue clients across burst and spike-to-zero cases. Bounded queues delivered all work at about 2s p95 (1.98-2.19s). Direct clients kept roughly 255 requests active, reached p95 latencies above 14s, and dropped 237-787 iterations.
GPU node ready: 35s (NodeClaim in 3-12s; pod scheduled at 65s)
Model ready: 425-439s (container start at 354-357s dominated the cold path)
Queued delivery: 100% (burst-queued and spike-queued dropped 0 client iterations)
Queue policy outcome
Direct clients maximize attempted throughput but shed work; bounded queues protect delivery and tail latency by limiting concurrency.
Case: burst-direct
Delivery: 76.98% delivered
p95 latency: 14.62s
Dropped / active: 787 dropped / 255 active
GPU max: 80%
Case: burst-queued
Delivery: 100% delivered
p95 latency: 2.19s
Dropped / active: 0 dropped / 24 active
GPU max: 86%
Case: spike-direct
Delivery: 88.14% delivered
p95 latency: 14.23s
Dropped / active: 237 dropped / 254 active
GPU max: 80%
Case: spike-queued
Delivery: 100% delivered
p95 latency: 1.98s
Dropped / active: 0 dropped / 20 active
GPU max: 84%
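The difference between the direct and bounded-queue policies can be sketched as a client-side concurrency cap. This is a minimal illustration and not the lab's load generator: `fake_inference`, `run_clients`, and the sleep-based request are hypothetical stand-ins, and the limit of 24 is borrowed from the burst-queued active count only as an example value.

```python
import asyncio
from typing import Optional


async def fake_inference(delay_s: float) -> None:
    """Hypothetical stand-in for a model request; a real client would call the serving endpoint."""
    await asyncio.sleep(delay_s)


async def run_clients(total_requests: int, max_active: Optional[int], delay_s: float = 0.001) -> int:
    """Issue total_requests concurrently and return the peak number in flight.

    max_active=None models a direct client with no admission bound.
    An integer models a bounded queue that holds excess work on the client side.
    """
    gate = asyncio.Semaphore(max_active) if max_active is not None else None
    active = 0
    peak = 0

    async def one_request() -> None:
        nonlocal active, peak
        if gate is not None:
            await gate.acquire()  # wait here instead of piling onto the server
        active += 1
        peak = max(peak, active)
        try:
            await fake_inference(delay_s)
        finally:
            active -= 1
            if gate is not None:
                gate.release()

    await asyncio.gather(*(one_request() for _ in range(total_requests)))
    return peak


if __name__ == "__main__":
    print("direct peak in flight:", asyncio.run(run_clients(255, None)))
    print("bounded peak in flight:", asyncio.run(run_clients(255, 24)))
```

In this sketch the bounded client never has more than `max_active` requests outstanding, which is the mechanism the queued cases rely on to trade attempted throughput for delivery and tail latency. It does not model server-side drops or timeouts.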
Spike cold-start timeline
The May 7 spike reruns captured the scale-from-zero timeline and show model readiness as the long pole.
Spike case: spike-direct
NodeClaim: 3s
GPU node: 35s
Pod scheduled: 65s
Container / model: 354s / 425s
Spike case: spike-queued
NodeClaim: 12s
GPU node: 35s
Pod scheduled: 65s
Container / model: 357s / 439s
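To see where the cold-start time goes, the milestones above can be turned into per-phase durations. The sketch below assumes the reported values are cumulative seconds from a shared start and that the milestones occur in the order NodeClaim, GPU node, pod scheduled, container start, model ready; the names `TIMELINES`, `phase_durations`, and `longest_phase` are illustrative and not part of the project tooling.

```python
from typing import Dict, List, Tuple

Milestones = List[Tuple[str, int]]

# Cumulative seconds copied from the May 7 spike reruns (assumed to share one start time).
TIMELINES: Dict[str, Milestones] = {
    "spike-direct": [
        ("NodeClaim", 3),
        ("GPU node", 35),
        ("Pod scheduled", 65),
        ("Container start", 354),
        ("Model ready", 425),
    ],
    "spike-queued": [
        ("NodeClaim", 12),
        ("GPU node", 35),
        ("Pod scheduled", 65),
        ("Container start", 357),
        ("Model ready", 439),
    ],
}


def phase_durations(milestones: Milestones) -> Milestones:
    """Convert cumulative milestone times into the time spent reaching each milestone."""
    durations = []
    previous = 0
    for name, at_s in milestones:
        durations.append((name, at_s - previous))
        previous = at_s
    return durations


def longest_phase(milestones: Milestones) -> Tuple[str, int]:
    """Return the milestone that took longest to reach from the one before it."""
    return max(phase_durations(milestones), key=lambda item: item[1])


if __name__ == "__main__":
    for case, milestones in TIMELINES.items():
        print(case, phase_durations(milestones), "longest:", longest_phase(milestones))
```

Under these assumptions the gap between pod scheduling and container start (289s and 292s) is the largest single phase in both runs, which matches the reading that readiness rather than node launch is the long pole.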
Evidence boundary
Burst cases predate timeline capture, and first-successful-completion timing is missing. Use spike runs for cold-start timing and all cases for queue policy.
- The scale-from-zero path was dominated by container/image/model readiness, not NodeClaim or GPU node creation.
- The cluster tried spot capacity first, but EC2 Spot service-linked-role permissions blocked spot replacement; on-demand nodes served the successful runs.
- First-successful-completion timing is not in the selected reports.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Bounded admission
Use bounded admission when requests can arrive before model readiness.
100% queued delivery; direct clients dropped 237-787 iterations.
Decision record
Cold-start readiness
Optimize readiness before treating node launch as the cold-start bottleneck.
NodeClaim and GPU node arrival were fast; image, container, and model readiness drove the 425-439s wait.
Decision record
Active-pressure target
Keep active-pressure HPA testing, but do not treat target 8 as optimal.
Targets 2/4/6/8 were all underutilized.
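For context on why the target value matters, the replica arithmetic documented for the Kubernetes Horizontal Pod Autoscaler can be sketched as follows. This is a simplified illustration: it assumes the active-pressure metric is an average per-pod value, the function name `desired_replicas` and the replica bounds are hypothetical, and the inputs in the usage lines are made-up numbers rather than values from the reports.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: ceil(currentReplicas * currentMetric / targetMetric).

    Ratios within `tolerance` of 1.0 leave the replica count unchanged,
    mirroring the default HPA tolerance. Bounds are illustrative.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical load: 2 replicas each seeing 12 active requests.
    for target in (2, 4, 6, 8):
        print("target", target, "->", desired_replicas(2, 12, target, max_replicas=20))
```

A lower target asks for more replicas at the same load, so a target that leaves GPUs underutilized mostly indicates that the metric is not yet tracking the real bottleneck. The sketch ignores stabilization windows, readiness handling, and scale-from-zero, which a plain HPA does not cover.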
Decision record
Run shape
4 cases, each using 512 prompt tokens and 128 output tokens, paired with 1 run profile.
Cases
4 cases
burst-direct: open-loop overloaded burst that exposes request drops while capacity scales
burst-queued: closed-loop burst that approximates bounded client backpressure
spike-direct: open-loop spike to zero for provisioning and cooldown timing
spike-queued: closed-loop spike to zero with bounded queued admission
Run profiles
1 profile
default
default checked-in serving profile for autoscaling comparisons
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Scaling
Queueing
Latency and GPU
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show autoscaling
Example live command
./scripts/experiment run --experiment autoscaling --case spike-queued --profile default
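To cover all four cases, the live command can be generated per case. The helper below only builds and prints command lines using the flags shown in the example; it assumes the other case names are accepted by `--case` in the same form, and `live_command` is an illustrative name rather than part of the repository.

```python
import shlex
from typing import List

# Case names as listed under Run shape.
CASES = ["burst-direct", "burst-queued", "spike-direct", "spike-queued"]


def live_command(case: str, profile: str = "default", experiment: str = "autoscaling") -> List[str]:
    """Build the argv for one live-cluster run, mirroring the documented example."""
    return [
        "./scripts/experiment", "run",
        "--experiment", experiment,
        "--case", case,
        "--profile", profile,
    ]


if __name__ == "__main__":
    # Print only; nothing is executed against a cluster.
    for case in CASES:
        print(shlex.join(live_command(case)))
```

Printing rather than executing keeps the sketch safe to run anywhere; passing each list to `subprocess.run` would launch the runs, which provision GPU capacity.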