Tony Lee

Autoscaling and Queueing Behavior

How much traffic must be buffered while GPU capacity and model readiness catch up?

Project

GPU Inference Decision Lab

Focus

Capacity response

Run shape

4 cases / 1 profile

Evidence

Supported: Admission behavior

Why it matters

This experiment compares direct and bounded-queue client policies during burst and spike-to-zero traffic while GPU capacity and model readiness catch up.

Autoscaling is more than replica count. In the May 7 spike reruns the GPU node was ready at 35s, but the model was not ready until 425-439s because container start and model readiness were slow.

Supported

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Supported: Admission behavior

Latest reports: 2026-05-07

Scale-from-zero queue behavior

Autoscaling reports compare direct and bounded-queue clients across burst and spike-to-zero cases. Bounded queues delivered all work at about 2s p95 with 20-24 active requests. Direct clients held 254-255 requests active, reached p95 latencies above 14s, and dropped 237-787 iterations.

GPU node ready

35s

NodeClaim in 3-12s; pod scheduled at 65s

Model ready

425-439s

container start at 354-357s dominated the cold path

Queued delivery

100%

burst-queued and spike-queued dropped 0 client iterations

Queue policy outcome

Direct clients maximize attempted throughput but shed work; bounded queues protect delivery and tail latency by limiting concurrency.

Case: burst-direct

Delivery: 76.98% delivered

p95 latency: 14.62s

Dropped / active: 787 dropped / 255 active

GPU max: 80%

Case: burst-queued

Delivery: 100% delivered

p95 latency: 2.19s

Dropped / active: 0 dropped / 24 active

GPU max: 86%

Case: spike-direct

Delivery: 88.14% delivered

p95 latency: 14.23s

Dropped / active: 237 dropped / 254 active

GPU max: 80%

Case: spike-queued

Delivery: 100% delivered

p95 latency: 1.98s

Dropped / active: 0 dropped / 20 active

GPU max: 84%
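As a rough cross-check on the case figures above, the delivery percentage and the drop count together imply how many iterations each direct client attempted. The sketch below assumes that the delivered percentage equals 1 minus dropped over attempted, which the reports do not state explicitly; the function name is illustrative only.

```python
def implied_attempted(dropped: int, delivered_pct: float) -> int:
    """Estimate attempted iterations from a drop count and a delivery rate.

    Assumption (not stated in the reports): every attempted iteration is
    either delivered or dropped, so dropped / attempted = 1 - delivered.
    """
    drop_fraction = 1.0 - delivered_pct / 100.0
    if drop_fraction <= 0:
        raise ValueError("no drops recorded; attempted count cannot be inferred")
    return round(dropped / drop_fraction)


# Figures copied from the burst-direct and spike-direct cases.
direct_cases = {
    "burst-direct": (787, 76.98),
    "spike-direct": (237, 88.14),
}

for case, (dropped, delivered_pct) in direct_cases.items():
    print(case, implied_attempted(dropped, delivered_pct))
# burst-direct 3419
# spike-direct 1998
```

If the assumption holds, burst-direct attempted roughly 3,400 iterations and spike-direct roughly 2,000, so the higher drop count in the burst case reflects both a larger workload and a lower delivery rate.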

Spike cold-start timeline

The May 7 spike reruns captured the scale-from-zero timeline and show model readiness as the long pole.

Spike case: spike-direct

NodeClaim: 3s

GPU node: 35s

Container / model: 354s / 425s

Pod scheduled: 65s

Spike case: spike-queued

NodeClaim: 12s

GPU node: 35s

Container / model: 357s / 439s

Pod scheduled: 65s
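The milestone timestamps above can be turned into per-phase durations to see where the cold-start time went. This is a minimal sketch, assuming all timestamps are measured from the same origin and that the milestones occur in the order NodeClaim, GPU node, pod scheduled, container started, model ready; the phase names are labels chosen for this example.

```python
MILESTONES = [
    "nodeclaim",
    "gpu_node",
    "pod_scheduled",
    "container_started",
    "model_ready",
]

# Seconds since the start of each spike run, copied from the timeline above.
RUNS = {
    "spike-direct": [3, 35, 65, 354, 425],
    "spike-queued": [12, 35, 65, 357, 439],
}


def phase_durations(timestamps):
    """Return the time spent reaching each milestone from the previous one."""
    durations = []
    previous = 0
    for t in timestamps:
        durations.append(t - previous)
        previous = t
    return durations


for run, timestamps in RUNS.items():
    durations = phase_durations(timestamps)
    total = timestamps[-1]
    print(run)
    for name, seconds in zip(MILESTONES, durations):
        print(f"  {name:<18} {seconds:>4}s  {seconds / total:.0%}")
```

Under these assumptions, the interval between pod scheduling and container start is 289s and 292s, about two thirds of the total wait in both runs, while NodeClaim and GPU node arrival together account for 35s.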

Evidence boundary

Burst cases predate timeline capture, and first-successful-completion timing is missing. Use spike runs for cold-start timing and all cases for queue policy.

  • The scale-from-zero path was dominated by container/image/model readiness, not NodeClaim or GPU node creation.
  • The cluster tried spot capacity first, but EC2 Spot service-linked-role permissions blocked spot replacement; on-demand nodes served the successful runs.
  • First-successful-completion timing is not in the selected reports.
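The buffering question this experiment asks reduces to simple arithmetic once a readiness wait is known: requests that arrive during the wait must be held somewhere. The sketch below is an estimate under stated assumptions, not a measured result. The arrival rate is hypothetical, and model-ready time stands in for first successful completion, which the selected reports do not capture, so the true requirement is likely somewhat higher.

```python
import math


def required_buffer(arrival_rate_rps: float, readiness_wait_s: float) -> int:
    """Requests that arrive, and must be held, before the model can serve.

    Assumes a constant arrival rate and no serving capacity until readiness.
    """
    return math.ceil(arrival_rate_rps * readiness_wait_s)


HYPOTHETICAL_RATE_RPS = 2.0  # illustrative only; not taken from the reports

# Model-ready times observed in the two spike runs.
for wait_s in (425, 439):
    print(wait_s, required_buffer(HYPOTHETICAL_RATE_RPS, wait_s))
# 425 850
# 439 878
```

The same arithmetic shows why shortening container and model readiness matters more than faster node launch: every second removed from the wait removes one second of arrivals from the buffer.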

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Admission + readiness

Bounded admission

Use bounded admission when requests can arrive before model readiness.

100% queued delivery; direct clients dropped 237-787 iterations.

Decision record

Partial: Admission + readiness

Cold-start readiness

Optimize readiness before treating node launch as the cold-start bottleneck.

NodeClaim and GPU node arrival were fast; image, container, and model readiness drove the 425-439s wait.

Decision record

Partial: Cost + autoscaling

Active-pressure target

Keep active-pressure HPA testing, but do not treat target 8 as optimal.

Targets 2/4/6/8 were all underutilized.

Decision record
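To make the target comparison concrete, the standard Kubernetes HPA rule computes desired replicas as ceil(currentReplicas * currentMetricValue / targetValue). The sketch below applies that rule to the four targets, assuming the active-pressure metric is the average number of in-flight requests per pod; that mapping and the load figure are assumptions, not taken from the reports. It ignores the HPA tolerance band, min/max replica limits, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target: float) -> int:
    """Kubernetes HPA scaling rule, without tolerance or replica bounds."""
    return math.ceil(current_replicas * current_avg / target)


HYPOTHETICAL_ACTIVE = 24  # in-flight requests on a single replica (illustrative)

for target in (2, 4, 6, 8):
    print(target, desired_replicas(1, HYPOTHETICAL_ACTIVE, target))
# 2 12
# 4 6
# 6 4
# 8 3
```

A lower target requests more replicas for the same load, so if every target from 2 to 8 left GPUs underutilized, the sweep has not yet reached a target high enough to saturate a replica.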

Run shape

4 cases, each using 512 prompt tokens and 128 output tokens, paired with 1 run profile.

Cases

4 cases

burst-direct

open-loop overloaded burst that exposes request drops while capacity scales

512 prompt tokens, 128 output tokens

burst-queued

closed-loop burst that approximates bounded client backpressure

512 prompt tokens, 128 output tokens

spike-direct

open-loop spike to zero for provisioning and cooldown timing

512 prompt tokens, 128 output tokens

spike-queued

closed-loop spike to zero with bounded queued admission

512 prompt tokens, 128 output tokens
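The open-loop and closed-loop distinction in the cases above comes down to whether the client caps its own in-flight requests. The following is a minimal sketch of the two client shapes, not the project's load generator: `fake_request` is a hypothetical stand-in for an inference call, the request counts and concurrency are illustrative, and server-side drops are not modelled.

```python
import asyncio


async def fake_request(stats, service_time=0.01):
    """Hypothetical stand-in for one inference call; tracks in-flight count."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(service_time)
    stats["active"] -= 1
    stats["done"] += 1


async def open_loop(total):
    """Direct client: issue every request immediately, with no concurrency cap."""
    stats = {"active": 0, "peak": 0, "done": 0}
    await asyncio.gather(*(fake_request(stats) for _ in range(total)))
    return stats


async def closed_loop(total, concurrency):
    """Bounded client: a fixed set of workers, each with one request in flight."""
    stats = {"active": 0, "peak": 0, "done": 0}
    pending = iter(range(total))  # shared work list; safe in a single event loop

    async def worker():
        for _ in pending:
            await fake_request(stats)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return stats


open_stats = asyncio.run(open_loop(50))
closed_stats = asyncio.run(closed_loop(50, concurrency=8))
print("open-loop peak active:", open_stats["peak"])
print("closed-loop peak active:", closed_stats["peak"])
```

Both clients complete the same work here, but the closed-loop client never exceeds its concurrency limit. That is the mechanism by which the queued cases kept active requests low while the direct cases let them climb.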

Run profiles

1 profile

default

default checked-in serving profile for autoscaling comparisons

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Scaling

first NodeClaim, first GPU node, pod scheduled, container started, model ready, first successful completion

Queueing

dropped iterations, buffering required, failure attribution, peak active requests

Latency and GPU

p95 request latency, p99 request latency, average GPU utilization, max GPU utilization
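For readers unfamiliar with tail-latency metrics, p95 and p99 can be computed from raw per-request latencies with the nearest-rank method. This is a sketch of one common definition; the reports do not say which percentile method the experiment tooling uses, and the sample latencies are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (100, 200, ..., 2000).
latencies_ms = [100 * i for i in range(1, 21)]

print("p95:", percentile(latencies_ms, 95), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
# p95: 1900 ms
# p99: 2000 ms
```

With small sample counts p99 collapses onto the maximum, which is one reason to report the number of completed iterations alongside tail latencies.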

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show autoscaling

Example live command

./scripts/experiment run --experiment autoscaling --case spike-queued --profile default
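To cover all four cases, the live command can be generated per case. The sketch below only builds and prints the command lines; it assumes that `--case` accepts each of the four case names in the same way the example shows for spike-queued, which should be confirmed against the script before running anything on a live cluster.

```python
import shlex

CASES = ["burst-direct", "burst-queued", "spike-direct", "spike-queued"]


def build_command(case, profile="default"):
    """Return the argv list for one live run, mirroring the example command."""
    return [
        "./scripts/experiment", "run",
        "--experiment", "autoscaling",
        "--case", case,
        "--profile", profile,
    ]


for case in CASES:
    # Printing rather than executing keeps this safe to run anywhere.
    print(shlex.join(build_command(case)))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run` once the flags are verified.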