Tony Lee

Cost per Useful Work

How much cheaper does the same GPU become when concurrency and batching produce more successful work?

Project

GPU Inference Decision Lab

Focus

Cost efficiency

Run shape

2 cases / 2 profiles

Evidence

Supported: Useful-work cost

Why it matters

Compares naive and optimized serving profiles with cost tied to successful requests and generated tokens.

A cheaper run is not cheaper if it drops work or misses latency goals. This matrix makes cost, useful work, and SLO status visible together.
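As a minimal sketch of that idea, the snippet below assumes cost per 1K successful requests is the GPU-only serving cost divided by successful requests and scaled to 1,000. The function names, the hourly GPU price, and the run totals are hypothetical illustrations; they are not taken from the lab's code or reports.

```python
def gpu_serving_cost(gpu_hourly_usd: float, run_seconds: float) -> float:
    """GPU-only serving cost for one run (price and duration are hypothetical inputs)."""
    return gpu_hourly_usd * run_seconds / 3600


def cost_per_1k_successful(serving_cost_usd: float, successful_requests: int) -> float:
    """Serving cost divided by successful requests, scaled to 1,000 requests."""
    if successful_requests <= 0:
        raise ValueError("no successful requests: useful-work cost is undefined")
    return serving_cost_usd / successful_requests * 1000


# Hypothetical comparison: run B costs half as much as run A in total,
# but completes only a quarter of the successful requests.
run_a = cost_per_1k_successful(serving_cost_usd=0.10, successful_requests=2000)
run_b = cost_per_1k_successful(serving_cost_usd=0.05, successful_requests=500)

print(f"run A: ${run_a:.4f} per 1K successful requests")
print(f"run B: ${run_b:.4f} per 1K successful requests")
```

In this made-up comparison, the run with the lower total bill ends up more expensive per unit of useful work, which is the case the matrix is meant to expose.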

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Supported: Useful-work cost

Latest reports: 2026-05-15

Useful work beats raw cheapness

Optimized batching sharply reduced cost per successful request for steady and burst traffic, but only the steady optimized run passed the latency SLO.

Steady optimized

$0.019752

per 1K successful requests; SLO pass

Naive steady

$0.137976

per 1K successful requests; 2227 dropped

Burst optimized

10.91s p95

$0.012768 per 1K, but SLO miss

Useful-work cost matrix

Optimized batching wins on cost per unit of useful work, but burst latency still fails the serving SLO.

Run               Successful / dropped   p95 latency   Cost / 1K   SLO
steady naive      413 / 2227             60.31s        $0.137976   miss
steady optimized  2670 / 0               1.61s         $0.019752   pass
burst naive       227 / 2882             120.00s       $0.164137   miss
burst optimized   2570 / 677             10.91s        $0.012768   miss
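The matrix can be read programmatically. The sketch below copies the four reported rows into a small Python structure and derives two things the matrix implies: the success rate of each run and the ratio of naive to optimized cost per 1K. The `Run` class and helper names are illustrative assumptions; only the numbers come from the matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    successful: int
    dropped: int
    p95_s: float
    cost_per_1k: float
    slo_pass: bool

    @property
    def success_rate(self) -> float:
        """Fraction of submitted requests that succeeded."""
        return self.successful / (self.successful + self.dropped)


# Values copied from the useful-work cost matrix.
RUNS = [
    Run("steady naive", 413, 2227, 60.31, 0.137976, False),
    Run("steady optimized", 2670, 0, 1.61, 0.019752, True),
    Run("burst naive", 227, 2882, 120.00, 0.164137, False),
    Run("burst optimized", 2570, 677, 10.91, 0.012768, False),
]
BY_NAME = {run.name: run for run in RUNS}


def cost_reduction(naive: Run, optimized: Run) -> float:
    """How many times cheaper the optimized run is per 1K successful requests."""
    return naive.cost_per_1k / optimized.cost_per_1k


steady_ratio = cost_reduction(BY_NAME["steady naive"], BY_NAME["steady optimized"])
burst_ratio = cost_reduction(BY_NAME["burst naive"], BY_NAME["burst optimized"])

# Cheapest run overall versus cheapest run that also passed the SLO.
cheapest_overall = min(RUNS, key=lambda run: run.cost_per_1k)
cheapest_passing = min((run for run in RUNS if run.slo_pass), key=lambda run: run.cost_per_1k)

for run in RUNS:
    print(f"{run.name}: {run.success_rate:.1%} successful, SLO {'pass' if run.slo_pass else 'miss'}")
print(f"steady cost reduction: {steady_ratio:.2f}x, burst: {burst_ratio:.2f}x")
print(f"cheapest overall: {cheapest_overall.name}; cheapest passing SLO: {cheapest_passing.name}")
```

On the reported figures, the optimized profile comes out roughly 7x cheaper per 1K successful requests for steady traffic and roughly 12.9x cheaper for burst traffic. Filtering on SLO status changes the cheapest choice from burst optimized to steady optimized.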

Evidence boundary

Costs include the serving GPU model only. They do not include control plane, networking, storage, observability, idle platform cost, or operator time.

  • The steady optimized profile completed 2670 successful requests with no dropped work and p95 1.61s.
  • The burst optimized profile was cheapest per useful request but still missed latency with p95 10.91s.
  • Burst SLO compliance still needs admission control, autoscaling, or a different capacity shape.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Cost + autoscaling

Useful-work cost

Use batching for small-request economics, but gate burst SLO claims.

$0.019752/1K steady optimized; burst optimized p95 still 10.91s.

Decision record

Run shape

2 cases, each with 512 prompt tokens and 128 output tokens, paired with 2 run profiles.

Cases

2 cases

steady-cost-efficiency

steady workload for comparing useful work per serving dollar

512 prompt tokens, 128 output tokens

burst-cost-efficiency

burst workload for comparing cost efficiency under tail-latency pressure

512 prompt tokens, 128 output tokens

Run profiles

2 profiles

naive-single

reference profile with one active sequence, giving low useful work per GPU

optimized-batched

higher sequence and batched-token limits for better useful work per GPU

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Useful work

completed requests, successful requests, generated tokens, requests/sec, generation tokens/sec

Cost

estimated burst cost, cost per 1K successful requests, cost per 1M generated tokens

SLO

p95 request latency, p99 request latency, SLO passed
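The sketch below shows one way these metric groups could be derived from per-request records. It makes several assumptions that the page does not confirm: the `RequestRecord` shape is hypothetical, latency percentiles use the nearest-rank method over successful requests only, every record counts as a completed request, and the SLO is a single p95 threshold passed in by the caller (the actual threshold is not stated here).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    latency_s: float
    generated_tokens: int
    ok: bool


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord], run_seconds: float,
              serving_cost_usd: float, p95_slo_s: float) -> dict:
    """Derive useful-work, cost, and SLO metrics for one run."""
    ok = [r for r in records if r.ok]
    tokens = sum(r.generated_tokens for r in ok)
    if not ok or tokens == 0:
        raise ValueError("no successful work: cost per useful work is undefined")
    latencies = [r.latency_s for r in ok]
    p95 = percentile(latencies, 95)
    return {
        "completed_requests": len(records),  # assumption: every record completed
        "successful_requests": len(ok),
        "generated_tokens": tokens,
        "requests_per_sec": len(ok) / run_seconds,
        "generation_tokens_per_sec": tokens / run_seconds,
        "cost_per_1k_successful_requests": serving_cost_usd / len(ok) * 1000,
        "cost_per_1m_generated_tokens": serving_cost_usd / tokens * 1_000_000,
        "p95_latency_s": p95,
        "p99_latency_s": percentile(latencies, 99),
        "slo_passed": p95 <= p95_slo_s,  # hypothetical single-threshold SLO
    }
```

A call such as `summarize(records, run_seconds=60.0, serving_cost_usd=0.02, p95_slo_s=2.0)` would return one dictionary per run, which could then be placed side by side the way the cost matrix does.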

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show cost

Example live command

./scripts/experiment run --experiment cost --case steady-cost-efficiency --profile optimized-batched
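To cover the full 2 cases by 2 profiles matrix, the live command can be generated for each pairing. The sketch below only builds and prints the command lines; it does not execute anything. It assumes the CLI accepts every case and profile combination with the same flags as the single example above, which this page does not confirm.

```python
import itertools
import shlex

# Case and profile names as listed under "Run shape".
CASES = ["steady-cost-efficiency", "burst-cost-efficiency"]
PROFILES = ["naive-single", "optimized-batched"]


def build_commands(experiment: str = "cost") -> list[list[str]]:
    """Return one argv list per case/profile pairing (assumed flag layout)."""
    return [
        ["./scripts/experiment", "run",
         "--experiment", experiment,
         "--case", case,
         "--profile", profile]
        for case, profile in itertools.product(CASES, PROFILES)
    ]


for argv in build_commands():
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    print(shlex.join(argv))
```

Printing rather than running keeps the sketch safe to try without a live cluster; each printed line can be reviewed before it is executed by hand.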