Cost per Useful Work
How much cheaper does the same GPU become when concurrency and batching produce more successful work?
Project
GPU Inference Decision Lab
Focus
Cost efficiency
Run shape
2 cases / 2 profiles
Evidence
Supported: Useful-work cost
Why it matters
Compares naive and optimized serving profiles with cost tied to successful requests and generated tokens.
A cheaper run is not cheaper if it drops work or misses latency goals. This matrix makes cost, useful work, and SLO status visible together.
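As a minimal sketch of the useful-work denominator, assume cost per 1K is the run's GPU spend divided by its successful requests, scaled to 1,000. The page does not spell out its formula, so the function name and the spend figures below are hypothetical illustrations, not values from the runs.

```python
def cost_per_1k_successful(run_cost_usd: float, successful: int) -> float:
    """GPU spend divided by successful requests, scaled to 1,000 requests."""
    if successful <= 0:
        # With no useful work, the metric is undefined rather than "free".
        raise ValueError("no successful requests: cost per useful work is undefined")
    return run_cost_usd / successful * 1000


# Hypothetical illustration (not data from this experiment): two runs that
# spend the same amount of GPU time but complete different amounts of work.
same_spend_usd = 0.05
low_throughput = cost_per_1k_successful(same_spend_usd, 400)
high_throughput = cost_per_1k_successful(same_spend_usd, 2500)
print(f"400 successful:  ${low_throughput:.6f} per 1K")
print(f"2500 successful: ${high_throughput:.6f} per 1K")
```

Dividing by successful requests rather than attempted requests is what makes dropped work show up as a higher cost.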
Supported
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Useful work beats raw cheapness
Optimized batching sharply reduced cost per successful request for steady and burst traffic, but only the steady optimized run passed the latency SLO.
Steady optimized
$0.019752
per 1K successful requests; SLO pass
Naive steady
$0.137976
per 1K successful requests; 2227 dropped
Burst optimized
10.91s p95
$0.012768 per 1K, but SLO miss
Useful-work cost matrix
Optimized batching wins the useful-work denominator, but burst latency still fails the serving SLO.
Run               Successful / dropped  p95 latency  Cost / 1K   SLO
steady naive      413 / 2227            60.31s       $0.137976   miss
steady optimized  2670 / 0              1.61s        $0.019752   pass
burst naive       227 / 2882            120.00s      $0.164137   miss
burst optimized   2570 / 677            10.91s       $0.012768   miss
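The comparison in the matrix can be reproduced with a short sketch. The numbers are copied from the four runs above. The `Run` type and helper names are illustrative, and the success fraction assumes that successful plus dropped covers every attempted request, which the page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    successful: int
    dropped: int
    p95_s: float
    cost_per_1k: float  # USD per 1K successful requests
    slo_pass: bool


# Values copied from the useful-work cost matrix.
RUNS = [
    Run("steady naive", 413, 2227, 60.31, 0.137976, False),
    Run("steady optimized", 2670, 0, 1.61, 0.019752, True),
    Run("burst naive", 227, 2882, 120.00, 0.164137, False),
    Run("burst optimized", 2570, 677, 10.91, 0.012768, False),
]


def success_fraction(run: Run) -> float:
    # Assumption: successful + dropped is the full set of attempted requests.
    return run.successful / (run.successful + run.dropped)


def cost_ratio(naive: Run, optimized: Run) -> float:
    """How many times cheaper the optimized run is per 1K successful requests."""
    return naive.cost_per_1k / optimized.cost_per_1k


by_name = {run.name: run for run in RUNS}
for workload in ("steady", "burst"):
    naive = by_name[f"{workload} naive"]
    optimized = by_name[f"{workload} optimized"]
    print(
        f"{workload}: {cost_ratio(naive, optimized):.2f}x cheaper per 1K successful, "
        f"success {success_fraction(naive):.1%} -> {success_fraction(optimized):.1%}, "
        f"optimized SLO {'pass' if optimized.slo_pass else 'miss'}"
    )
```

Printing the SLO verdict next to the ratio keeps the burst result from reading as an unqualified win.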
Evidence boundary
Costs include the serving GPU model only. They do not include control plane, networking, storage, observability, idle platform cost, or operator time.
- The steady optimized profile completed 2670 successful requests with no dropped work and p95 1.61s.
- The burst optimized profile was cheapest per useful request but still missed latency with p95 10.91s.
- Burst SLO compliance still needs admission control, autoscaling, or a different capacity shape.
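One way to encode the gate these findings imply is to pick the cheapest profile per workload but mark it claimable only if it passed the SLO. This is a sketch, not the project's decision logic: `recommend` and `MATRIX` are hypothetical names, and the pass/miss flags are the page's own verdicts, since the SLO threshold itself is not given here.

```python
# (workload, profile) -> (USD per 1K successful requests, SLO passed?)
# Costs and verdicts are copied from the reported runs.
MATRIX = {
    ("steady", "naive"): (0.137976, False),
    ("steady", "optimized"): (0.019752, True),
    ("burst", "naive"): (0.164137, False),
    ("burst", "optimized"): (0.012768, False),
}


def recommend(workload: str) -> tuple[str, float, bool]:
    """Return (profile, cost_per_1k, claimable) for the cheapest profile.

    claimable is True only when that profile also passed the SLO, so a
    cheap-but-late run is reported without being endorsed.
    """
    candidates = {
        profile: result
        for (wl, profile), result in MATRIX.items()
        if wl == workload
    }
    if not candidates:
        raise KeyError(f"unknown workload: {workload}")
    profile = min(candidates, key=lambda p: candidates[p][0])
    cost, slo_pass = candidates[profile]
    return profile, cost, slo_pass


for workload in ("steady", "burst"):
    profile, cost, claimable = recommend(workload)
    status = "claimable" if claimable else "cheapest, but SLO missed"
    print(f"{workload}: {profile} at ${cost:.6f}/1K ({status})")
```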
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Useful-work cost
Use batching for small-request economics, but gate burst SLO claims.
$0.019752/1K steady optimized; burst optimized p95 still 10.91s.
Decision record
Run shape
2 cases at 512 prompt tokens and 128 output tokens, paired with 2 run profiles.
Cases
2 cases
steady-cost-efficiency
steady workload for comparing useful work per serving dollar
burst-cost-efficiency
burst workload for comparing cost efficiency under tail-latency pressure
Run profiles
2 profiles
naive-single
reference profile with one active sequence, giving low useful work per GPU
optimized-batched
higher sequence and batched-token limits for better useful work per GPU
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Useful work
Cost
SLO
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show cost
Example live command
./scripts/experiment run --experiment cost --case steady-cost-efficiency --profile optimized-batched
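To cover the full 2 x 2 matrix rather than a single cell, the live command can be generated for every case and profile pair. The sketch below only builds and prints the command lines; it assumes the other three combinations accept the same flags as the example above, which this page does not confirm.

```python
import itertools
import shlex

# Case and profile names as listed under "Run shape".
CASES = ["steady-cost-efficiency", "burst-cost-efficiency"]
PROFILES = ["naive-single", "optimized-batched"]


def matrix_commands(experiment: str = "cost") -> list[list[str]]:
    """Build one argv list per (case, profile) pair, mirroring the example command."""
    return [
        [
            "./scripts/experiment", "run",
            "--experiment", experiment,
            "--case", case,
            "--profile", profile,
        ]
        for case, profile in itertools.product(CASES, PROFILES)
    ]


# Print the commands for review instead of executing them against a cluster.
for argv in matrix_commands():
    print(shlex.join(argv))
```

Passing each list to `subprocess.run(argv, check=True)` would execute the runs in sequence; printing first keeps the sketch safe to try without a live cluster.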