
GPU Inference Lab Experiments

Project-linked experiments that turn GPU serving and kernel questions into evidence-backed decisions.

16 experiments · 2 projects · Run-ready · 5 supported · 9 selected · 1 rejected · 0 pending · 1 blocked

Rows show the current proof, focus area, and decisions that still need stronger evidence.

Experiment

Memory pressure

KV Cache vs Concurrency

Purpose

Full delivery can hide queueing.

Focus
Concurrency · KV memory · Tail latency
Status
Run ready
Supported · Long-context knee
Rejected · FP8 KV on g4dn
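
The link between KV memory and concurrency in this experiment can be illustrated with a back-of-the-envelope calculation. The sketch below is not the lab's method: it assumes a standard attention layout in which each token stores one key and one value vector per layer per KV head, and the model shape and the 8 GiB KV budget are hypothetical values chosen only for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes: int, context_tokens: int,
                             per_token_bytes: int) -> int:
    """How many full-length sequences fit in the KV budget at once."""
    return kv_budget_bytes // (context_tokens * per_token_bytes)


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit KV.
per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)

# Hypothetical KV budget: 8 GiB left over after weights and activations.
budget = 8 * 1024**3

for ctx in (2048, 8192, 32768):
    fits = max_concurrent_sequences(budget, ctx, per_token)
    print(f"context={ctx:>6} tokens -> {fits} concurrent sequences")
```

Under these assumptions, concurrency falls in inverse proportion to context length, which is one plausible reason a long-context knee would appear. Halving `bytes_per_elem` (for example with an 8-bit KV cache) would double the count in this arithmetic, independent of whether a given GPU supports that format well.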
Experiment

Streaming latency

Prefill vs Decode Timing

Purpose

Streaming timing by request shape.

Focus
TTFT · Inter-token latency · Throughput
Status
Run ready
Selected report · Streaming split
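
The three focus metrics for this experiment can be derived from per-token arrival timestamps. The following is a minimal sketch under stated assumptions, not the lab's measurement harness: `streaming_metrics` is a hypothetical helper, timestamps are in seconds on a single clock, and TTFT is treated as a proxy for prefill time while the gaps between later tokens stand in for decode time.

```python
from statistics import mean


def streaming_metrics(request_sent_s: float, token_times_s: list[float]) -> dict:
    """Compute TTFT, mean inter-token latency, and tokens/sec for one request."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")

    # Time to first token: dominated by queueing plus prefill.
    ttft = token_times_s[0] - request_sent_s

    # Gaps between consecutive tokens: dominated by decode steps.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    inter_token = mean(gaps) if gaps else 0.0

    # End-to-end throughput as seen by the client.
    total = token_times_s[-1] - request_sent_s
    tokens_per_s = len(token_times_s) / total

    return {"ttft_s": ttft, "inter_token_s": inter_token, "tokens_per_s": tokens_per_s}


# Hypothetical trace: request sent at t=0, four tokens arrive afterwards.
print(streaming_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))
```

Reporting TTFT and inter-token latency separately keeps a long prompt with a short answer from being averaged together with a short prompt and a long answer.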
Experiment

Scheduler behavior

Batching Scheduler Tradeoffs

Purpose

Scheduler limits versus tail latency.

Focus
Batching · p99 latency · Tokens/sec
Status
Run ready
Selected report · Scheduler matrix
Experiment

Traffic shape

Request Pattern Utilization

Purpose

Same profile, different traffic outcome.

Focus
Delivery · Tail latency · Active concurrency
Status
Run ready
Selected report · Pattern matrix
Experiment

Capacity response

Autoscaling and Queueing Behavior

Purpose

Scale-from-zero timing and queue policy.

Focus
Scale-from-zero · Queue policy · Dropped work
Status
Run ready
Supported · Admission behavior
Experiment

Cost efficiency

Cost per Useful Work

Purpose

Cheap only counts when useful work passes.

Focus
Cost/request · Cost/token · SLO pass
Status
Run ready
Supported · Useful-work cost
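
One way to read "cheap only counts when useful work passes" is to divide spend by the number of requests that met the SLO rather than by all requests served. The sketch below assumes that reading; the function name, the hourly price, and the latency SLO are hypothetical and not taken from the lab's reports.

```python
def cost_per_useful_request(gpu_hourly_usd: float, duration_s: float,
                            latencies_s: list[float], slo_s: float) -> float:
    """Total GPU spend divided by the requests that finished within the SLO."""
    total_cost = gpu_hourly_usd * duration_s / 3600.0

    # Only requests at or under the latency SLO count as useful work.
    passed = sum(1 for latency in latencies_s if latency <= slo_s)
    if passed == 0:
        return float("inf")  # spend with no useful work has unbounded unit cost
    return total_cost / passed


# Hypothetical run: $3.60/hour GPU, 1000 s window, 2 s latency SLO.
latencies = [0.5, 1.0, 2.5, 3.0]
naive = (3.60 * 1000 / 3600.0) / len(latencies)
useful = cost_per_useful_request(3.60, 1000, latencies, slo_s=2.0)
print(f"cost per request: ${naive:.2f}, cost per SLO-passing request: ${useful:.2f}")
```

In this made-up run, half the requests miss the SLO, so the cost per useful request is twice the naive cost per request.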
Experiment

Quantization

FP4 Quantization Optimization

Purpose

BF16 vs NVFP4 vs SmoothQuant.

Focus
Accuracy recovery · Memory · Build cost
Status
Run ready
Blocked · Blackwell capacity

GPU inference evidence

Decisions live in the project decision record

Admission, cold start, active-pressure HPA, FP8 KV cache, and Blackwell FP4 readiness remain in the GPU Inference Lab decision record.
