GPU Inference Lab Experiments
Project-linked experiments that turn GPU serving and kernel questions into evidence-backed decisions.
Rows show the current proof, focus area, and decisions that still need stronger evidence.
Memory pressure
KV Cache vs Concurrency
Full delivery can hide queueing.
Streaming latency
Prefill vs Decode Timing
Streaming timing by request shape.
Scheduler behavior
Batching Scheduler Tradeoffs
Scheduler limits vs tail latency.
Traffic shape
Request Pattern Utilization
Same profile, different traffic outcome.
Capacity response
Autoscaling and Queueing Behavior
Scale-from-zero timing and queue policy.
Cost efficiency
Cost per Useful Work
Cheap only counts when useful work passes.
Quantization
FP4 Quantization Optimization
BF16 vs NVFP4 vs SmoothQuant.
GPU inference evidence
Decisions live in the project decision record
Decisions on admission, cold start, active-pressure HPA, FP8 KV cache, and Blackwell FP4 readiness are tracked in the GPU Inference Lab decision record.