ML Inference Performance Engineering

Tony Lee

I optimize GPU inference serving paths, from Kubernetes scheduling down to CUDA kernels, and back each change with reproducible latency, throughput, and cost measurements.

Latest writing

01. The KV Cache Is the Real Batch-Size Ceiling (Inference internals, 2026-06-21)
02. Continuous Batching Changes What Throughput Means (Inference internals, 2026-06-20)
03. Why Prefix Cache Hit Rate Is the First Number to Check (Inference internals, 2026-06-18)