Tony Lee

ML Inference Performance Engineering

I optimize GPU serving paths, from Kubernetes scheduling to CUDA kernels, backed by reproducible latency, throughput, and cost measurements.

View projects | View experiments | View resume

Latest writing

Read the blog →

Inference internals

The KV Cache Is the Real Batch-Size Ceiling

June 21, 2026

Inference internals

Continuous Batching Changes What Throughput Means

June 20, 2026

Inference internals

Why Prefix Cache Hit Rate Is the First Number to Check

June 18, 2026