Blog

Engineering writing on inference serving, scheduler behavior, and GPU kernel paths.

Inference internals · June 21, 2026 · 2 min read

The KV Cache Is the Real Batch-Size Ceiling

It is tempting to size a serving deployment by how many requests the GPU can compute in parallel. In practice the limit shows up earlier and somewhere else: the KV cache. Every concurrent sequence reserves memory that grows with its length, and once that pool is exhausted the scheduler stops admitting work no matter how much compute is idle.
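As a rough illustration of that ceiling, the sketch below estimates how many sequences fit in a fixed KV pool. It assumes a standard attention layout that stores one key and one value vector per layer, per KV head, per token. The model shape, the 16 GiB pool size, and the helper names are hypothetical values chosen for the arithmetic, not measurements from any particular deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for this estimate."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes fp16/bf16 cache entries


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def max_concurrent_sequences(shape: ModelShape, kv_pool_bytes: int,
                             tokens_per_sequence: int) -> int:
    # Whole sequences that fit if each one holds tokens_per_sequence tokens.
    per_sequence = kv_bytes_per_token(shape) * tokens_per_sequence
    return kv_pool_bytes // per_sequence


shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
pool = 16 * 1024**3  # assumed 16 GiB left for the KV cache after weights

print(kv_bytes_per_token(shape))                    # 131072 bytes (128 KiB)
print(max_concurrent_sequences(shape, pool, 4096))  # 32
print(max_concurrent_sequences(shape, pool, 8192))  # 16
```

Doubling the sequence length halves the number of sequences that fit, regardless of how much compute is available.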

KV cache · Batching · GPU memory · Scheduler

Inference internals · June 20, 2026 · 1 min read

Continuous Batching Changes What Throughput Means

Static batching reports a number that rarely survives contact with real traffic. Continuous batching admits and retires requests at each decoding step, so the throughput you measure depends on how arrivals overlap, not only on the hardware. This post explains why headline tokens/sec is the wrong thing to optimize in isolation.

Continuous batching · Scheduler · Throughput · vLLM

Inference internals · June 18, 2026 · 3 min read

Why Prefix Cache Hit Rate Is the First Number to Check

Before tuning batch size or concurrency, the KV prefix cache hit rate tells you whether you are even in the right problem space. This post walks through what the metric captures, why it compounds across requests, and how to read it before touching any other serving parameter.
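One way to make the metric concrete is the sketch below. It assumes a token-weighted definition (prompt tokens served from the prefix cache divided by total prompt tokens); serving frameworks may define or report the number differently. The function name and the sample request data are hypothetical.

```python
def prefix_cache_hit_rate(requests: list[tuple[int, int]]) -> float:
    """Token-weighted hit rate over (prompt_tokens, cached_prefix_tokens) pairs."""
    total_prompt = sum(prompt for prompt, _ in requests)
    total_cached = sum(cached for _, cached in requests)
    # No prompt tokens means nothing to hit; report 0.0 rather than divide by zero.
    return total_cached / total_prompt if total_prompt else 0.0


# Hypothetical traffic: four 1000-token prompts that share an 800-token prefix.
# The first request is cold; the next three reuse the cached prefix.
sample = [(1000, 0), (1000, 800), (1000, 800), (1000, 800)]
rate = prefix_cache_hit_rate(sample)
print(rate)  # 0.6
```

In this sample the single cold request drags the rate to 0.6, and every further request that shares the prefix moves it closer to 0.8. That is the sense in which the benefit accumulates across requests.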

Prefix cache · KV cache · SGLang · Serving