It is tempting to size a serving deployment by how many requests the GPU can compute in parallel. In practice the limit shows up earlier and somewhere else: the KV cache. Every concurrent sequence reserves memory that grows with its length, and once that pool is exhausted the scheduler stops admitting work no matter how much compute is idle.
Static batching reports a number that rarely survives contact with real traffic. Continuous batching reshapes the GPU's work queue request by request, so the throughput you measure depends entirely on how arrivals overlap. This post explains why the headline tokens/sec is the wrong thing to optimize in isolation.
Before tuning batch size or concurrency, the KV prefix cache hit rate tells you whether you are even in the right problem space. This post walks through what the metric captures, why it compounds across requests, and how to read it before touching any other serving parameter.