Projects
Evidence-backed GPU systems work across serving infrastructure and kernel optimization.
Serving infrastructure decisions and CUDA kernel optimization are split into separate projects, and each claim is tied to the experiments that support it.
GPU Inference Decision Lab
An EKS/vLLM lab that turns serving measurements into architecture decisions for admission, autoscaling, context limits, scheduling, and quantization.
EKS/vLLM measurements back the decisions on admission, long-context boundaries, scheduler defaults, and useful-work cost, as well as the rejection of FP8 KV cache.
- 100% queued delivery across burst and spike-to-zero admission runs
- Long-context knee at 1.20 req/s reproduced on repeat, with 36.8 s p95 queue delay
- FP8 KV cache rejected on the current g4dn/vLLM path
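
For readers unfamiliar with the metric, p95 queue delay is the value at or below which 95% of per-request queue waits fall. The sketch below computes it with the nearest-rank method. The function name and the sample values are hypothetical; they are not taken from the lab's tooling or data, and the lab may use a different percentile definition.

```python
def p95_nearest_rank(delays_s):
    """Return the 95th-percentile queue delay in seconds (nearest-rank method)."""
    if not delays_s:
        raise ValueError("need at least one sample")
    ordered = sorted(delays_s)
    # 1-based rank = ceil(0.95 * n), computed in integers to avoid float rounding.
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


# Hypothetical queue delays in seconds, for illustration only.
samples = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 20.0, 30.0, 35.0, 40.0]
p95 = p95_nearest_rank(samples)
```

With only ten samples the nearest-rank p95 is simply the largest value, which is why tail metrics like this are usually reported over many requests.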
CUDA Kernel Lab
A CUDA/Triton optimization lab organized around profile-driven kernel work for LLM-shaped primitives across A10G and H200.
RMSNorm fusion remains the strongest supported win, while H200 matmul autotune now bounds the Tensor Core gap against PyTorch/cuBLAS.
- 115 operator rows plus 27 decode replay rows on A10G
- H200 matmul rows place the best standard Triton kernel at about 88-90% of PyTorch/cuBLAS performance
- RMSNorm fp16 reached a 5.901x speedup over the PyTorch baseline
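
As context for the RMSNorm result, the operation scales each input row by the reciprocal of its root mean square and then applies a learned per-channel weight. The plain-Python reference below is a sketch of that definition only. It is not the lab's kernel, and the function name and the `eps` default are assumptions.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Pass 1: reduce over the row to get the mean of squares.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Pass 2: normalize and apply the per-channel weight.
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Hypothetical input row with unit weights.
row = rmsnorm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

The two passes over the row are what a fused kernel can combine into a single read of the input, which is one plausible source of speedup over an unfused baseline.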