Experiments
Project-linked catalogs that turn GPU serving and kernel questions into evidence-backed decisions.
Choose a project to browse its experiment table. Each run also has its own detail route covering run shape, evidence, and commands.
GPU Inference Decision Lab
An EKS/vLLM lab that turns serving measurements into architecture decisions for admission, autoscaling, context limits, scheduling, and quantization.
EKS/vLLM measurements back the decisions on admission, long-context boundaries, scheduler defaults, and useful-work cost, and support rejecting FP8 KV.
- 100% queued delivery across burst and spike-to-zero admission runs
- Long-context knee repeats at 1.20 req/s, with 36.8 s p95 queue delay
- FP8 KV rejected on the current g4dn/vLLM path
CUDA Kernel Lab
A CUDA/Triton optimization lab organized around profile-driven kernel work for LLM-shaped primitives across A10G and H200.
RMSNorm fusion remains the strongest supported win, while H200 matmul autotune now bounds the Tensor Core gap against PyTorch/cuBLAS.
- 115 operator rows plus 27 decode replay rows on A10G
- On H200 matmul rows, the best standard Triton result stays at around 88-90% of PyTorch/cuBLAS
- RMSNorm fp16 reached a 5.901x speedup over the PyTorch baseline