CUDA Kernel Lab Experiments
Project-linked experiments that turn GPU serving and kernel questions into evidence-backed decisions.
Rows show the current proof, focus area, and decisions that still need stronger evidence.
Kernel memory traffic
Memory Primitive Bandwidth
PyTorch still wins on simple memory traffic.
Kernel reductions
Reduction Strategy Comparison
Two Triton reduction strategies trail PyTorch.
Kernel fusion
Normalization Fusion
RMSNorm fusion stays strong across the shape sweep.
Kernel fusion
SwiGLU Elementwise Fusion
Fused SwiGLU is a clean 3x-class win.
Kernel fusion
Row Softmax Fusion
Current Triton softmax trails PyTorch.
Kernel tiling
Matmul Tile Sweep
Best Triton tile is measured but still below cuBLAS.
Tensor Core matmul
H200 Matmul Autotune
Standard tiled-dot closes most of the H200 gap, but persistent waves are not a win yet.
Decode scheduling
Decode Step Graph Replay
Resident-KV graph replay is now measured and caveated.
Profiler evidence
Profiler Validation
Profiler counters now explain the strongest win and active gaps.