CUDA Kernel Lab Decisions
Kernel optimization calls derived from CUDA/Triton benchmark and profiler evidence across A10G and H200, with caveats kept beside the experiment that produced them.
Supported: 2 | Measured: 1 | Caveated: 5 | Rejected: 1
Decision matrix
Kernel Optimization Decisions
Each decision is grouped by domain and links back to the experiments that produced the evidence.
Domain: Fusion wins
Fused RMSNorm
Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.
fp16 reached a 5.901x speedup over PyTorch at 4096x8192; the 4096x4096 rerun reached 5.599x.
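For reference, RMSNorm scales each row by the reciprocal of its root-mean-square and then applies a per-column weight. The pure-Python sketch below is a correctness baseline only, not the lab's Triton kernel; the function name and the `eps` default are assumptions. A fused kernel does the same arithmetic in one pass per row, without writing the squared or normalized intermediates back to memory.

```python
import math
from typing import List


def rmsnorm_row(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one row by its root-mean-square, then apply a per-column scale."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps guards against division by zero
    return [v * inv_rms * w for v, w in zip(x, weight)]


# With unit weights and eps=0, the output row has a mean square of 1.
y = rmsnorm_row([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
print(y)
```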
Fused SwiGLU
Treat elementwise fusion as a strong path before deeper matmul work.
fp32: 0.4547 ms for Triton vs 1.418 ms for PyTorch, a 3.119x speedup.
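The speedup figures on these cards are baseline latency divided by candidate latency. A minimal sketch using the SwiGLU fp32 timings quoted above; the helper name `speedup` is illustrative and not part of the lab's code.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


# SwiGLU fp32 timings from the card: PyTorch 1.418 ms, Triton 0.4547 ms.
swiglu_speedup = speedup(baseline_ms=1.418, candidate_ms=0.4547)
print(f"{swiglu_speedup:.3f}x")  # prints 3.119x
```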
Domain: Memory/reduction boundaries
Memory primitives
Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.
The Triton vector_add profile shows 91.58% DRAM throughput, which points to a bandwidth limit rather than an obvious block-size fix.
Next evidence
Prioritize fusion or reuse changes before wider simple-memory launch sweeps.
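A rough cross-check for a DRAM-throughput counter is to estimate achieved bandwidth from tensor size and latency. The sketch below does that for an elementwise add. The element count, latency, and peak bandwidth are placeholder values, not measurements from this lab; substitute the profiled numbers and the device's documented peak. A profiler's DRAM-throughput percentage may also be defined differently from this simple ratio.

```python
def vector_add_bandwidth_gbs(n_elements: int, bytes_per_element: int, latency_ms: float) -> float:
    """Effective bandwidth in GB/s for z = x + y: two reads and one write per element."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    bytes_moved = 3 * n_elements * bytes_per_element
    return bytes_moved / (latency_ms * 1e-3) / 1e9


def fraction_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Achieved bandwidth as a fraction of the assumed peak."""
    return achieved_gbs / peak_gbs


# Placeholder inputs (assumptions for illustration only).
achieved = vector_add_bandwidth_gbs(n_elements=1 << 26, bytes_per_element=4, latency_ms=1.5)
print(f"{achieved:.1f} GB/s, {fraction_of_peak(achieved, peak_gbs=600.0):.1%} of assumed peak")
```

When this ratio is already near 1, launch-parameter sweeps have little headroom left, and reducing the bytes moved (fusion, reuse) is the remaining lever.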
Reduction strategy
Keep reduction variants as an active track, but do not claim a PyTorch win yet.
Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.
Next evidence
Target launch/finalization overhead or a different reduction structure.
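For context on where finalization overhead can come from: a common block-wise reduction has each block write one partial result, then a second pass combines the partials. The source does not describe the internals of the Triton variants, so the structure below is an assumption, modeled in pure Python with hypothetical names.

```python
from typing import List, Sequence


def block_partials(values: Sequence[float], block_size: int) -> List[float]:
    """Stage 1: each 'block' reduces its own contiguous slice to one partial sum."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [sum(values[i:i + block_size]) for i in range(0, len(values), block_size)]


def two_stage_sum(values: Sequence[float], block_size: int = 1024) -> float:
    """Stage 2: a finalization pass reduces the per-block partials to a scalar."""
    return sum(block_partials(values, block_size))


data = [float(i) for i in range(10_000)]
print(len(block_partials(data, 1024)), two_stage_sum(data))
```

On a GPU the second stage is typically a separate launch or an atomic accumulation, and for small inputs that fixed cost can dominate the streaming pass.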
Row softmax
Do not present the current row-softmax kernel as a portfolio win.
PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32.
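Any retuned row-softmax kernel still needs a correctness baseline before timings are compared. Below is a minimal pure-Python reference for the numerically stable form, which subtracts the row maximum before exponentiating. It is a sketch for checking outputs, not the lab's kernel.

```python
import math
from typing import List


def row_softmax(rows: List[List[float]]) -> List[List[float]]:
    """Apply a numerically stable softmax independently to each row."""
    out = []
    for row in rows:
        row_max = max(row)  # subtracting the max avoids overflow in exp
        exps = [math.exp(v - row_max) for v in row]
        denom = sum(exps)
        out.append([e / denom for e in exps])
    return out


# The second row would overflow a naive exp(); the stable form handles it.
probs = row_softmax([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
print(probs)
```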
Domain: Matmul/Tensor Core gaps
Matmul tile sweep
Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.
Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.
Next evidence
Profile the current best tile before the next tuning pass.
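Matmul throughput is conventionally 2*M*N*K floating-point operations divided by latency. The sketch below shows that conversion and expresses the best Triton tile as a fraction of the cuBLAS range quoted above. The shape and latency passed to `matmul_tflops` are placeholders, since the card does not state them.

```python
def matmul_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """TFLOP/s for an (m x k) @ (k x n) matmul: one multiply and one add per MAC."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    flops = 2 * m * n * k
    return flops / (latency_ms * 1e-3) / 1e12


# Placeholder shape and latency (assumptions, not lab measurements).
example = matmul_tflops(4096, 4096, 4096, latency_ms=5.0)
print(f"{example:.2f} TFLOP/s")

# Figures from the card: best Triton tile vs the 30-31 TFLOP/s cuBLAS range.
triton_tflops = 25.74
ratios = [triton_tflops / cublas for cublas in (30.0, 31.0)]
print([f"{r:.1%}" for r in ratios])
```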
H200 matmul autotune
Keep H200 matmul as an active gap-closing track; standard tiled-dot is currently best, while persistent-wave scheduling is measured but not yet useful.
Clean focused H200 timing reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16, about 88-90% of PyTorch/cuBLAS; the latest persistent-wave sweep kept standard Triton ahead at 471.4 TFLOP/s bfloat16.
Next evidence
Run Tensor Core counter profiles on an H200 host with NVIDIA performance-counter access; profile runs on the current RunPod host hit permission failures.
Profiler-backed proof
Use counters to explain benchmark results instead of treating latency alone as the answer.
RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.
Domain: Decode replay caveats
Decode graph replay
Use resident-KV same-stream piecewise CUDA Graph replay as a measured synthetic upper bound, not an end-to-end serving claim.
Round 12 reached a fixed-shape p50 of 0.1375 ms and a dynamic p50 of about 0.156 ms, with a dynamic p95 of 0.230 ms.
Next evidence
Turn timing probes back on to explain the remaining p95 tail.
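The p50 and p95 figures are percentiles over per-iteration latencies. The sketch below uses the nearest-rank definition, which is an assumption (the lab's harness may interpolate differently), and runs on synthetic samples rather than lab data.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in ms (not lab data): mostly fast, with a slow tail.
latencies_ms = [0.10] * 90 + [0.20] * 10
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)
```

A tail that covers only a small share of iterations moves p95 without moving p50, which is why per-iteration timing probes are needed to attribute it.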