CUDA Kernel Lab Decisions

Kernel optimization calls derived from CUDA/Triton benchmark and profiler evidence across A10G and H200, with caveats kept beside the experiment that produced them.

Supported: 2

Measured: 1

Caveated: 5

Rejected: 1

Decision matrix

Kernel Optimization Decisions

Each decision is grouped by domain and links back to the experiments that produced the evidence.

Domain: Fusion wins

Fused RMSNorm

Supported

Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.

fp16 reached a 5.901x speedup over PyTorch at 4096x8192; the 4096x4096 rerun reached 5.599x.
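For readers who want the operation spelled out, the per-row math of RMSNorm can be sketched as below. This is a plain-Python reference of the arithmetic only, not the lab's Triton kernel; the function name `rms_norm_row` and the `eps` default are assumptions made for illustration. An unfused framework path typically materializes the squared values, the mean, and the scaled output as separate steps, while a fused kernel performs the reduction and the scaling in a single pass over the row.

```python
import math


def rms_norm_row(row, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight.

    `eps` is an assumed default; the source does not state the value used.
    """
    if len(row) != len(weight):
        raise ValueError("row and weight must have the same length")
    # Reduction step: mean of squares across the row.
    mean_sq = sum(x * x for x in row) / len(row)
    # Normalization factor shared by every element of the row.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Elementwise scale by the normalization factor and the learned weight.
    return [x * inv_rms * w for x, w in zip(row, weight)]


if __name__ == "__main__":
    print(rms_norm_row([3.0, 4.0], [1.0, 1.0]))
```

A reference like this is mainly useful as a correctness oracle when comparing a fused kernel's output against a tolerance.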

Fused SwiGLU

Supported

Treat elementwise fusion as a strong path before deeper matmul work.

fp32: 0.4547 ms (Triton) vs 1.418 ms (PyTorch), a 3.119x speedup.
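The speedup figure is the ratio of baseline latency to candidate latency. A minimal sketch of that arithmetic, using the two fp32 latencies reported above (the helper name `speedup` is hypothetical):

```python
def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # PyTorch baseline vs Triton candidate, fp32 SwiGLU latencies from the entry above.
    print(f"{speedup(1.418, 0.4547):.3f}x")  # prints "3.119x"
```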

Domain: Memory/reduction boundaries

Memory primitives

Caveated

Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.

Triton vector_add profile shows 91.58% DRAM throughput, so the kernel is already close to the memory-bandwidth limit and the gap is not an obvious block-size fix.

Next evidence

Prioritize fusion or reuse changes before wider simple-memory launch sweeps.
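One way to sanity-check a memory-bound reading without profiler counters is to estimate achieved bandwidth from latency. The sketch below assumes a vector add that reads two inputs and writes one output, so it moves three elements' worth of bytes per index; the function name and the example numbers are illustrative assumptions, not measurements from this lab.

```python
def vector_add_bandwidth_gbs(num_elements, bytes_per_element, latency_ms):
    """Estimate achieved memory bandwidth in GB/s for out = a + b.

    Assumes exactly two reads and one write per element and ignores
    cache effects, so treat the result as a rough estimate.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    bytes_moved = 3 * num_elements * bytes_per_element
    seconds = latency_ms * 1e-3
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical example: 1M fp32 elements in 0.1 ms.
    print(f"{vector_add_bandwidth_gbs(1_000_000, 4, 0.1):.1f} GB/s")
```

Dividing the result by the device's peak bandwidth gives a figure comparable in spirit to the profiler's DRAM throughput percentage, though the profiler counter remains the authoritative number.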

Reduction strategy

Caveated

Keep reduction variants as an active track, but do not claim a PyTorch win yet.

Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.

Next evidence

Target launch/finalization overhead or a different reduction structure.

Row softmax

Rejected

Do not present the current row-softmax kernel as a portfolio win.

PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32.
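For context, the operation under test is a softmax applied independently to each row. A minimal plain-Python reference of the numerically stable form is sketched below; it is not the Triton kernel that was benchmarked, and the name `softmax_row` is an assumption. Subtracting the row maximum before exponentiating avoids overflow, and the two reductions (max and sum) are what a fused kernel has to keep on chip.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over a single row."""
    if not row:
        raise ValueError("row must not be empty")
    # First reduction: row maximum, subtracted to keep exp() in range.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    # Second reduction: normalizing sum.
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(softmax_row([1.0, 2.0, 3.0]))
```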

Domain: Matmul/Tensor Core gaps

Matmul tile sweep

Caveated

Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.

Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.

Next evidence

Profile the current best tile before the next tuning pass.
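Matmul throughput figures like these are conventionally derived from latency by counting 2·M·N·K floating-point operations per matrix product. The sketch below shows that conversion plus the fraction-of-baseline arithmetic; the helper names and the 4096-cubed shape are illustrative assumptions, since the entry above does not state the benchmarked shape.

```python
def matmul_tflops(m, n, k, latency_ms):
    """Convert an (m x k) @ (k x n) matmul latency into TFLOP/s.

    Uses the conventional 2*m*n*k operation count (one multiply and
    one add per inner-product term).
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    flops = 2 * m * n * k
    return flops / (latency_ms * 1e-3) / 1e12


def fraction_of_baseline(candidate_tflops, baseline_tflops):
    """Return candidate throughput as a fraction of the baseline."""
    return candidate_tflops / baseline_tflops


if __name__ == "__main__":
    # Hypothetical shape and latency, for illustration only.
    print(f"{matmul_tflops(4096, 4096, 4096, 1.0):.2f} TFLOP/s")
    # Reported best Triton tile against both ends of the reported cuBLAS range.
    for baseline in (30.0, 31.0):
        print(f"{fraction_of_baseline(25.74, baseline):.1%} of {baseline} TFLOP/s")
```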

H200 matmul autotune

Caveated

Keep H200 matmul as an active gap-closing track; standard tiled-dot is currently best, while persistent-wave scheduling is measured but not yet useful.

Clean focused H200 timing reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16, about 88-90% of PyTorch/cuBLAS; the latest persistent-wave sweep kept standard Triton ahead at 471.4 TFLOP/s bfloat16.

Next evidence

Run Tensor Core counter profiles on an H200 host with NVIDIA performance-counter access; profiling on the current RunPod host fails with permission errors.

Profiler-backed proof

Measured

Use counters to explain benchmark results instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.

Domain: Decode replay caveats

Decode graph replay

Caveated

Use resident-KV same-stream piecewise CUDA Graph replay as a measured synthetic upper bound, not an end-to-end serving claim.

Round 12 reached 0.1375 ms p50 (fixed-shape) and about 0.156 ms p50 / 0.230 ms p95 (dynamic).

Next evidence

Turn timing probes back on to explain the remaining p95 tail.
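The p50 and p95 figures are percentiles over per-replay latency samples. A minimal sketch of a nearest-rank percentile is shown below; the source does not say which percentile definition the lab uses, so the method, the function name, and the sample values are all assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank definition: rank = ceil(pct/100 * n), 1-indexed.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up latencies in milliseconds, not data from the experiments.
    latencies_ms = [0.14, 0.13, 0.16, 0.15, 0.23]
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

With few samples, a single slow replay determines p95, which is one reason a tail figure benefits from per-stage timing probes before it is interpreted.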