CUDA Kernel Lab

A CUDA/Triton optimization lab organized around a measured loop: profile first, identify the bottleneck, optimize that bottleneck, then re-profile before claiming a win.

Kernel optimization · A10G/H200 benchmark evidence · 9 experiments

RMSNorm fusion remains the strongest supported win, while H200 matmul autotuning puts the best standard Triton rows at 88-90% of PyTorch/cuBLAS throughput.

At a glance

The lab moves from memory traffic and reductions into fusion, tiling, Tensor Cores, H200 matmul autotune, and decode-step graph replay with profiler-backed strategy comparisons.

Device

A10G + H200

Operator evidence from an AWS A10G now sits beside matmul/Tensor Core tuning on a RunPod H200.

Benchmark matrix

A10G + H200

115 operator rows plus 27 decode replay rows on A10G, with H200 autotune and persistent-wave matmul reports added.

Correctness

Selected rows pass

The selected A10G operator/decode rows and latest H200 matmul rows passed correctness checks.

Largest win

5.901x

Triton fused RMSNorm fp16 on the 4096x8192 shape versus the PyTorch baseline.

H200 matmul

88-90%

The best standard Triton tiled-dot rows still trail PyTorch/cuBLAS but reach 88-90% of its throughput on LLM-shaped GEMMs.

Profiler proof

90.91% DRAM

Nsight Compute shows the fused RMSNorm kernel at 90.91% DRAM throughput with 93.12% occupancy.

Decode replay

~0.156 ms

Dynamic same-stream piecewise CUDA Graph replay p50, with about 0.230 ms p95 across dense-bucket tail seeds.

Workflow

Optimization loop

Every kernel claim moves through the same loop: profile first, isolate the bottleneck, change only that path, then re-profile before calling it a win.

01

Profile first

Start from matched PyTorch/Triton rows, correctness checks, and Nsight counters when the benchmark is interesting enough to explain.

115-row strategy run, 27-row decode replay report, and compact profiler summaries

02

Identify bottleneck

Classify the limit as memory bandwidth, launch overhead, occupancy, register pressure, Tensor Core utilization, or traffic from unfused intermediates.

Memory primitives lead on DRAM, matmul is Tensor Core limited, decode shifts to launch and scheduling overhead

03

Optimize that bottleneck

Make one bounded change: fuse rows, change block size, compare reduction strategy, sweep matmul tiles, or reduce decode replay overhead.

RMSNorm and SwiGLU fusion win; resident decode replay trims hot-loop cost

04

Re-profile

Compare the next run against the same control, then label each claim as supported, caveated, rejected, or pending.

RMSNorm supported; softmax rejected; H200 matmul remains below PyTorch/cuBLAS; decode replay remains a synthetic upper bound

A10G/H200 evidence

Benchmark readout

The selected runs show where custom kernels are useful, where profiler counters explain the result, and where newer H200 matmul tuning remains measured but caveated.

Fused RMSNorm

Fusion removes enough framework and intermediate-tensor overhead to beat the PyTorch baseline across the measured shape sweep.

fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.

Supported

Fused SwiGLU

Elementwise fusion is a strong path before deeper matmul work.

fp32 0.4547 ms Triton vs 1.418 ms PyTorch; 3.119x speedup.

Supported
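
The 3.119x figure follows directly from the two latencies quoted above. A minimal sketch of that arithmetic, assuming speedup is defined as baseline latency divided by candidate latency (the helper name `speedup` is illustrative and not taken from the lab's code):

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


# Fused SwiGLU fp32 latencies reported in the card above.
pytorch_ms = 1.418
triton_ms = 0.4547

print(f"{speedup(pytorch_ms, triton_ms):.3f}x")
```

Dividing in this direction keeps values above 1.0 meaning "faster than baseline", which matches how the RMSNorm and SwiGLU rows are reported.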

Matmul tile sweep

Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.

Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.

Caveated
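
For readers who want to reproduce a TFLOP/s number from a measured latency, here is a small sketch. It assumes the conventional 2*M*N*K FLOP count for a dense matmul; the 4096-cubed shape and the 1.0 ms latency are hypothetical placeholders, not lab measurements. The ratio helper is then applied to the throughput figures quoted in the card above.

```python
def matmul_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Achieved TFLOP/s for an (m x k) @ (k x n) matmul.

    Assumes 2*m*n*k floating-point operations (one multiply and one add
    per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / (latency_ms * 1e-3) / 1e12


def fraction_of_baseline(candidate_tflops: float, baseline_tflops: float) -> float:
    """Candidate throughput as a fraction of the baseline throughput."""
    return candidate_tflops / baseline_tflops


# Hypothetical shape and latency, for illustration only.
print(f"{matmul_tflops(4096, 4096, 4096, 1.0):.2f} TFLOP/s")

# Figures from the card: best Triton tile vs the 30-31 TFLOP/s cuBLAS range.
for cublas_tflops in (30.0, 31.0):
    print(f"{fraction_of_baseline(25.74, cublas_tflops):.1%} of cuBLAS")
```

Under these assumptions the best Triton tile lands at roughly 83-86% of the PyTorch/cuBLAS range, which is consistent with treating the result as caveated instead of a win.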

H200 matmul autotune

Standard tiled-dot is the current H200 winner; persistent-wave scheduling is measured but not yet useful.

Focused H200 rows put best standard Triton at 470.7-471.4 TFLOP/s bf16, about 89% of PyTorch/cuBLAS; persistent waves stayed far behind standard.

Measured, caveated

Decode graph replay

Resident-KV same-stream piecewise CUDA Graph replay now has fixed-shape and dynamic-bucket evidence.

Round 12 reached 0.1375 ms fixed-shape p50 and roughly 0.156 ms dynamic p50 / 0.230 ms p95 with dense buckets and zero padding.

Measured, caveated
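
The p50 and p95 figures summarize a distribution of per-step replay latencies. As a sketch of how such summaries can be computed, the snippet below uses the nearest-rank percentile method; this is an assumption, since the lab's reports may use a different interpolation, and the sample latencies are synthetic.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic per-step latencies in ms with one slow outlier (not lab data).
latencies_ms = [0.12, 0.10, 0.11, 0.30, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

Reporting p95 alongside p50 matters for decode loops because a single slow step, such as the outlier in this synthetic list, barely moves the median but dominates the tail.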

Profiler-backed proof

Use counters to explain the benchmark instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.

Measured
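
A high DRAM-throughput reading suggests the kernel is bound by memory traffic, so a quick lower bound on bytes moved helps sanity-check the latency. The sketch below assumes a fused RMSNorm reads each input element once and writes each output element once, and it ignores the much smaller weight vector. The 0.5 ms latency is a hypothetical placeholder, not a measured value.

```python
def min_traffic_bytes(rows: int, cols: int, bytes_per_elem: int) -> int:
    """Lower bound on DRAM traffic: one full read of the input plus one full write of the output."""
    return 2 * rows * cols * bytes_per_elem


def effective_bandwidth_gbs(total_bytes: int, latency_ms: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for the given traffic and latency."""
    return total_bytes / (latency_ms * 1e-3) / 1e9


# 4096x8192 fp16 shape from the RMSNorm result; fp16 is 2 bytes per element.
traffic = min_traffic_bytes(4096, 8192, 2)
print(traffic, "bytes")  # 134217728 bytes, i.e. 128 MiB

# Hypothetical latency, for illustration only.
print(f"{effective_bandwidth_gbs(traffic, 0.5):.1f} GB/s")
```

Comparing the effective bandwidth against the device's peak memory bandwidth gives a rough cross-check on the profiler's DRAM throughput percentage.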