H200 Matmul Autotune
Which Triton tiled-dot and persistent-wave schedules get closest to the PyTorch/cuBLAS H200 matmul baseline for LLM-shaped GEMMs?
Project: CUDA Kernel Lab
Focus: Tensor Core matmul
Run shape: 3 cases / 3 profiles
Evidence: Selected report: H200 matmul gap
Why it matters
Compares H200 PyTorch/cuBLAS against repeated Triton tiled-dot candidates, then tests whether persistent resident-program waves improve the focused 512x11008x4096 shape.
H200 matmul is the Tensor Core track that determines whether custom Triton GEMMs can become more than a tuning exercise. The current evidence is valuable because it bounds the gap (standard Triton reaches roughly 88-90% of PyTorch/cuBLAS) and guards against overclaiming for the new persistent scheduler.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
The H200 matmul gap is bounded, but Triton is not yet a replacement win
Clean H200 timing runs show best standard Triton tiled-dot rows around 88-90% of PyTorch/cuBLAS, while the latest persistent-wave sweep improves over one-wave variants but stays far behind the standard schedule.
Clean focused bf16: 89.31% (470.7 Triton TFLOP/s vs 527.1 torch on 512x11008x4096)
Latest focused bf16: 89.41% (471.4 Triton TFLOP/s vs 527.3 torch in the persistent-wave sweep)
Persistent wave 2: 179.1 TFLOP/s (best bf16 persistent row, still roughly one third of torch)
H200 focused matmul gap
Dtype: bfloat16
Current call: standard tiled-dot leads Triton
Triton: 470.7 TFLOP/s
Torch: 527.1 TFLOP/s
Triton/Torch: 89.31%
Dtype: float16
Current call: standard tiled-dot leads Triton
Triton: 462.0 TFLOP/s
Torch: 520.4 TFLOP/s
Triton/Torch: 88.78%
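As a hedged illustration of how the Triton/Torch figure follows from the two throughput figures, the sketch below recomputes the ratio from the rounded TFLOP/s values in the rows above. The function name `ratio_percent` and the `rows` dictionary are hypothetical and not part of the benchmark code. Because the inputs are rounded, the recomputed percentage can differ from the reported one in the last digit.

```python
def ratio_percent(candidate_tflops: float, baseline_tflops: float) -> float:
    """Return candidate throughput as a percentage of the baseline."""
    return 100.0 * candidate_tflops / baseline_tflops


# Rounded values copied from the rows above: (Triton, Torch) in TFLOP/s.
rows = {
    "bfloat16": (470.7, 527.1),
    "float16": (462.0, 520.4),
}

for dtype, (triton_tflops, torch_tflops) in rows.items():
    pct = ratio_percent(triton_tflops, torch_tflops)
    gap = torch_tflops - triton_tflops
    print(f"{dtype}: {pct:.2f}% of torch, {gap:.1f} TFLOP/s behind")
```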
Persistent-wave check
Schedule: standard bf16
Call: current best
Best Triton: 471.4 TFLOP/s
Torch: 527.3 TFLOP/s
Boundary: 89.41% of torch
Schedule: persistent wave 2 bf16
Call: measured non-win
Best Triton: 179.1 TFLOP/s
Torch: 527.3 TFLOP/s
Boundary: below standard
Schedule: persistent waves 3-4 bf16
Call: measured non-win
Best Triton: 171.7-173.3 TFLOP/s
Torch: 527.3 TFLOP/s
Boundary: below wave 2
Evidence boundary
Use this as active gap-closing evidence, not a custom matmul win. The latest persistent-wave report is exploratory, and the canonical next step is a counter-enabled H200 profile for the selected winners.
- The clean 2026-05-25 timing-only run reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16 for standard Triton on 512x11008x4096.
- The latest persistent-wave sweep reached 471.4 TFLOP/s bfloat16 for standard Triton on the same focused shape, with all 60 correctness checks passing.
- The best persistent wave-2 rows reached about 179 TFLOP/s, improving over one-wave persistent rows but remaining far below the standard tiled-dot schedule.
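To make the persistent-wave idea concrete, here is a minimal pure-Python sketch of one plausible tile-to-program mapping. It assumes a persistent schedule launches `num_sms * waves` resident programs and each program walks the output tiles with a fixed stride. The function name, the 128x128 tile size, the 132-SM count, and reading the shape as a 512x11008 output are all illustrative assumptions, not values taken from the report or from the project's kernel.

```python
from math import ceil


def persistent_tile_assignment(m, n, block_m, block_n, num_sms, waves):
    """Map output tiles to persistent programs using a strided loop.

    Assumption: the grid is capped at num_sms * waves programs, and
    program p handles tiles p, p + grid, p + 2 * grid, and so on.
    """
    num_tiles = ceil(m / block_m) * ceil(n / block_n)
    num_programs = min(num_sms * waves, num_tiles)
    return {
        p: list(range(p, num_tiles, num_programs))
        for p in range(num_programs)
    }


# Hypothetical configuration: 128x128 tiles, 132 SMs, 2 waves per SM.
assignment = persistent_tile_assignment(
    m=512, n=11008, block_m=128, block_n=128, num_sms=132, waves=2
)
tiles_per_program = [len(tiles) for tiles in assignment.values()]
print(len(assignment), "programs,", sum(tiles_per_program), "tiles")
print("max tiles per program:", max(tiles_per_program))
print("min tiles per program:", min(tiles_per_program))
```

Under these assumed numbers the tiles do not divide evenly across programs, so some programs do more work than others. Whether that kind of imbalance contributes to the measured persistent-wave gap is not something this sketch can show; that is what the counter-enabled profile is for.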
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
H200 matmul autotune
Keep H200 matmul as an active gap-closing track; standard tiled-dot is currently best, while persistent-wave scheduling is measured but not yet useful.
Clean focused H200 timing reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16, about 88-90% of PyTorch/cuBLAS; the latest persistent-wave sweep kept standard Triton ahead at 471.4 TFLOP/s bfloat16.
Decision record
Run shape
H200 512x11008x4096 focused sweep plus broader 512x4096x11008, 512x11008x4096, and 4096x4096x4096 winner profiling shapes across float16 and bfloat16.
Cases
3 cases
h200-llm-down-512x11008x4096
focused LLM down-projection GEMM used for standard and persistent-wave comparison
h200-llm-up-512x4096x11008
asymmetric LLM up-projection GEMM from the clean winner profile sweep
h200-square-4096x4096x4096
square Tensor Core GEMM from the clean winner profile sweep
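For reference, the sketch below shows the standard conversion between a GEMM shape, a measured time, and TFLOP/s, using the three case shapes listed above. The `CASES` dictionary and function names are hypothetical, and the 100-microsecond timing is a made-up placeholder, not a measurement. The FLOP count is the usual 2 x M x N x K, which does not depend on which of the last two dimensions is read as N or K.

```python
# Case names and shapes copied from the list above.
CASES = {
    "h200-llm-down-512x11008x4096": (512, 11008, 4096),
    "h200-llm-up-512x4096x11008": (512, 4096, 11008),
    "h200-square-4096x4096x4096": (4096, 4096, 4096),
}


def matmul_flops(dims):
    """One multiply and one add per inner-product term: 2 * M * N * K."""
    a, b, c = dims
    return 2 * a * b * c


def tflops(dims, seconds):
    """Convert a measured kernel time in seconds to TFLOP/s."""
    return matmul_flops(dims) / seconds / 1e12


for name, dims in CASES.items():
    print(f"{name}: {matmul_flops(dims) / 1e9:.2f} GFLOP per call")

# Placeholder timing only: 100 microseconds is NOT a measured value.
placeholder_seconds = 100e-6
down = CASES["h200-llm-down-512x11008x4096"]
print(f"at 100 us: {tflops(down, placeholder_seconds):.1f} TFLOP/s")
```

The two LLM projection cases have the same FLOP count, so any throughput difference between them comes from the schedule and memory layout rather than the amount of arithmetic.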
Run profiles
3 profiles
torch-baseline
PyTorch/cuBLAS reference baseline on H200
triton-standard-tiled-dot
best measured standard Triton tiled-dot schedule
triton-persistent-waves-1-4
persistent tiled-dot schedule with one through four resident-program waves per SM
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Compute
Schedule
Next proof
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-autotune --input-dir experiments/results/runpod/20260526-h200-persistent-waves-073234 --dry-run
Example live command
./scripts/benchmark --run-id <run-id> --suite h200-matmul-autotune --matmul-autotune-schedules standard,persistent --matmul-autotune-persistent-waves 1,2,3,4 --with-profiling