Tony Lee

H200 Matmul Autotune

Which Triton tiled-dot and persistent-wave schedules get closest to the PyTorch/cuBLAS H200 matmul baseline for LLM-shaped GEMMs?

Project

CUDA Kernel Lab

Focus

Tensor Core matmul

Run shape

3 cases / 3 profiles

Evidence

Selected report: H200 matmul gap

Why it matters

Compares H200 PyTorch/cuBLAS against repeated Triton tiled-dot candidates, then tests whether persistent resident-program waves improve the focused 512x11008x4096 shape.

H200 matmul is the Tensor Core track that decides whether custom Triton GEMMs can become more than a tuning exercise. The current evidence is valuable for two reasons: it bounds the gap to PyTorch/cuBLAS, and it guards against overclaiming for the new persistent scheduler.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: H200 matmul gap

Latest reports: 2026-05-26

H200 matmul is bounded but not a replacement win

Clean H200 timing runs put the best standard Triton tiled-dot rows at about 88-90% of PyTorch/cuBLAS. The latest persistent-wave sweep improves on the one-wave variants but stays far behind the standard schedule.

Clean focused bf16

89.31%

470.7 Triton TFLOP/s vs 527.1 torch on 512x11008x4096

Latest focused bf16

89.41%

471.4 Triton TFLOP/s vs 527.3 torch in the persistent-wave sweep

Persistent wave 2

179.1 TFLOP/s

best bf16 persistent row, still roughly one third of torch

H200 focused matmul gap

Dtype: bfloat16

Current call: standard tiled-dot leads Triton

Triton: 470.7 TFLOP/s

Torch: 527.1 TFLOP/s

Triton/Torch: 89.31%

Dtype: float16

Current call: standard tiled-dot leads Triton

Triton: 462.0 TFLOP/s

Torch: 520.4 TFLOP/s

Triton/Torch: 88.78%
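The figures above can be cross-checked with the usual GEMM operation count. The sketch below is a minimal illustration, assuming the report counts 2·M·N·K floating-point operations per matmul (the common convention, which the source does not state). The helper names are hypothetical and are not part of the project's tooling.

```python
# Hypothetical helpers for cross-checking the reported throughput figures.

def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for one matmul, assuming one multiply plus one add per MAC."""
    return 2 * m * n * k


def implied_latency_us(tflops: float, flops: int) -> float:
    """Per-call latency in microseconds implied by a TFLOP/s figure."""
    return flops / (tflops * 1e12) * 1e6


def ratio_pct(triton_tflops: float, torch_tflops: float) -> float:
    """Triton throughput as a percentage of the torch baseline."""
    return 100.0 * triton_tflops / torch_tflops


# Focused shape; the product is the same whichever dimension is N or K.
FLOPS = gemm_flops(512, 11008, 4096)

# (dtype, Triton TFLOP/s, torch TFLOP/s) as reported in the clean focused run.
ROWS = [("bfloat16", 470.7, 527.1), ("float16", 462.0, 520.4)]

for dtype, triton_tflops, torch_tflops in ROWS:
    print(
        f"{dtype}: {ratio_pct(triton_tflops, torch_tflops):.2f}% of torch, "
        f"Triton ~{implied_latency_us(triton_tflops, FLOPS):.1f} us, "
        f"torch ~{implied_latency_us(torch_tflops, FLOPS):.1f} us"
    )
```

Ratios recomputed from the rounded TFLOP/s values can differ from the reported percentages in the last digit, presumably because the report divides unrounded measurements.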

Persistent-wave check

Schedule: standard bf16

Call: current best

Best Triton: 471.4 TFLOP/s

Torch: 527.3 TFLOP/s

Boundary: 89.41%

Schedule: persistent wave 2 bf16

Call: measured non-win

Best Triton: 179.1 TFLOP/s

Torch: 527.3 TFLOP/s

Boundary: below standard

Schedule: persistent waves 3-4 bf16

Call: measured non-win

Best Triton: 171.7-173.3 TFLOP/s

Torch: 527.3 TFLOP/s

Boundary: below wave 2

Evidence boundary

Treat this as evidence for an active gap-closing track, not as a custom matmul win. The latest persistent-wave report is exploratory; the next step is a counter-enabled H200 profile of the selected winners.

  • The clean 2026-05-25 timing-only run reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16 for standard Triton on 512x11008x4096.
  • The latest persistent-wave sweep reached 471.4 TFLOP/s bfloat16 for standard Triton on the same focused shape, with all 60 correctness checks passing.
  • The best persistent wave-2 rows reached about 179 TFLOP/s, improving over one-wave persistent rows but remaining far below the standard tiled-dot schedule.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Caveated

Matmul/Tensor Core gaps

H200 matmul autotune

Keep H200 matmul as an active gap-closing track; standard tiled-dot is currently best, while persistent-wave scheduling is measured but not yet useful.

Clean focused H200 timing reached 470.7 TFLOP/s bfloat16 and 462.0 TFLOP/s float16, about 88-90% of PyTorch/cuBLAS; the latest persistent-wave sweep kept standard Triton ahead at 471.4 TFLOP/s bfloat16.

Decision record

Run shape

Focused H200 sweep on 512x11008x4096, plus broader winner-profiling runs on 512x4096x11008, 512x11008x4096, and 4096x4096x4096, all in float16 and bfloat16.

Cases

3 cases

h200-llm-down-512x11008x4096

focused LLM down-projection GEMM used for standard and persistent-wave comparison

512x11008x4096, float16 and bfloat16

h200-llm-up-512x4096x11008

asymmetric LLM up-projection GEMM from the clean winner profile sweep

512x4096x11008, float16 and bfloat16

h200-square-4096x4096x4096

square Tensor Core GEMM from the clean winner profile sweep

4096x4096x4096, float16 and bfloat16

Run profiles

3 profiles

torch-baseline

PyTorch/cuBLAS reference baseline on H200

triton-standard-tiled-dot

best measured standard Triton tiled-dot schedule

triton-persistent-waves-1-4

persistent tiled-dot schedule with one through four resident-program waves per SM
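The persistent profile can be pictured as launching a fixed number of resident programs that each loop over several output tiles, instead of launching one program per tile. The pure-Python sketch below models only that tile-to-program assignment. The 132-SM count, the 128x128 tile size, and the strided assignment are illustrative assumptions, not the tuned configuration or the scheduler used in these runs, and the function name is hypothetical.

```python
def persistent_tile_assignment(
    m: int, n: int, block_m: int, block_n: int, num_sms: int, waves: int
) -> list[list[int]]:
    """Return, for each resident program, the linear tile ids it would process.

    Assumes a simple strided loop: program `pid` handles tiles
    pid, pid + P, pid + 2P, ... where P is the number of launched programs.
    """
    tiles_m = (m + block_m - 1) // block_m  # ceil division
    tiles_n = (n + block_n - 1) // block_n
    num_tiles = tiles_m * tiles_n
    # Never launch more programs than there are tiles to process.
    num_programs = min(num_sms * waves, num_tiles)
    return [list(range(pid, num_tiles, num_programs)) for pid in range(num_programs)]


# Focused shape output grid (512 x 11008) with assumed 128x128 tiles.
for waves in (1, 2, 3, 4):
    plan = persistent_tile_assignment(512, 11008, 128, 128, num_sms=132, waves=waves)
    per_program = sorted({len(tiles) for tiles in plan})
    print(f"waves={waves}: {len(plan)} programs, tiles per program in {per_program}")
```

The sketch says nothing about why a given wave count is faster or slower; that question is what the counter-enabled profile is meant to answer.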

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Compute

TFLOP/s, p50 latency, Triton/Torch %

Schedule

standard tiled-dot, persistent waves, group_m

Next proof

Tensor Core counters, profile-counter access, winner stability
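The group_m entry in the Schedule group presumably refers to the grouped program ordering known from the Triton matmul tutorial, where consecutive program ids walk down a band of group_m tile rows before moving to the next tile column, so that nearby programs reuse the same input tiles from L2 cache. That reading is an assumption; the source does not define group_m. A minimal pure-Python sketch of the index mapping, with a hypothetical function name:

```python
def grouped_tile(pid: int, num_pid_m: int, num_pid_n: int, group_m: int) -> tuple[int, int]:
    """Map a linear program id to (tile_row, tile_col) using grouped ordering."""
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last group may hold fewer than group_m rows.
    group_size_m = min(num_pid_m - first_pid_m, group_m)
    pid_in_group = pid % num_pid_in_group
    pid_m = first_pid_m + (pid_in_group % group_size_m)
    pid_n = pid_in_group // group_size_m
    return pid_m, pid_n


# Assumed 4 x 86 tile grid (512 x 11008 output with 128x128 tiles), group_m = 2.
for pid in range(6):
    print(pid, grouped_tile(pid, num_pid_m=4, num_pid_n=86, group_m=2))
```

With group_m = 1 the mapping reduces to plain row-major order, which makes it a convenient baseline when sweeping this parameter.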

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-autotune --input-dir experiments/results/runpod/20260526-h200-persistent-waves-073234 --dry-run

Example live command

./scripts/benchmark --run-id <run-id> --suite h200-matmul-autotune --matmul-autotune-schedules standard,persistent --matmul-autotune-persistent-waves 1,2,3,4 --with-profiling