Tony Lee

Matmul Tile Sweep

Which Triton tile and launch configuration gets closest to the PyTorch/cuBLAS matmul baseline?

Project

CUDA Kernel Lab

Focus

Kernel tiling

Run shape

1 case / 3 profiles

Evidence

Selected report: Tile sweep

Why it matters

Compares a focused set of float16 tile shapes, warp counts, and pipeline stages, together with Tensor Core profiler counters, for a 1024x1024x1024 matmul.

Matmul is the bridge from memory traffic and fusion into the compute path that dominates transformer inference; the current evidence is useful because it shows where the Triton kernel still trails the library baseline.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Tile sweep

Latest reports: 2026-05-21

Triton tiling is active but not yet a cuBLAS win

In the strategy run, the best Triton tile reached 25.74 TFLOP/s, while PyTorch/cuBLAS stayed around 30-31 TFLOP/s on the same shape on the A10G.

Best Triton tile

25.74 TFLOP/s

128x64x32, 8 warps, 3 stages, tf32 input precision

PyTorch/cuBLAS

30-31 TFLOP/s

same 1024x1024x1024 float16 shape

Profiled tile

45.19% Tensor Core

22.49% occupancy, 80 registers/thread, 16 KiB dynamic shared memory

Evidence boundary

This is progress evidence for the tiling track, not a claim that the custom matmul should replace the library baseline.
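The throughput figures above can be turned back into per-call latency and a relative gap. The sketch below assumes the usual 2*M*N*K FLOP count for a matmul (one multiply and one add per inner-product term); the report does not state which convention it used, and the helper name is hypothetical.

```python
# Worked arithmetic for the reported throughput figures.
# Assumption: FLOP count is 2*M*N*K; the report does not state its convention.
M = N = K = 1024
FLOP_PER_MATMUL = 2 * M * N * K


def implied_latency_us(tflops: float) -> float:
    """Latency in microseconds implied by a throughput in TFLOP/s."""
    return FLOP_PER_MATMUL / (tflops * 1e12) * 1e6


triton_tflops = 25.74          # best Triton tile, from the report
cublas_tflops = (30.0, 31.0)   # reported PyTorch/cuBLAS range

print(f"Triton: {implied_latency_us(triton_tflops):.1f} us per call")
for t in cublas_tflops:
    print(
        f"cuBLAS at {t:.0f} TFLOP/s: {implied_latency_us(t):.1f} us per call, "
        f"Triton reaches {triton_tflops / t:.1%} of that throughput"
    )
```

Under that convention, 25.74 TFLOP/s corresponds to roughly 83 microseconds per call, and the best Triton tile sits about 14 to 17 percent below the baseline range.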

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Caveated

Matmul/Tensor Core gaps

Matmul tile sweep

Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.

Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.

Decision record

Measured

Matmul/Tensor Core gaps

Profiler-backed proof

Use counters to explain benchmark results instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.

Decision record

Run shape

1024x1024x1024 float16 matmul tile sweep with block M/N/K, num_warps, num_stages, and input_precision variants.
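A sweep over these parameters can be enumerated as a plain Cartesian product. This is a minimal sketch, not the project's sweep code: the candidate values are hypothetical (the report does not list the full grid), and the only constraint applied is that the two reported Triton configurations fall inside it.

```python
from itertools import product

# Hypothetical candidate values; the report does not list the full grid.
SWEEP = {
    "block_m": (64, 128),
    "block_n": (64, 128),
    "block_k": (32, 64),
    "num_warps": (4, 8),
    "num_stages": (2, 3, 4),
    "input_precision": ("tf32", "ieee"),
}


def sweep_configs(space=SWEEP):
    """Return every combination of the sweep parameters as a dict."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = sweep_configs()

# The best tile named in the report, used here only as a membership check.
best_reported = {
    "block_m": 128, "block_n": 64, "block_k": 32,
    "num_warps": 8, "num_stages": 3, "input_precision": "tf32",
}
print(len(configs), best_reported in configs)
```

Each dict would then be passed to whatever launches the kernel; keeping the grid as data makes it easy to log exactly which variants a run covered.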

Cases

1 case

matmul-1024x1024x1024

baseline square matmul shape for tile strategy work

M=N=K=1024, float16

Run profiles

3 profiles

torch-baseline

PyTorch matmul baseline

triton-best-tile

best measured Triton tile: 128x64x32, 8 warps, 3 stages, tf32 input precision

triton-profiled-tile

profiled Triton tile: 64x64x32, 4 warps, 3 stages
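The two Triton tile shapes imply different per-stage operand footprints and launch grid sizes. The sketch below is arithmetic only and rests on assumptions the report does not confirm: each pipeline stage holds one float16 A tile (BLOCK_M x BLOCK_K) and one float16 B tile (BLOCK_K x BLOCK_N), and one program instance computes one output tile. The function name is hypothetical.

```python
import math

BYTES_PER_FP16 = 2


def tile_footprint(block_m, block_n, block_k, m=1024, n=1024):
    """Per-stage operand bytes and program count for one tile shape.

    Assumes one A tile and one B tile per stage, and one program per output tile.
    """
    a_bytes = block_m * block_k * BYTES_PER_FP16  # A tile: BLOCK_M x BLOCK_K
    b_bytes = block_k * block_n * BYTES_PER_FP16  # B tile: BLOCK_K x BLOCK_N
    programs = math.ceil(m / block_m) * math.ceil(n / block_n)
    return {"stage_bytes": a_bytes + b_bytes, "programs": programs}


profiled = tile_footprint(64, 64, 32)   # triton-profiled-tile
best = tile_footprint(128, 64, 32)      # triton-best-tile
print("profiled:", profiled)
print("best:    ", best)
```

Under these assumptions the profiled 64x64x32 tile needs 8 KiB of operands per stage, so the reported 16 KiB of dynamic shared memory equals two such buffers; how the compiler maps 3 stages onto buffers is not stated in the report. The larger 128x64x32 tile halves the number of program instances for this shape.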

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Compute

TFLOP/s, p50 latency, speedup vs PyTorch

Tile strategy

block M, block N, block K, shared memory

Profiler

Tensor Core utilization, occupancy, register pressure
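The compute metrics can all be derived from raw per-call timings. The following is a minimal sketch under stated assumptions: the function name and sample values are hypothetical, p50 is taken as the median of the samples, TFLOP/s uses the 2*M*N*K FLOP count, and speedup is defined as baseline p50 divided by candidate p50 (so values below 1 mean slower than PyTorch). The project's own benchmark may define these differently.

```python
from statistics import median


def summarize(samples_ms, baseline_ms, m=1024, n=1024, k=1024):
    """Derive p50 latency, TFLOP/s, and speedup vs the baseline from timings in ms."""
    p50_ms = median(samples_ms)
    flop = 2 * m * n * k                       # assumed FLOP convention
    tflops = flop / (p50_ms * 1e-3) / 1e12     # FLOP per second, scaled to tera
    speedup = median(baseline_ms) / p50_ms     # > 1 means faster than the baseline
    return {"p50_ms": p50_ms, "tflops": tflops, "speedup_vs_pytorch": speedup}


# Hypothetical timing samples, for illustration only.
triton_samples = [0.085, 0.083, 0.084]
torch_samples = [0.071, 0.069, 0.070]

result = summarize(triton_samples, torch_samples)
print(result)
```

Using the median rather than the mean keeps a single slow call, such as a warm-up or a clock ramp, from skewing the reported latency.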

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-matmul --backend all --device cuda --m 1024 --n 1024 --k 1024 --dtype float16

Example live command

./scripts/benchmark --run-id <run-id> --include-matmul-sweep --with-profiling