Matmul Tile Sweep
Which Triton tile and launch configuration gets closest to the PyTorch/cuBLAS matmul baseline?
Project
CUDA Kernel Lab
Focus
Kernel tiling
Run shape
1 case / 3 profiles
Evidence
Selected report: Tile sweep
Why it matters
Compares a focused set of float16 tile shapes, warp counts, and pipeline stages, plus Tensor Core profiler counters, for a 1024x1024x1024 matmul.
Matmul is the bridge from memory traffic and fusion into the compute path that dominates transformer inference; the current evidence is useful because it shows where the Triton kernel still falls short of the library baseline.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Triton tiling is active but not yet a cuBLAS win
The strategy run found a best Triton tile at 25.74 TFLOP/s, while PyTorch/cuBLAS stayed around 30-31 TFLOP/s for the same shape on the A10G, so the best tile reaches roughly 83-86% of the baseline.
Best Triton tile
25.74 TFLOP/s
128x64x32, 8 warps, 3 stages, tf32 input precision
PyTorch/cuBLAS
30-31 TFLOP/s
same 1024x1024x1024 float16 shape
Profiled tile
45.19% Tensor Core
22.49% occupancy, 80 registers/thread, 16 KiB dynamic shared memory
Evidence boundary
This is progress evidence for the tiling track, not a claim that the custom matmul should replace the library baseline.
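To relate the throughput figures above to kernel time, here is a minimal sketch. It assumes the report counts 2*M*N*K floating-point operations per matmul, which is the usual convention but is not confirmed by this page. The function names are illustrative and do not come from the project.

```python
# Sketch: convert the reported TFLOP/s figures into implied kernel times.
# Assumption: throughput uses the conventional 2*M*N*K FLOP count
# (one multiply and one add per inner-product term).

def matmul_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for an (m x k) @ (k x n) matmul."""
    return 2 * m * n * k


def implied_time_us(tflops: float, m: int, n: int, k: int) -> float:
    """Kernel time in microseconds implied by a throughput in TFLOP/s."""
    return matmul_flops(m, n, k) / (tflops * 1e12) * 1e6


M = N = K = 1024
TRITON_BEST_TFLOPS = 25.74          # best Triton tile from the report
CUBLAS_TFLOPS_RANGE = (30.0, 31.0)  # PyTorch/cuBLAS range from the report

if __name__ == "__main__":
    print(f"Triton best tile: {implied_time_us(TRITON_BEST_TFLOPS, M, N, K):.2f} us")
    for baseline in CUBLAS_TFLOPS_RANGE:
        share = TRITON_BEST_TFLOPS / baseline
        print(
            f"cuBLAS at {baseline:.0f} TFLOP/s: "
            f"{implied_time_us(baseline, M, N, K):.2f} us, "
            f"Triton reaches {share:.1%} of it"
        )
```

At this problem size each run lasts only tens of microseconds, so launch overhead and timer resolution can plausibly account for part of the spread in the baseline range.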
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Matmul tile sweep
Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.
Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.
Decision record
Profiler-backed proof
Use counters to explain benchmark results instead of treating latency alone as the answer.
RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.
Decision record
Run shape
Run shape
1024x1024x1024 float16 matmul tile sweep with block M/N/K, num_warps, num_stages, and input_precision variants.
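One way to enumerate a sweep over these parameters is sketched below. The candidate values are hypothetical and are not the project's actual search space; only the 128x64x32 / 8 warps / 3 stages / tf32 configuration and the 64x64x32 / 4 warps / 3 stages configuration come from this report. The "ieee" precision value is assumed to be a valid alternative to "tf32" in the Triton version used.

```python
from itertools import product

# Hypothetical candidate values for illustration only; the report does not
# list the real search space.
BLOCK_M = (64, 128)
BLOCK_N = (64, 128)
BLOCK_K = (32, 64)
NUM_WARPS = (4, 8)
NUM_STAGES = (2, 3, 4)
INPUT_PRECISION = ("ieee", "tf32")  # assumed option names


def sweep_configs():
    """Yield one dict per combination of tile shape and launch parameters."""
    for bm, bn, bk, warps, stages, precision in product(
        BLOCK_M, BLOCK_N, BLOCK_K, NUM_WARPS, NUM_STAGES, INPUT_PRECISION
    ):
        yield {
            "block_m": bm,
            "block_n": bn,
            "block_k": bk,
            "num_warps": warps,
            "num_stages": stages,
            "input_precision": precision,
        }


configs = list(sweep_configs())

# The best measured tile named in the report.
best_reported = {
    "block_m": 128,
    "block_n": 64,
    "block_k": 32,
    "num_warps": 8,
    "num_stages": 3,
    "input_precision": "tf32",
}

if __name__ == "__main__":
    print(f"{len(configs)} candidate configurations")
    print("best reported tile in sweep:", best_reported in configs)
```

A full cross product grows quickly, which is one reason to keep the sweep focused on a small set of candidate values.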
Cases
1 case
matmul-1024x1024x1024
baseline square matmul shape for tile strategy work
Run profiles
3 profiles
torch-baseline
PyTorch matmul baseline
triton-best-tile
best measured Triton tile: 128x64x32, 8 warps, 3 stages, tf32 input precision
triton-profiled-tile
profiled Triton tile: 64x64x32, 4 warps, 3 stages
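The two Triton profiles differ in how many output tiles they launch and how much operand data each pipeline stage holds. The sketch below works through that arithmetic. It assumes one program per output tile and float16 operands (2 bytes per element); the launch layout is an assumption, not something this page states, and the helper name is illustrative.

```python
def tile_summary(m, n, block_m, block_n, block_k, bytes_per_elem=2):
    """Return (programs, bytes_per_stage) for one tile configuration.

    programs: output tiles, assuming one program per (block_m x block_n) tile.
    bytes_per_stage: one A tile (block_m x block_k) plus one B tile
    (block_k x block_n) at the given element size.
    """
    tiles_m = -(-m // block_m)  # ceiling division
    tiles_n = -(-n // block_n)
    programs = tiles_m * tiles_n
    bytes_per_stage = (block_m * block_k + block_k * block_n) * bytes_per_elem
    return programs, bytes_per_stage


M = N = 1024
PROFILES = {
    "triton-best-tile": (128, 64, 32),
    "triton-profiled-tile": (64, 64, 32),
}

if __name__ == "__main__":
    for name, (bm, bn, bk) in PROFILES.items():
        programs, stage_bytes = tile_summary(M, N, bm, bn, bk)
        print(f"{name}: {programs} programs, {stage_bytes // 1024} KiB per stage")
```

For the profiled 64x64x32 tile this gives 8 KiB of operand data per stage, so the reported 16 KiB of dynamic shared memory equals two such buffers. Whether that reflects how the compiler sizes its pipeline buffers is not something the report confirms.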
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Compute
Tile strategy
Profiler
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-matmul --backend all --device cuda --m 1024 --n 1024 --k 1024 --dtype float16
Example live command
./scripts/benchmark --run-id <run-id> --include-matmul-sweep --with-profiling