Normalization Fusion
How much does a fused Triton RMSNorm or LayerNorm kernel move latency versus the PyTorch baseline?
Project
CUDA Kernel Lab
Focus
Kernel fusion
Run shape
4 cases / 3 profiles
Evidence
Supported: RMSNorm fusion
Why it matters
Compares PyTorch and Triton fused normalization kernels, then checks whether the RMSNorm fp16 win holds across hidden-size shapes.
Normalization is a realistic LLM primitive where fusion can remove expensive intermediate work; the shape sweep makes the win harder to dismiss as one lucky point.
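To make "intermediate work" concrete, here is a minimal pure-Python sketch of RMSNorm written two ways. It assumes the baseline composes the operation from separate reduction and elementwise steps, which the report does not confirm. The function names and the `eps` default are illustrative and are not taken from the project code.

```python
import math

def rmsnorm_unfused(rows, weight, eps=1e-6):
    """RMSNorm as a chain of separate steps, each producing a full intermediate."""
    squared = [[v * v for v in row] for row in rows]        # intermediate 1: x^2
    mean_sq = [sum(row) / len(row) for row in squared]      # intermediate 2: mean(x^2) per row
    inv_rms = [1.0 / math.sqrt(m + eps) for m in mean_sq]   # intermediate 3: 1 / rms per row
    normed = [[v * r for v in row]                          # intermediate 4: x / rms
              for row, r in zip(rows, inv_rms)]
    return [[v * w for v, w in zip(row, weight)] for row in normed]

def rmsnorm_fused(rows, weight, eps=1e-6):
    """Same math, one pass per row: reduce, then scale and emit the output directly."""
    out = []
    for row in rows:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out
```

Both functions return the same values; the fused form simply never stores the squared or normalized rows. On a GPU those stored intermediates become extra memory traffic and kernel launches, which is the overhead a single fused kernel is meant to avoid.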
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
RMSNorm shape sweep keeps fusion as the largest win
Triton fused RMSNorm produced the largest speedups in the A10G strategy run and stayed ahead across the measured fp16 shape sweep.
RMSNorm fp16 max
5.901x
4096x8192, 0.3103 ms Triton vs 1.831 ms PyTorch
RMSNorm profile
90.91% DRAM
4096x4096 fp16 with 93.12% occupancy
LayerNorm fp32
1.379x
0.3133 ms Triton vs 0.4321 ms PyTorch
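The headline speedups can be re-derived from the latencies quoted above as baseline time divided by Triton time. This sketch only checks that arithmetic; the `speedup` helper is illustrative and not part of the project tooling.

```python
def speedup(baseline_ms, candidate_ms):
    """Speedup factor: how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms

# Latencies copied from the result cards above (PyTorch first, Triton second).
rmsnorm_fp16 = speedup(1.831, 0.3103)     # RMSNorm fp16, 4096x8192
layernorm_fp32 = speedup(0.4321, 0.3133)  # LayerNorm fp32

print(f"RMSNorm fp16 4096x8192: {rmsnorm_fp16:.3f}x")
print(f"LayerNorm fp32:         {layernorm_fp32:.3f}x")
```

To three decimals the ratios match the reported 5.901x and 1.379x, even though the displayed latencies are themselves rounded.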
RMSNorm fp16 shape sweep
Shape: 512x1024
Winner: Triton
Triton p50: 0.05018 ms
Speedup: 2.367x
Call: rerun noise
Shape: 2048x4096
Winner: Triton
Triton p50: 0.1024 ms
Speedup: 4.76x
Call: supported
Shape: 4096x8192
Winner: Triton
Triton p50: 0.3103 ms
Speedup: 5.901x
Call: supported
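One way to read the sweep is to convert each Triton p50 into an effective memory bandwidth. The sketch below assumes the kernel reads each fp16 input element once and writes each output element once, and it ignores the weight vector and any cache effects, so treat the numbers as rough estimates and not as profiler measurements.

```python
BYTES_PER_FP16 = 2

# (rows, cols, Triton p50 in ms), copied from the sweep rows above.
SWEEP = [(512, 1024, 0.05018), (2048, 4096, 0.1024), (4096, 8192, 0.3103)]

def effective_gbps(rows, cols, latency_ms, bytes_per_elem=BYTES_PER_FP16):
    """Estimated GB/s, assuming exactly one read and one write per element."""
    bytes_moved = 2 * rows * cols * bytes_per_elem
    return bytes_moved / (latency_ms * 1e-3) / 1e9

for rows, cols, ms in SWEEP:
    print(f"{rows}x{cols}: {effective_gbps(rows, cols, ms):.1f} GB/s")
```

Under these assumptions the three rows come out at roughly 42, 328, and 433 GB/s. The smallest shape moves far less data per unit time, which would be consistent with fixed launch overhead dominating its latency and may help explain why that row was flagged as noisy.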
Evidence boundary
The two smallest shape-sweep rows were noisy enough that they need a rerun before any precise claim is made; the larger rows and the profiler counters support the fusion direction.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Fused RMSNorm
Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.
fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.
Decision record
Profiler-backed proof
Use counters to explain benchmark results instead of treating latency alone as the answer.
RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.
Decision record
Run shape
RMSNorm fp16 shape sweep from 512x1024 through 4096x8192, plus 4096x4096 LayerNorm and RMSNorm comparisons.
Cases
4 cases
rmsnorm-512x1024-float16
small RMSNorm shape sweep point
rmsnorm-2048x4096-float16
mid-size RMSNorm shape sweep point
rmsnorm-4096x4096-float16
RMSNorm forward pass over LLM-shaped rows
layernorm-4096x4096-float32
LayerNorm forward pass over LLM-shaped rows
Run profiles
3 profiles
torch-baseline
PyTorch normalization baseline
triton-fused-rmsnorm
single fused RMSNorm Triton kernel
triton-fused-layernorm
single fused LayerNorm Triton kernel
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Throughput
Validation
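For the latency group, the sweep reports a p50, meaning the median over repeated timed runs. Here is a minimal sketch of such a measurement loop; the helper names, warmup count, and iteration count are assumptions and do not describe the project's actual harness. A host-side timer like this only reflects GPU work if the timed callable waits for the device to finish (for example by calling `torch.cuda.synchronize()` inside it); otherwise it measures launch time alone.

```python
import statistics
import time

def p50_ms(samples_ms):
    """Median of latency samples given in milliseconds."""
    return statistics.median(samples_ms)

def measure_p50_ms(fn, warmup=10, iters=100):
    """Time fn() repeatedly and return the median latency in milliseconds."""
    for _ in range(warmup):  # untimed warmup runs absorb one-time setup costs
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return p50_ms(samples)

# Usage with a stand-in CPU workload in place of a real kernel call.
print(f"p50: {measure_p50_ms(lambda: sum(range(10_000))):.4f} ms")
```

The median is less sensitive to occasional slow outliers than the mean, which matters most for the short-running small shapes.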
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-norms --backend all --device cuda --rows 4096 --cols 4096 --dtype float16
Example live command
./scripts/benchmark --run-id <run-id> --include-rmsnorm-shape-sweep --with-profiling