Normalization Fusion

How much does a fused Triton RMSNorm or LayerNorm kernel move latency versus the PyTorch baseline?

Project

CUDA Kernel Lab

Focus

Kernel fusion

Run shape

4 cases / 3 profiles

Evidence

Supported: RMSNorm fusion

Why it matters

Compares PyTorch and Triton fused normalization kernels, then checks whether the RMSNorm fp16 win holds across hidden-size shapes.

Normalization is a realistic LLM primitive where fusion can remove expensive intermediate work; the shape sweep makes the win harder to dismiss as one lucky point.
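To make "intermediate work" concrete, here is a minimal pure-Python sketch of RMSNorm written two ways: a multi-pass version that materializes intermediates, and a single-pass version in the style of a fused kernel. It is an illustration only, not the project's Triton kernel; the function names and the `EPS` value are assumptions, since the source does not state them.

```python
import math

EPS = 1e-6  # assumed epsilon; the source does not state the value used


def rmsnorm_unfused(row, weight, eps=EPS):
    """Multi-pass version: each step materializes a full intermediate list."""
    squared = [x * x for x in row]            # intermediate 1
    mean_sq = sum(squared) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normalized = [x * inv_rms for x in row]   # intermediate 2
    return [n * w for n, w in zip(normalized, weight)]


def rmsnorm_fused(row, weight, eps=EPS):
    """Fused style: one reduction over the row, then one pass that writes the output."""
    sum_sq = 0.0
    for x in row:
        sum_sq += x * x
    inv_rms = 1.0 / math.sqrt(sum_sq / len(row) + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


# Hypothetical input row and weight vector, for illustration only.
row = [1.0, -2.0, 3.0, -4.0]
weight = [1.0, 1.0, 0.5, 0.5]
unfused = rmsnorm_unfused(row, weight)
fused = rmsnorm_fused(row, weight)
```

Both functions return the same values up to floating-point rounding. On a GPU, each intermediate in the unfused version would typically be a separate tensor written to and read back from memory, which is the traffic a fused kernel is meant to avoid.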

Supported

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Supported: RMSNorm fusion

Latest reports: 2026-05-21

RMSNorm shape sweep keeps fusion as the largest win

Triton fused RMSNorm produced the largest speedups in the A10G strategy run and stayed ahead across the measured fp16 shape sweep.

RMSNorm fp16 max

5.901x

4096x8192, 0.3103 ms Triton vs 1.831 ms PyTorch

RMSNorm profile

90.91% DRAM

4096x4096 fp16 with 93.12% occupancy

LayerNorm fp32

1.379x

0.3133 ms Triton vs 0.4321 ms PyTorch
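The speedup figures above follow from dividing the PyTorch latency by the Triton latency. A short sketch reproduces them from the reported numbers; the `speedup` helper is a hypothetical name, and the inputs are the rounded latencies shown in this report, so the last digit could differ from a calculation on raw samples.

```python
def speedup(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline (values above 1.0 mean faster)."""
    if candidate_ms <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_ms / candidate_ms


# Latencies as reported above (milliseconds).
rmsnorm_fp16 = speedup(baseline_ms=1.831, candidate_ms=0.3103)
layernorm_fp32 = speedup(baseline_ms=0.4321, candidate_ms=0.3133)

print(f"RMSNorm fp16 4096x8192: {rmsnorm_fp16:.3f}x")   # 5.901x
print(f"LayerNorm fp32:         {layernorm_fp32:.3f}x")  # 1.379x
```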

RMSNorm fp16 shape sweep

Shape       Winner   Triton p50   Speedup   Call
512x1024    Triton   0.05018 ms   2.367x    rerun noise
2048x4096   Triton   0.1024 ms    4.76x     supported
4096x8192   Triton   0.3103 ms    5.901x    supported

Evidence boundary

The two smallest shape-sweep rows were noisy enough that they should be rerun before any precise claim is made, but the larger rows and the profiler counters support the fusion direction.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Fusion wins

Fused RMSNorm

Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.

fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.

Measured: Matmul/Tensor Core gaps

Profiler-backed proof

Use counters to explain benchmark results instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.
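One way to read the high DRAM throughput is through arithmetic intensity: RMSNorm does very little math per byte it moves, so a well-fused kernel is expected to be limited by memory bandwidth rather than compute. The sketch below is a back-of-the-envelope estimate under stated assumptions: about 4 floating-point operations per element (square, accumulate, scale by the inverse RMS, scale by the weight), one read and one write of the activation tensor, and the weight vector ignored because it is shared across rows. None of these counts come from the source.

```python
ASSUMED_FLOPS_PER_ELEM = 4  # square, accumulate, scale by inv-RMS, scale by weight


def activation_bytes(rows, cols, bytes_per_elem):
    """Bytes moved if the input is read once and the output is written once."""
    return rows * cols * bytes_per_elem * 2


def arithmetic_intensity(rows, cols, bytes_per_elem, flops_per_elem=ASSUMED_FLOPS_PER_ELEM):
    """Estimated FLOPs per byte of activation traffic."""
    flops = rows * cols * flops_per_elem
    return flops / activation_bytes(rows, cols, bytes_per_elem)


# The profiled case: 4096x4096 fp16 (2 bytes per element).
profiled_bytes = activation_bytes(4096, 4096, 2)
profiled_intensity = arithmetic_intensity(4096, 4096, 2)

print(profiled_bytes)      # 67108864 bytes, i.e. 64 MiB
print(profiled_intensity)  # 1.0 FLOP per byte under the assumptions above
```

An estimate near one FLOP per byte is consistent with a kernel that saturates DRAM long before it saturates the arithmetic units, which is the direction the profiler counters point in.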

Run shape

RMSNorm fp16 shape sweep from 512x1024 through 4096x8192, plus 4096x4096 LayerNorm and RMSNorm comparisons.

Cases

4 cases

rmsnorm-512x1024-float16

small RMSNorm shape sweep point

512x1024, float16

rmsnorm-2048x4096-float16

mid-size RMSNorm shape sweep point

2048x4096, float16

rmsnorm-4096x4096-float16

RMSNorm forward pass over LLM-shaped rows

4096x4096, float16

layernorm-4096x4096-float32

LayerNorm forward pass over LLM-shaped rows

4096x4096, float32

Run profiles

3 profiles

torch-baseline

PyTorch normalization baseline

triton-fused-rmsnorm

single fused RMSNorm Triton kernel

triton-fused-layernorm

single fused LayerNorm Triton kernel

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, speedup vs PyTorch

Throughput

effective GB/s, effective TFLOP/s

Validation

correctness, dtype tolerance, noise ratio
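The metric groups above can be sketched in a few lines of Python. This is not the project's implementation: the helper names, the nearest-rank percentile method, the noise-ratio definition (relative gap between p95 and p50), the per-dtype tolerances, and the sample latencies are all assumptions made for illustration. Effective GB/s assumes one read and one write of the activation tensor per call.

```python
import math

# Assumed absolute tolerances per dtype; the source does not state its thresholds.
ASSUMED_ATOL = {"float16": 1e-2, "float32": 1e-5}


def percentile(samples, pct):
    """Nearest-rank percentile of latency samples (pct is an integer from 1 to 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def noise_ratio(samples):
    """Assumed definition: relative spread between p95 and p50."""
    p50 = percentile(samples, 50)
    return (percentile(samples, 95) - p50) / p50


def effective_gb_per_s(rows, cols, bytes_per_elem, latency_ms):
    """Bandwidth if the kernel reads the input once and writes the output once."""
    bytes_moved = rows * cols * bytes_per_elem * 2
    return bytes_moved / (latency_ms * 1e-3) / 1e9


def within_tolerance(reference, output, atol):
    """Correctness check: largest absolute error must not exceed atol."""
    return max(abs(r - o) for r, o in zip(reference, output)) <= atol


# Hypothetical latency samples in milliseconds, for illustration only.
latencies_ms = [0.31, 0.30, 0.33, 0.31, 0.40, 0.30, 0.32, 0.31, 0.31, 0.35]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
noise = noise_ratio(latencies_ms)

# Reported 4096x8192 fp16 Triton p50 from the shape sweep.
bandwidth = effective_gb_per_s(4096, 8192, 2, 0.3103)
```

When timing real GPU kernels, launches are asynchronous, so samples are only meaningful if the device is synchronized (or CUDA events are used) around each measured call.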

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-norms --backend all --device cuda --rows 4096 --cols 4096 --dtype float16

Example live command

./scripts/benchmark --run-id <run-id> --include-rmsnorm-shape-sweep --with-profiling
