SwiGLU Elementwise Fusion

Does fusing the SwiGLU elementwise activation remove enough intermediate memory traffic to beat PyTorch?

Project

CUDA Kernel Lab

Focus

Kernel fusion

Run shape

1 case / 2 profiles

Evidence

Supported: SwiGLU fusion

Why it matters

Compares PyTorch and fused Triton SwiGLU at 4096x4096 across float16 and float32.

SwiGLU is a clean elementwise fusion track: no reduction complexity, but enough intermediate activation traffic for fusion to matter.
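
To make the fusion argument concrete, here is a minimal pure-Python sketch of the arithmetic involved. It is not the project's Triton kernel. It assumes the common formulation SwiGLU(gate, up) = silu(gate) * up with two inputs of equal shape; the page does not state how the benchmark lays out its inputs, and the function names are illustrative. The unfused version materializes full intermediates the way a chain of separate elementwise kernels would, while the fused version produces each output in a single pass.

```python
import math

def silu(x: float) -> float:
    # SiLU (swish): x * sigmoid(x), written as x / (1 + e^-x)
    return x / (1.0 + math.exp(-x))

def swiglu_unfused(gate, up):
    # Each step materializes a full-size intermediate,
    # like separate elementwise kernels launched back to back.
    sig = [1.0 / (1.0 + math.exp(-g)) for g in gate]  # intermediate 1
    act = [g * s for g, s in zip(gate, sig)]          # intermediate 2
    return [a * u for a, u in zip(act, up)]           # final output

def swiglu_fused(gate, up):
    # One pass: read gate and up once, write the output once,
    # with no intermediate lists.
    return [silu(g) * u for g, u in zip(gate, up)]

# Small illustrative inputs (assumed values, not benchmark data).
gate = [-2.0, -0.5, 0.0, 0.5, 2.0]
up = [1.0, 2.0, 3.0, 4.0, 5.0]

ref = swiglu_unfused(gate, up)
out = swiglu_fused(gate, up)
max_abs_err = max(abs(a - b) for a, b in zip(ref, out))
```

Both versions compute the same values up to floating-point rounding; the difference is only how many full-size tensors are written and re-read along the way, which is the traffic a fused kernel avoids.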

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Supported: SwiGLU fusion

Latest reports: 2026-05-21

Fused activation removes expensive intermediate work

Fused Triton SwiGLU beat PyTorch in both float16 and float32 on the A10G strategy run.

SwiGLU fp16

2.937x

0.2437 ms Triton vs 0.7158 ms PyTorch

SwiGLU fp32

3.119x

0.4547 ms Triton vs 1.418 ms PyTorch
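
The speedup figures can be reproduced from the recorded latencies as a quick sanity check. The sketch below assumes speedup is defined as PyTorch latency divided by Triton latency; the dictionary layout and function name are illustrative.

```python
# Latencies in milliseconds, copied from the result cards above.
results = {
    "float16": {"pytorch_ms": 0.7158, "triton_ms": 0.2437},
    "float32": {"pytorch_ms": 1.418, "triton_ms": 0.4547},
}

def speedup(pytorch_ms: float, triton_ms: float) -> float:
    # Assumed definition: baseline latency over fused latency.
    return pytorch_ms / triton_ms

# Round to three decimals to match the precision shown on the cards.
speedups = {dtype: round(speedup(**row), 3) for dtype, row in results.items()}
```

Because the published latencies are themselves rounded, a recomputed ratio could differ in the last digit from one derived from raw timings.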

Correctness

pass

both recorded Triton rows passed

Evidence boundary

The result is strong evidence for fusion at 4096x4096 on the A10G. It does not yet cover other shapes, so keep the re-profile gate in place before expanding to adjacent activation shapes.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Fusion wins

Fused SwiGLU

Treat elementwise fusion as a strong path before deeper matmul work.

fp32 0.4547 ms Triton vs 1.418 ms PyTorch; 3.119x speedup.

Run shape

1 fused activation case at 4096x4096, compared across PyTorch and Triton profiles for float16 and float32.

Cases

1 case

swiglu-4096x4096

SwiGLU activation over LLM-shaped hidden states

4096x4096, float16 and float32

Run profiles

2 profiles

torch-baseline

PyTorch SwiGLU baseline

triton-fused-swiglu

single fused Triton activation kernel

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, speedup vs PyTorch

Traffic

effective GB/s, intermediate tensor removal, noise ratio

Validation

correctness, dtype, shape
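
As a rough illustration of how the latency and traffic metrics could be derived from raw timings, here is a minimal sketch. Several things in it are assumptions rather than the project's definitions: the nearest-rank percentile method, the noise ratio formula (p95 minus p50, relative to p50), and the traffic model that counts three tensors of the case shape (read gate, read up, write output). The latency samples are made up for the example.

```python
import math

def percentile(samples, q):
    # Nearest-rank percentile: the smallest sample such that
    # at least q percent of all samples are <= it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def effective_gb_per_s(rows, cols, bytes_per_elem, latency_ms, tensors_touched=3):
    # Assumed fused-kernel traffic model: read gate, read up, write out.
    total_bytes = tensors_touched * rows * cols * bytes_per_elem
    return total_bytes / (latency_ms * 1e-3) / 1e9

# Hypothetical latency samples in ms (not measured data).
samples_ms = [0.244, 0.243, 0.245, 0.243, 0.251,
              0.244, 0.246, 0.243, 0.244, 0.260]

p50 = percentile(samples_ms, 50)
p95 = percentile(samples_ms, 95)
noise_ratio = (p95 - p50) / p50  # assumed definition of "noise ratio"

# Effective bandwidth for the recorded Triton latencies at 4096x4096.
fp16_gbps = effective_gb_per_s(4096, 4096, 2, 0.2437)
fp32_gbps = effective_gb_per_s(4096, 4096, 4, 0.4547)
```

Under this assumed three-tensor model, the recorded Triton latencies correspond to roughly 413 GB/s for float16 and 443 GB/s for float32. A different input layout or baseline decomposition would change these figures.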

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-swiglu --backend all --device cuda --rows 4096 --cols 4096 --dtype float32

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling
