SwiGLU Elementwise Fusion
Does fusing the SwiGLU elementwise activation remove enough intermediate memory traffic to beat the PyTorch baseline?
Project
CUDA Kernel Lab
Focus
Kernel fusion
Run shape
1 case / 2 profiles
Evidence
Supported: SwiGLU fusion
Why it matters
Compares PyTorch and fused Triton SwiGLU at 4096x4096 across float16 and float32.
SwiGLU is a clean elementwise fusion track: no reduction complexity, but enough intermediate activation traffic for fusion to matter.
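To make the fusion argument concrete, here is a minimal pure-Python sketch. It assumes the common formulation SwiGLU(gate, up) = SiLU(gate) * up over two same-shaped inputs; the page does not state which formulation the benchmark uses, and the function names are hypothetical. The unfused version materializes two intermediate buffers, as a chain of separate elementwise ops would, while the fused version produces each output element in a single pass.

```python
import math


def sigmoid(x: float) -> float:
    # Plain logistic function; fine for the small illustrative values used here.
    return 1.0 / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    # Three separate elementwise passes, each writing a full-size buffer.
    sig = [sigmoid(g) for g in gate]             # intermediate 1
    silu = [g * s for g, s in zip(gate, sig)]    # intermediate 2
    return [a * u for a, u in zip(silu, up)]     # final output


def swiglu_fused(gate, up):
    # One pass: read gate and up once, write the output once.
    return [g * sigmoid(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    gate = [-2.0, 0.0, 1.5]
    up = [0.5, 3.0, -1.0]
    print(swiglu_unfused(gate, up))
    print(swiglu_fused(gate, up))
```

Both functions compute the same values; the only difference is how many full-size buffers are written and re-read along the way, which is the cost a fused GPU kernel is meant to avoid.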
Supported
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Fused activation removes expensive intermediate work
Fused Triton SwiGLU beat PyTorch in both float16 and float32 on the A10G strategy run.
SwiGLU fp16
2.937x
0.2437 ms Triton vs 0.7158 ms PyTorch
SwiGLU fp32
3.119x
0.4547 ms Triton vs 1.418 ms PyTorch
Correctness
pass
both recorded Triton rows passed
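The speedup figures above follow directly from the recorded latencies. A short check, using only the numbers on this page (the dictionary layout and function name are illustrative, not part of the project's tooling):

```python
# Latencies in milliseconds, copied from the result cards above.
runs = {
    "float16": {"torch_ms": 0.7158, "triton_ms": 0.2437},
    "float32": {"torch_ms": 1.418, "triton_ms": 0.4547},
}


def speedup(torch_ms: float, triton_ms: float) -> float:
    # Speedup is baseline latency divided by fused latency.
    return torch_ms / triton_ms


if __name__ == "__main__":
    for dtype, r in runs.items():
        print(f"{dtype}: {speedup(**r):.3f}x")
    # float16: 2.937x
    # float32: 3.119x
```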
Evidence boundary
The result is strong evidence for fusion at this one shape (4096x4096); keep it behind the same re-profile gate before extending it to adjacent activation shapes.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Fused SwiGLU
Treat elementwise fusion as a strong path before deeper matmul work.
fp32 0.4547 ms Triton vs 1.418 ms PyTorch; 3.119x speedup.
Decision record
Run shape
1 fused activation case at 4096x4096, compared across PyTorch and Triton profiles for float16 and float32.
Cases
1 case
swiglu-4096x4096
SwiGLU activation over LLM-shaped hidden states
Run profiles
2 profiles
torch-baseline
PyTorch SwiGLU baseline
triton-fused-swiglu
Single fused Triton activation kernel
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Traffic
Validation
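For the Traffic group, one rough way to put the latency numbers in context is to estimate the minimum bytes a fused kernel must move and convert that into an effective bandwidth. The sketch below assumes the kernel reads two 4096x4096 inputs and writes one 4096x4096 output (three tensor-sized transfers); the benchmark's real tensor layout is not stated on this page, so treat `tensors_touched` and the helper names as assumptions.

```python
ROWS, COLS = 4096, 4096
BYTES_PER_ELEM = {"float16": 2, "float32": 4}

# Fused Triton latencies in milliseconds, from the result cards on this page.
TRITON_MS = {"float16": 0.2437, "float32": 0.4547}


def min_bytes_moved(rows: int, cols: int, dtype: str, tensors_touched: int = 3) -> int:
    # Assumption: two input reads plus one output write, each rows * cols elements.
    return rows * cols * BYTES_PER_ELEM[dtype] * tensors_touched


def effective_gb_per_s(bytes_moved: int, latency_ms: float) -> float:
    # Convert ms to seconds, then bytes/s to GB/s (1 GB = 1e9 bytes).
    return bytes_moved / (latency_ms * 1e-3) / 1e9


if __name__ == "__main__":
    for dtype, ms in TRITON_MS.items():
        moved = min_bytes_moved(ROWS, COLS, dtype)
        print(f"{dtype}: {moved} bytes, {effective_gb_per_s(moved, ms):.1f} GB/s effective")
```

Comparing the effective figure with the device's peak memory bandwidth gives a quick sense of how close the fused kernel is to being bandwidth-bound, which is the regime where removing intermediate reads and writes pays off most.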
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-swiglu --backend all --device cuda --rows 4096 --cols 4096 --dtype float32
Example live command
./scripts/benchmark --run-id <run-id> --with-profiling