Row Softmax Fusion

Does the current Triton fused row-softmax kernel beat PyTorch on a 4096x1024 input?

Project

CUDA Kernel Lab

Focus

Kernel fusion

Run shape

1 case / 2 profiles

Evidence

Rejected: Softmax win claim

Why it matters

Compares PyTorch and Triton row-softmax at 4096x1024 for float16 and float32.

Negative evidence is useful: it prevents the site from claiming every custom kernel is a win and points the next optimization question at row-shape and launch behavior.

Rejected

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Rejected: Softmax win claim

Latest reports: 2026-05-21

Current softmax kernel should not be presented as a win

PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32 on the A10G strategy run.

PyTorch fp16

0.05018 ms

334.4 GB/s

Triton fp16

0.06554 ms

256.0 GB/s

Triton fp32

0.8247x

speedup vs PyTorch, below parity
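The bandwidth figures above can be cross-checked against the latencies with a little arithmetic. The sketch below assumes the effective-bandwidth metric counts one read plus one write of the full 4096x1024 tensor; the report does not state this, but the assumption lands close to the reported fp16 values. The function names are hypothetical, and the fp16 ratio printed at the end is derived from the two reported latencies rather than taken from the report.

```python
# Sketch: reproduce the reported effective bandwidth from latency.
# Assumption (not stated in the report): traffic is one read plus one
# write of the full tensor, and 1 GB = 1e9 bytes.

ROWS, COLS = 4096, 1024
FP16_BYTES = 2


def effective_gb_per_s(latency_ms: float, itemsize: int = FP16_BYTES) -> float:
    """Effective bandwidth in GB/s for one softmax pass over the tensor."""
    bytes_moved = 2 * ROWS * COLS * itemsize  # read input + write output
    return bytes_moved / (latency_ms * 1e-3) / 1e9


def speedup_vs_baseline(baseline_ms: float, candidate_ms: float) -> float:
    """Values below 1.0 mean the candidate is slower than the baseline."""
    return baseline_ms / candidate_ms


torch_fp16_ms = 0.05018   # reported PyTorch fp16 latency
triton_fp16_ms = 0.06554  # reported Triton fp16 latency

print(f"PyTorch fp16: {effective_gb_per_s(torch_fp16_ms):.1f} GB/s")
print(f"Triton fp16:  {effective_gb_per_s(triton_fp16_ms):.1f} GB/s")
print(f"fp16 speedup: {speedup_vs_baseline(torch_fp16_ms, triton_fp16_ms):.4f}x")
```

Small differences from the reported GB/s values are expected, since the latencies shown on this page are already rounded.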

Evidence boundary

This rejects the current implementation as a portfolio win; it does not reject future softmax optimization after row-shape and profiler work.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Rejected

Memory/reduction boundaries

Row softmax

Do not present the current row-softmax kernel as a portfolio win.

PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32.

Decision record

Run shape

1 row-softmax case at 4096x1024, compared across PyTorch and Triton profiles for float16 and float32.

Cases

1 case

softmax-4096x1024

row-wise softmax over a modest attention-like row shape

4096x1024

float16 and float32
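For reference, row-wise softmax normalizes each row independently: subtract the row maximum, exponentiate, and divide by the row sum. The pure-Python sketch below only illustrates the math on a toy input. It is not the PyTorch baseline or the Triton kernel under test, and `row_softmax` is a hypothetical name. A fused kernel typically aims to do both reductions and the normalization for a row in a single pass over memory.

```python
import math


def row_softmax(matrix):
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in matrix:
        row_max = max(row)                           # reduction 1: row maximum
        exps = [math.exp(x - row_max) for x in row]  # shift avoids overflow
        denom = sum(exps)                            # reduction 2: row sum
        out.append([e / denom for e in exps])
    return out


# Toy input; the second row would overflow without the max subtraction.
probs = row_softmax([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
for row in probs:
    print(row, sum(row))
```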

Run profiles

2 profiles

torch-baseline

PyTorch softmax baseline

triton-fused-row-softmax

current fused Triton row-softmax kernel

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, noise ratio

Throughput

effective GB/s, effective TFLOP/s

Next proof

row-size sweep, occupancy, memory throughput
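The latency metrics listed above could be summarized from a list of per-iteration timings roughly as follows. This is a minimal sketch, not the project's reporting code: the nearest-rank percentile is one of several common conventions, and the noise-ratio definition, (p95 - p50) / p50, is an assumption because this page does not define it. The sample timings are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize per-iteration latencies given in milliseconds."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    # Hypothetical definition: relative spread between tail and median.
    noise_ratio = (p95 - p50) / p50
    return {"p50_ms": p50, "p95_ms": p95, "noise_ratio": noise_ratio}


# Made-up timings for illustration only, not measurements from any run.
samples = [0.050, 0.051, 0.050, 0.052, 0.049, 0.050, 0.058, 0.050, 0.051, 0.050]
summary = latency_summary(samples)
print(summary)
```

A high noise ratio under this definition would suggest rerunning with more iterations before comparing backends.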

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-softmax --backend all --device cuda --rows 4096 --cols 1024 --dtype float16

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling