Row Softmax Fusion
Does the current Triton fused row-softmax kernel beat PyTorch for 4096x1024 rows?
Project
CUDA Kernel Lab
Focus
Kernel fusion
Run shape
1 case / 2 profiles
Evidence
Rejected: Softmax win claim
Why it matters
Compares PyTorch and Triton row-softmax at 4096x1024 for float16 and float32.
Negative evidence is useful: it prevents the site from claiming every custom kernel is a win and points the next optimization question at row-shape and launch behavior.
Rejected
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Current softmax kernel should not be presented as a win
PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32 on the A10G strategy run.
PyTorch fp16
0.05018 ms
334.4 GB/s
Triton fp16
0.06554 ms
256.0 GB/s
Triton fp32
0.8247x
speedup vs PyTorch, below parity
Evidence boundary
This rejects the current implementation as a portfolio win; it does not reject future softmax optimization after row-shape and profiler work.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Row softmax
Do not present the current row-softmax kernel as a portfolio win.
PyTorch beat the current Triton fused row-softmax kernel in both float16 and float32.
Decision recordRun shape
Run shape
1 row-softmax case at 4096x1024, compared across PyTorch and Triton profiles for float16 and float32.
Cases
1 case
softmax-4096x1024
row-wise softmax over a modest attention-like row shape
Run profiles
2 profiles
torch-baseline
PyTorch softmax baseline
triton-fused-row-softmax
current fused Triton row-softmax kernel
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Throughput
Next proof
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-softmax --backend all --device cuda --rows 4096 --cols 1024 --dtype float16Example live command
./scripts/benchmark --run-id <run-id> --with-profiling