Reduction Strategy Comparison
Does an iterative Triton reduction or a two-pass Triton reduction perform better for a 16M-element float32 sum?
Project
CUDA Kernel Lab
Focus
Kernel reductions
Run shape
1 case / 3 profiles
Evidence
Selected report: Reduction strategy
Why it matters
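This experiment keeps shape, dtype, device, and block size fixed while comparing iterative and two-pass reduction strategies, so the reduction strategy is the only variable that changes between the Triton profiles.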
Reduction work introduces synchronization and partial writes, making it the first useful step beyond one-element-per-thread memory traffic.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Reduction variants stream well but still trail PyTorch
Both Triton reduction variants passed correctness and profiled with high DRAM throughput and occupancy, but still trailed the PyTorch baseline end to end.
PyTorch
0.1485 ms
float32 p50, 452.9 GB/s
Iterative Triton
0.1761 ms
381.8 GB/s, 93.59% profiled DRAM throughput
Two-pass Triton
0.1792 ms
375.2 GB/s, 98.89% profiled occupancy
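The bandwidth figures above can be roughly cross-checked from the reported latencies. The sketch below assumes that only the input read is counted (16,777,216 float32 elements at 4 bytes each) and that 1 GB means 1e9 bytes. The report may count bytes differently or use unrounded latencies, so this is a sanity check and not the project's own calculation. The helper name `effective_gbps` is hypothetical.

```python
# Cross-check of effective bandwidth from the reported p50 latencies.
# Assumption: traffic = input bytes only (N * 4); the report's exact
# byte accounting is not stated in the source.
N_ELEMENTS = 16_777_216   # from the run shape
BYTES_PER_ELEMENT = 4     # float32

latency_ms = {
    "PyTorch": 0.1485,
    "Iterative Triton": 0.1761,
    "Two-pass Triton": 0.1792,
}

def effective_gbps(n_elements, bytes_per_element, ms):
    """Bytes read divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return n_elements * bytes_per_element / (ms * 1e-3) / 1e9

bandwidth = {
    name: effective_gbps(N_ELEMENTS, BYTES_PER_ELEMENT, ms)
    for name, ms in latency_ms.items()
}
# Latency relative to the PyTorch baseline (1.0 means equal).
slowdown = {name: ms / latency_ms["PyTorch"] for name, ms in latency_ms.items()}

for name in latency_ms:
    print(f"{name}: {bandwidth[name]:.1f} GB/s, {slowdown[name]:.3f}x PyTorch latency")
```

Under these assumptions the computed values land within about 0.3% of the reported 452.9, 381.8, and 375.2 GB/s, and the two Triton variants come out roughly 1.19x and 1.21x the PyTorch latency.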
Evidence boundary
With profiled DRAM throughput and occupancy already high, the profiler points away from a simple memory-coalescing fix; the next change should target launch/finalization overhead or a different reduction structure.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Reduction strategy
Keep reduction variants as an active track, but do not claim a PyTorch win yet.
Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.
Decision record
Run shape
1 reduction case across a 16,777,216-element float32 tensor, compared across PyTorch, iterative Triton, and two-pass Triton profiles.
Cases
1 case
reduction-sum-float32-16m
global sum over a 16M-element tensor
Run profiles
3 profiles
torch-baseline
framework reduction baseline
triton-reduction-iterative
iterative reduction strategy
triton-reduction-two-pass
partial reduction followed by final pass
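The two Triton profiles can be pictured with a CPU-only sketch. This is not the project's kernel code: it models each kernel launch as a pass over fixed-size blocks, and it assumes that "iterative" means repeating the block reduction until one value remains, which the source does not spell out. The function names and the block size of 256 are hypothetical.

```python
import math

def block_partials(values, block_size):
    """Model one launch: each block of block_size elements becomes one partial sum."""
    return [sum(values[i:i + block_size]) for i in range(0, len(values), block_size)]

def two_pass_sum(values, block_size):
    """Pass 1 writes one partial per block; pass 2 reduces all partials at once."""
    partials = block_partials(values, block_size)
    return sum(partials)

def iterative_sum(values, block_size):
    """Assumed strategy: repeat the block reduction until a single value is left.

    Returns the total and the number of passes (modelled launches) it took.
    """
    if block_size < 2:
        raise ValueError("block_size must be at least 2 to make progress")
    if not values:
        return 0.0, 0
    passes = 0
    while len(values) > 1:
        values = block_partials(values, block_size)
        passes += 1
    return values[0], passes

# Small integer-valued floats keep every partial sum exact.
data = [float(i % 7) for i in range(4096)]
reference = math.fsum(data)

two_pass_total = two_pass_sum(data, 256)
iterative_total, iterative_passes = iterative_sum(data, 256)
print(two_pass_total == reference, iterative_total == reference, iterative_passes)
```

Both variants write partial results to intermediate storage before producing the final value, which is the extra traffic and synchronization that separates a reduction from an elementwise kernel.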
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Traffic
Profiler evidence
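For the latency group, the results are reported as p50 values. A minimal sketch of a p50 timer is shown below, using only the Python standard library and a CPU workload as a stand-in. The helper `p50_latency_ms`, the warmup count, and the repeat count are hypothetical and are not taken from the project's harness. On a GPU the timed region would also need device synchronization (for example `torch.cuda.synchronize()`), because kernel launches return before the work finishes.

```python
import statistics
import time

def p50_latency_ms(fn, warmup=5, repeats=50):
    """Run fn a few times untimed, then return the median wall time in ms."""
    for _ in range(warmup):
        fn()  # warm caches and any lazy initialization before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in workload: a CPU sum over a small list, not the 16M-element GPU case.
data = [1.0] * 100_000
p50 = p50_latency_ms(lambda: sum(data))
print(f"p50 latency: {p50:.4f} ms")
```

The median is less sensitive to occasional slow samples than the mean, which is one reason to report p50 for short kernels.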
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-memory --backend all --device cuda --op reduction_sum --dtype float32
Example live command
./scripts/benchmark --run-id <run-id> --with-profiling