Reduction Strategy Comparison

Does an iterative Triton reduction or a two-pass Triton reduction perform better for a 16M-element float32 sum?

Project: CUDA Kernel Lab

Focus: Kernel reductions

Run shape: 1 case / 3 profiles

Evidence: Selected report, Reduction strategy

Why it matters

This experiment keeps shape, dtype, device, and block size fixed while comparing iterative and two-pass reduction strategies.

Reduction work introduces synchronization and partial writes, making it the first useful step beyond one-element-per-thread memory traffic.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Reduction strategy

Latest reports: 2026-05-21

Reduction variants stream well but still trail PyTorch

Both Triton reduction variants passed correctness and profiled with high DRAM throughput and occupancy, but still trailed the PyTorch baseline end to end.

PyTorch

0.1485 ms

float32 p50, 452.9 GB/s

Iterative Triton

0.1761 ms

381.8 GB/s, 93.59% profiled DRAM throughput

Two-pass Triton

0.1792 ms

375.2 GB/s, 98.89% profiled occupancy
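
The GB/s figures above can be roughly reproduced from the latencies. The sketch below assumes that effective bandwidth counts only the 16,777,216 float32 input reads at 4 bytes each; that traffic model is an assumption, not something the report states, and the helper names are illustrative. Under that assumption the estimates land within about 0.3% of the reported values, so the report probably counts slightly more traffic or works from unrounded latencies.

```python
# Rough check of the reported bandwidth figures.
# Assumption: traffic = input reads only (N float32 elements).
N_ELEMENTS = 16_777_216
BYTES_PER_ELEMENT = 4  # float32

# Latency in ms and reported GB/s, copied from the result cards.
REPORTED = {
    "PyTorch": (0.1485, 452.9),
    "Iterative Triton": (0.1761, 381.8),
    "Two-pass Triton": (0.1792, 375.2),
}


def effective_gbps(latency_ms, n_elements=N_ELEMENTS):
    """Input bytes divided by latency, in GB/s (1 GB = 1e9 bytes)."""
    return n_elements * BYTES_PER_ELEMENT / (latency_ms * 1e-3) / 1e9


def slowdown_vs(baseline_ms, latency_ms):
    """Latency ratio relative to the baseline (1.0 means equal)."""
    return latency_ms / baseline_ms


if __name__ == "__main__":
    base_ms = REPORTED["PyTorch"][0]
    for name, (ms, reported) in REPORTED.items():
        print(f"{name}: estimated {effective_gbps(ms):.1f} GB/s, "
              f"reported {reported} GB/s, "
              f"{slowdown_vs(base_ms, ms):.3f}x PyTorch latency")
```

On these numbers the iterative and two-pass variants take about 1.19x and 1.21x the PyTorch latency, respectively.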

Evidence boundary

The profiler points away from a simple memory-coalescing fix; the next change should target launch/finalization overhead or a different reduction structure.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Caveated

Memory/reduction boundaries

Reduction strategy

Keep reduction variants as an active track, but do not claim a PyTorch win yet.

Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.

Run shape

One reduction case on a 16,777,216-element float32 tensor, compared across three profiles: PyTorch, iterative Triton, and two-pass Triton.

Cases

1 case

reduction-sum-float32-16m

global sum over a 16M-element tensor

16,777,216 elements, float32

Run profiles

3 profiles

torch-baseline

framework reduction baseline

triton-reduction-iterative

iterative reduction strategy

triton-reduction-two-pass

partial reduction followed by final pass
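
The report does not show the kernels, so the pure-Python sketch below only illustrates the control flow that the two Triton profile descriptions suggest. It assumes that "iterative" means re-applying a block reduction until one value remains, and it uses a hypothetical block size of 1024; neither detail is confirmed by the source, and all function names are illustrative.

```python
def _check_block_size(block_size):
    if block_size < 2:
        raise ValueError("block_size must be at least 2 to make progress")


def block_sums(values, block_size):
    """Reduce each contiguous block to one partial sum, as one launch would."""
    return [sum(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def two_pass_sum(values, block_size):
    """Pass 1 writes one partial per block; pass 2 reduces all partials at once."""
    _check_block_size(block_size)
    partials = block_sums(values, block_size)
    return sum(partials)


def iterative_sum(values, block_size):
    """Re-apply the block reduction until a single value remains."""
    _check_block_size(block_size)
    level = list(values)
    while len(level) > 1:
        level = block_sums(level, block_size)
    return level[0] if level else 0.0


def iterative_level_sizes(n_elements, block_size):
    """Number of partial sums left after each iterative pass."""
    _check_block_size(block_size)
    sizes = []
    while n_elements > 1:
        n_elements = -(-n_elements // block_size)  # ceiling division
        sizes.append(n_elements)
    return sizes


if __name__ == "__main__":
    data = [float(i) for i in range(10_000)]
    print(two_pass_sum(data, 1024), iterative_sum(data, 1024))
    print(iterative_level_sizes(16_777_216, 1024))
```

With the assumed block size of 1024, the first pass over 16,777,216 elements leaves 16,384 partial sums (64 KiB of float32), which is small next to the 64 MiB input.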

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, p99 latency

Traffic

effective GB/s, estimated reads and writes, partial tensor traffic

Profiler evidence

occupancy, memory throughput, register pressure
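
The latency percentiles listed above can be derived from raw per-iteration timings. The sketch below uses the nearest-rank definition; the report does not say which percentile method the benchmark applies, and the sample timings are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the p50, p95, and p99 latencies for a list of timings in ms."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Hypothetical timings in ms: mostly steady with a slow tail.
    timings = [0.15] * 95 + [0.18] * 4 + [0.30]
    print(latency_summary(timings))
```

Reporting p95 and p99 next to p50 shows whether a variant's median hides a slow tail, which a single average would blur.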

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-memory --backend all --device cuda --op reduction_sum --dtype float32

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling
