Tony Lee

Memory Primitive Bandwidth

Do simple Triton copy, scale, and vector_add kernels beat the optimized PyTorch memory path on A10G?

Project

CUDA Kernel Lab

Focus

Kernel memory traffic

Run shape

3 cases / 4 profiles

Evidence

Selected report: Memory bandwidth

Why it matters

Compares PyTorch and Triton memory primitives across copy, scale, vector_add, and reduction_sum rows.

Simple memory-bound kernels are the control group. If a custom kernel does not beat the framework path, the next step should be profiler explanation rather than a broad parameter sweep.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Memory bandwidth

Latest reports: 2026-05-21

PyTorch still owns simple memory primitives

The A10G strategy run kept PyTorch ahead on copy, scale, and vector_add, while the Triton vector_add profile showed high DRAM throughput rather than an obvious block-size fix.

Rows

18 memory rows

copy, scale, reduction_sum, and vector_add variants

PyTorch vector_add

467 GB/s

float32 p50 0.4311 ms

Triton profile

91.58% DRAM

float32 vector_add with 81.80% occupancy

Memory primitive comparison

Primitive: copy fp32

Fastest backend: PyTorch

p50: 0.2888 ms

GB/s: 464.8

Call: baseline leads

Primitive: scale fp32

Fastest backend: PyTorch

p50: 0.2929 ms

GB/s: 458.3

Call: baseline leads

Primitive: vector_add fp32

Fastest backend: PyTorch

p50: 0.4311 ms

GB/s: 467.0

Call: fusion next
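The GB/s column can be cross-checked from the p50 latencies and the tensor size. The sketch below assumes a simple traffic model that the report does not spell out: copy and scale each read one tensor and write one, while vector_add reads two and writes one. The names `TENSORS_TOUCHED` and `effective_gbps` are illustrative and not taken from the project.

```python
BYTES_PER_FLOAT32 = 4
NUMEL = 16_777_216  # elements per tensor, from the run shape

# Assumed traffic model: number of full tensors read plus written per op.
TENSORS_TOUCHED = {"copy": 2, "scale": 2, "vector_add": 3}


def effective_gbps(op, p50_ms, numel=NUMEL, itemsize=BYTES_PER_FLOAT32):
    """Estimated bytes moved divided by p50 latency, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = TENSORS_TOUCHED[op] * numel * itemsize
    seconds = p50_ms * 1e-3
    return bytes_moved / seconds / 1e9


# p50 values copied from the comparison rows above.
for op, p50_ms in [("copy", 0.2888), ("scale", 0.2929), ("vector_add", 0.4311)]:
    print(f"{op:<11} {effective_gbps(op, p50_ms):6.1f} GB/s")
```

Under that model the computed figures agree with the reported 464.8, 458.3, and 467.0 GB/s to within about 0.1 GB/s, which is the rounding error expected from a four-digit p50.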

Evidence boundary

This is not evidence against custom kernels broadly; it bounds simple memory traffic and redirects optimization toward fusion or reuse before wider launch sweeps.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Caveated

Memory/reduction boundaries

Memory primitives

Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.

Triton vector_add profile shows 91.58% DRAM throughput rather than an obvious block-size fix.

Decision record

Run shape

3 memory cases on 16,777,216-element float32 tensors, compared across one PyTorch baseline profile and three Triton block-size profiles.

Cases

3 cases

copy-float32-16m

device-to-device copy bandwidth

16,777,216 elements, float32

scale-float32-16m

single-input scale kernel bandwidth

16,777,216 elements, float32

vector-add-float32-16m

two-input vector add bandwidth

16,777,216 elements, float32

Run profiles

4 profiles

torch-baseline

framework memory primitive baseline

triton-block-size-512

Triton vector_add launch block size 512

triton-block-size-1024

Triton vector_add launch block size 1024

triton-block-size-2048

Triton vector_add launch block size 2048
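To make the three block-size profiles concrete, the sketch below computes how many program instances each would launch for a 16,777,216-element tensor. It assumes the common one-dimensional Triton launch pattern in which each program handles one block of `BLOCK_SIZE` contiguous elements; the report does not show the kernel, so this is an assumption, and `num_programs` is a hypothetical helper equivalent to a ceiling division.

```python
NUMEL = 16_777_216  # elements per tensor, from the run shape


def num_programs(numel, block_size):
    """Ceiling division: programs needed when each covers block_size elements."""
    return (numel + block_size - 1) // block_size


for block_size in (512, 1024, 2048):
    print(f"block size {block_size:>4}: {num_programs(NUMEL, block_size):>6} programs")
```

Every profile moves the same number of bytes; only the split of work across programs changes. That is consistent with the report's reading that a profile already at 91.58% DRAM throughput is unlikely to be fixed by block size alone.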

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, p99 latency
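The latency percentiles can be derived from per-iteration timings. The sketch below uses the nearest-rank definition, which is an assumption: the report does not say which percentile method the benchmark uses, and `percentile` and `latency_summary` are illustrative names rather than project functions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Collect the three latency metrics listed above into one dict."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Calling `latency_summary` on a list of per-iteration timings in milliseconds returns a dict with `p50`, `p95`, and `p99` keys. Interpolating methods give slightly different values on small sample counts, so the method should stay fixed when comparing backends.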

Memory

effective GB/s, estimated bytes moved, noise ratio

Validation

correctness, device, dtype

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-memory --backend all --op all --numel 16777216 --dtype float32

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling