Memory Primitive Bandwidth
Do simple Triton copy, scale, and vector_add kernels beat the optimized PyTorch memory path on A10G?
Project
CUDA Kernel Lab
Focus
Kernel memory traffic
Run shape
3 cases / 4 profiles
Evidence
Selected report: Memory bandwidth
Why it matters
Compares PyTorch and Triton memory primitives across copy, scale, vector_add, and reduction_sum rows.
Simple memory-bound kernels are the control group. If a custom kernel does not beat the framework path, the next step should be profiler explanation rather than a broad parameter sweep.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
PyTorch still owns simple memory primitives
The A10G strategy run kept PyTorch ahead on copy, scale, and vector_add, while the Triton vector_add profile showed high DRAM throughput and no obvious block-size fix.
Rows
18 memory rows
copy, scale, reduction_sum, and vector_add variants
PyTorch vector_add
467 GB/s
float32 p50 0.4311 ms
Triton profile
91.58% DRAM
float32 vector_add with 81.80% occupancy
Memory primitive comparison
Primitive: copy fp32
Fastest backend: PyTorch
p50: 0.2888 ms
GB/s: 464.8
Call: baseline leads
Primitive: scale fp32
Fastest backend: PyTorch
p50: 0.2929 ms
GB/s: 458.3
Call: baseline leads
Primitive: vector_add fp32
Fastest backend: PyTorch
p50: 0.4311 ms
GB/s: 467.0
Call: fusion next
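The GB/s column is consistent with dividing total bytes moved by the p50 latency. A minimal sketch of that arithmetic follows. It assumes each primitive reads its inputs and writes its output exactly once (copy and scale touch two tensors, vector_add touches three), that float32 is 4 bytes, and that 1 GB is 10^9 bytes. The name `effective_gbps` and the tensor-count table are illustrative and not taken from the project.

```python
NUMEL = 16_777_216          # elements per tensor, from the run shape
BYTES_PER_ELEM = 4          # float32

# Assumed tensor counts: inputs read plus outputs written, once each.
TENSORS_TOUCHED = {"copy": 2, "scale": 2, "vector_add": 3}


def effective_gbps(op: str, p50_ms: float, numel: int = NUMEL) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel call."""
    bytes_moved = TENSORS_TOUCHED[op] * numel * BYTES_PER_ELEM
    return bytes_moved / (p50_ms * 1e-3) / 1e9


if __name__ == "__main__":
    # p50 values are the rounded PyTorch figures from the comparison above.
    for op, p50_ms in [("copy", 0.2888), ("scale", 0.2929), ("vector_add", 0.4311)]:
        print(f"{op:<11} {effective_gbps(op, p50_ms):6.1f} GB/s")
```

With the rounded p50 values this lands within about 0.1 GB/s of the reported figures; the small gaps are likely due to the latencies being rounded to four decimal places.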
Evidence boundary
This is not evidence against custom kernels broadly; it bounds simple memory traffic and redirects optimization toward fusion or reuse before wider launch sweeps.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Memory primitives
Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.
The Triton vector_add profile shows 91.58% DRAM throughput and no obvious block-size fix.
Decision record
Run shape
3 memory cases on 16,777,216-element float32 tensors, compared across the PyTorch baseline and three Triton block-size profiles.
Cases
3 cases
copy-float32-16m
device-to-device copy bandwidth
scale-float32-16m
single-input scale kernel bandwidth
vector-add-float32-16m
two-input vector add bandwidth
Run profiles
4 profiles
torch-baseline
framework memory primitive baseline
triton-block-size-512
Triton vector_add launch block size 512
triton-block-size-1024
Triton vector_add launch block size 1024
triton-block-size-2048
Triton vector_add launch block size 2048
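To see what the three block-size profiles change, assume a 1D launch in which each Triton program instance handles one contiguous block of elements. That is a common pattern for elementwise kernels, but it is an assumption here, since the project's launch code is not shown. Under that assumption the grid size is the element count divided by the block size, rounded up; `grid_size` is a hypothetical helper name.

```python
NUMEL = 16_777_216  # elements per tensor, from the run shape


def grid_size(numel: int, block_size: int) -> int:
    """Program instances needed to cover numel elements (ceiling division)."""
    return (numel + block_size - 1) // block_size


if __name__ == "__main__":
    for block_size in (512, 1024, 2048):
        print(f"block {block_size:>4}: {grid_size(NUMEL, block_size):>6} programs")
```

The total bytes moved are identical across the three profiles; only the split between program count and elements per program changes.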
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Memory
Validation
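The latency figures in this report are p50 values, meaning the median over repeated timed calls. Below is a minimal sketch of a median-latency harness using only the standard library. The name `time_p50_ms`, the warmup count, and the iteration count are hypothetical and not taken from the project. A real GPU benchmark would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()`), which this CPU-only sketch omits.

```python
import statistics
import time
from typing import Callable


def time_p50_ms(fn: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):        # untimed calls to reach steady state
        fn()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples_ms)


if __name__ == "__main__":
    data = list(range(100_000))
    print(f"p50: {time_p50_ms(lambda: sum(data)):.4f} ms")
```

The median is less sensitive than the mean to occasional slow iterations, which is one reason to report it for short kernels.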
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-memory --backend all --op all --numel 16777216 --dtype float32
Example live command
./scripts/benchmark --run-id <run-id> --with-profiling