Profiler Validation

Do Nsight Compute counters confirm or challenge the benchmark interpretation for memory, fusion, reduction, and matmul kernels?


Project

CUDA Kernel Lab

Focus

Profiler evidence

Run shape

4 cases / 2 profiles

Evidence

Selected report: Nsight counters

Why it matters

Tracks compact profiler summaries that explain why the memory primitives trail PyTorch, why the RMSNorm win is credible, and why matmul still needs tuning.

Benchmark numbers show what moved; profiler counters are what make the optimization explanation credible.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Nsight counters

Latest reports: 2026-05-21

Profiler counters turn benchmark results into optimization calls

Compact Nsight summaries now cover memory, reduction, RMSNorm, and matmul targets from the strategy run.

RMSNorm fp16

90.91% DRAM

93.12% occupancy, 40 registers/thread

Vector add fp32

91.58% DRAM

81.80% occupancy, 26 registers/thread

Matmul tile

45.19% Tensor Core

22.49% occupancy, 80 registers/thread
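One way to read the three counter sets above is as a rough triage rule. The sketch below is illustrative only: the `KernelProfile` type, the `classify` helper, and both thresholds are assumptions introduced here, not part of the project. Only the counter values come from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelProfile:
    """Compact counters for one profiled kernel (hypothetical container)."""
    name: str
    occupancy_pct: float
    registers_per_thread: int
    dram_pct: Optional[float] = None          # DRAM throughput, % of peak
    tensor_core_pct: Optional[float] = None   # Tensor Core utilization, %


# Assumed thresholds for illustration; tune them for the target GPU.
DRAM_BOUND_PCT = 80.0
LOW_OCCUPANCY_PCT = 30.0


def classify(profile: KernelProfile) -> str:
    """Return a coarse label that suggests where to look next."""
    if profile.dram_pct is not None and profile.dram_pct >= DRAM_BOUND_PCT:
        return "bandwidth-bound"
    if profile.occupancy_pct < LOW_OCCUPANCY_PCT:
        return "occupancy-limited"
    return "unclassified"


# Values copied from the selected report.
PROFILES = [
    KernelProfile("rmsnorm_fp16", 93.12, 40, dram_pct=90.91),
    KernelProfile("vector_add_fp32", 81.80, 26, dram_pct=91.58),
    KernelProfile("matmul_tile", 22.49, 80, tensor_core_pct=45.19),
]

labels = {p.name: classify(p) for p in PROFILES}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Under these assumed thresholds the two memory-heavy kernels land in the bandwidth-bound bucket, while the matmul tile is flagged for its low occupancy. A label like this is a prompt for the next profiling question, not a conclusion.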

Evidence boundary

These counters explain the current claims; they do not make every faster p50 stable, and noisy rows still need reruns.


Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported: Fusion wins

Fused RMSNorm

Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.

fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.

Decision record

Caveated: Memory/reduction boundaries

Memory primitives

Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.

The Triton vector_add profile shows 91.58% DRAM throughput, which points to a bandwidth-bound kernel rather than an obvious block-size fix.

Decision record

Caveated: Memory/reduction boundaries

Reduction strategy

Keep reduction variants as an active track, but do not claim a PyTorch win yet.

Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.

Decision record

Caveated: Matmul/Tensor Core gaps

Matmul tile sweep

Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.

Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.

Decision record
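The size of the matmul gap can be put as a fraction of the baseline. The short calculation below is a sketch using only the throughput figures quoted for the tile sweep. The helper name `fraction_of_baseline` is hypothetical, and treating "around 30-31 TFLOP/s" as the closed range 30.0 to 31.0 is an assumption.

```python
# Figures quoted in the matmul tile sweep summary.
TRITON_BEST_TFLOPS = 25.74
CUBLAS_RANGE_TFLOPS = (30.0, 31.0)  # assumed bounds for "around 30-31"


def fraction_of_baseline(candidate: float, baseline: float) -> float:
    """Return candidate throughput as a fraction of baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


# The pessimistic bound compares against the fastest baseline figure,
# the optimistic bound against the slowest.
low = fraction_of_baseline(TRITON_BEST_TFLOPS, max(CUBLAS_RANGE_TFLOPS))
high = fraction_of_baseline(TRITON_BEST_TFLOPS, min(CUBLAS_RANGE_TFLOPS))

print(f"Best Triton tile reaches {low:.1%} to {high:.1%} of PyTorch/cuBLAS")
```

On these figures the best Triton tile sits at roughly 83% to 86% of the PyTorch/cuBLAS throughput, which is consistent with calling it an active track rather than a win.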

Measured: Matmul/Tensor Core gaps

Profiler-backed proof

Use counters to explain benchmark results instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.

Decision record

Run shape

Focused profiler targets cover vector_add, reduction_sum, RMSNorm, and tiled matmul captures.

Cases

4 cases

vector-add-float32-profile

memory primitive profiler target

vector_add, float32

reduction-sum-float32-profile

reduction strategy profiler target

reduction_sum, float32

rmsnorm-float16-profile

largest fusion win profiler target

RMSNorm, float16

matmul-float16-profile

Tensor Core tiling profiler target

matmul, float16

Run profiles

2 profiles

ncu-full

Nsight Compute full counter set

compact-report

repo-committable markdown summary

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Memory system

achieved memory throughput, load/store efficiency, cache behavior

Execution

occupancy, registers per thread, shared memory per block

Compute path

Tensor Core utilization, instruction mix, eligible warps
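To show how a full counter capture might be reduced to the metric groups above, here is a minimal sketch that folds a long-format CSV (one row per kernel and metric) into one compact record per kernel. It is not the project's `nsight-summary` implementation. The column names, the metric identifiers in `LABELS`, the kernel name, and the helper `summarize` are all assumptions; check them against the actual export before relying on them. The sample values are the RMSNorm counters from the selected report.

```python
import csv
import io
from collections import defaultdict

# Assumed export layout: one row per (kernel, metric) pair.
SAMPLE_CSV = '''"Kernel Name","Metric Name","Metric Unit","Metric Value"
"rmsnorm_fp16_kernel","dram__throughput.avg.pct_of_peak_sustained_elapsed","%","90.91"
"rmsnorm_fp16_kernel","sm__warps_active.avg.pct_of_peak_sustained_active","%","93.12"
"rmsnorm_fp16_kernel","launch__registers_per_thread","register/thread","40"
"rmsnorm_fp16_kernel","some__other_metric","","1"
'''

# Assumed mapping from raw metric identifiers to report labels.
LABELS = {
    "dram__throughput.avg.pct_of_peak_sustained_elapsed": "dram_pct",
    "sm__warps_active.avg.pct_of_peak_sustained_active": "occupancy_pct",
    "launch__registers_per_thread": "registers_per_thread",
}


def summarize(csv_text: str) -> dict:
    """Keep only the labelled metrics, grouped by kernel name."""
    summary = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = LABELS.get(row["Metric Name"])
        if label is None:
            continue  # metric is outside the compact report
        # Strip thousands separators before converting to a number.
        value = float(row["Metric Value"].replace(",", ""))
        summary[row["Kernel Name"]][label] = value
    return dict(summary)


summary = summarize(SAMPLE_CSV)
for kernel, metrics in summary.items():
    print(kernel, metrics)
```

Keeping the label mapping as data means the compact report can grow to cover the compute-path metrics without changing the parsing loop.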

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run nsight-summary --input profiling/nsight_compute/<capture>.csv --output profiling/reports/<capture>.md

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling
