Tony Lee

Profiler Validation

Do Nsight Compute counters confirm or challenge the benchmark interpretation for memory, fusion, reduction, and matmul kernels?

Project

CUDA Kernel Lab

Focus

Profiler evidence

Run shape

4 cases / 2 profiles

Evidence

Selected report: Nsight counters

Why it matters

Tracks compact profiler summaries that explain why the memory primitives trail PyTorch, why the RMSNorm win is credible, and why matmul still needs tuning.

Benchmark numbers show what moved; profiler counters are what make the optimization explanation credible.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Nsight counters

Latest reports: 2026-05-21

Profiler counters turn benchmark results into optimization calls

Compact Nsight summaries now cover memory, reduction, RMSNorm, and matmul targets from the strategy run.

RMSNorm fp16

90.91% DRAM

93.12% occupancy, 40 registers/thread

Vector add fp32

91.58% DRAM

81.80% occupancy, 26 registers/thread

Matmul tile

45.19% Tensor Core

22.49% occupancy, 80 registers/thread

Evidence boundary

These counters explain the current claims; they do not make every faster p50 stable, and noisy rows still need reruns.
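As a rough illustration of how the three summaries above turn into optimization calls, the sketch below encodes the reported counters and applies simple cut-offs. The thresholds, the `KernelCounters` type, and the `classify` helper are hypothetical and chosen only for this example; they are not part of the project's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KernelCounters:
    name: str
    occupancy_pct: float
    registers_per_thread: int
    dram_pct: Optional[float] = None         # None when the summary does not report it
    tensor_core_pct: Optional[float] = None  # None when the summary does not report it


# Hypothetical cut-offs, for illustration only.
DRAM_BOUND_PCT = 80.0
LOW_TENSOR_CORE_PCT = 60.0
LOW_OCCUPANCY_PCT = 50.0


def classify(k: KernelCounters) -> str:
    """Map a compact counter summary to a coarse bottleneck label."""
    if k.dram_pct is not None and k.dram_pct >= DRAM_BOUND_PCT:
        return "memory-bound"
    if k.tensor_core_pct is not None and k.tensor_core_pct < LOW_TENSOR_CORE_PCT:
        if k.occupancy_pct < LOW_OCCUPANCY_PCT:
            return "compute path underused, low occupancy"
        return "compute path underused"
    return "unclear"


# Counter values copied from the report summaries above.
KERNELS = [
    KernelCounters("RMSNorm fp16", 93.12, 40, dram_pct=90.91),
    KernelCounters("Vector add fp32", 81.80, 26, dram_pct=91.58),
    KernelCounters("Matmul tile", 22.49, 80, tensor_core_pct=45.19),
]

for k in KERNELS:
    print(f"{k.name}: {classify(k)}")
```

Under these assumed cut-offs, both streaming kernels land in the memory-bound bucket, while the matmul tile is flagged for low Tensor Core use together with low occupancy, which matches the reading given in the report.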

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Supported

Fusion wins

Fused RMSNorm

Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.

fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.

Decision record
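The speedups quoted for fused RMSNorm are ratios, and a minimal sketch of how such a figure could be computed follows. It assumes the speedup is the PyTorch p50 latency divided by the fused-kernel p50 latency; the report does not state the exact formula, and the sample latencies below are hypothetical, not measurements from the run.

```python
from statistics import median


def p50(samples_ms):
    """Median latency of a list of timing samples, in milliseconds."""
    return median(samples_ms)


def speedup(baseline_ms, candidate_ms):
    """Ratio of baseline p50 to candidate p50; above 1.0 means the candidate is faster."""
    return p50(baseline_ms) / p50(candidate_ms)


# Hypothetical latency samples in milliseconds, not data from the experiment.
pytorch_ms = [1.20, 1.18, 1.25, 1.19, 1.22]
fused_ms = [0.21, 0.20, 0.22, 0.20, 0.21]

print(f"{speedup(pytorch_ms, fused_ms):.3f}x")
```

Using the median keeps a single slow iteration from moving the headline number, but a noisy row can still shift the p50 between runs, which is why the report asks for reruns before treating a faster p50 as stable.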
Caveated

Memory/reduction boundaries

Memory primitives

Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.

The Triton vector_add profile shows 91.58% DRAM throughput, so the kernel is already close to the memory-bandwidth limit and the counters point to no obvious block-size fix.

Decision record
Caveated

Memory/reduction boundaries

Reduction strategy

Keep reduction variants as an active track, but do not claim a PyTorch win yet.

Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.

Decision record
Caveated

Matmul/Tensor Core gaps

Matmul tile sweep

Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.

Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.

Decision record
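To put the matmul gap in proportion, the short calculation below expresses the best Triton tile as a fraction of the PyTorch/cuBLAS range quoted in this card. The `matmul_tflops` helper is a hypothetical addition that shows the standard 2·M·N·K operation count for a dense matmul; the report does not list the matrix shape or timings, so it is not applied to real data here.

```python
# Throughput figures quoted in the matmul tile sweep card.
triton_tflops = 25.74
cublas_range_tflops = (30.0, 31.0)


def fraction_of_baseline(candidate, baseline):
    """Candidate throughput as a fraction of the baseline throughput."""
    return candidate / baseline


low = fraction_of_baseline(triton_tflops, cublas_range_tflops[1])
high = fraction_of_baseline(triton_tflops, cublas_range_tflops[0])
print(f"Triton reaches {low:.1%} to {high:.1%} of the cuBLAS range")


def matmul_tflops(m, n, k, seconds):
    """TFLOP/s for an (m x k) @ (k x n) matmul, counting 2*m*n*k operations."""
    return 2 * m * n * k / seconds / 1e12
```

The best tile therefore sits at roughly 83% to 86% of the baseline, a gap large enough to justify keeping tiling as an active track rather than claiming a win.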
Measured

Matmul/Tensor Core gaps

Profiler-backed proof

Use counters to explain benchmark results instead of treating latency alone as the answer.

RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.

Decision record

Run shape

Focused profiler targets cover vector_add, reduction_sum, RMSNorm, and tiled matmul captures.

Cases

4 cases

vector-add-float32-profile

memory primitive profiler target

vector_add / float32

reduction-sum-float32-profile

reduction strategy profiler target

reduction_sum / float32

rmsnorm-float16-profile

largest fusion win profiler target

RMSNorm / float16

matmul-float16-profile

Tensor Core tiling profiler target

matmul / float16

Run profiles

2 profiles

ncu-full

Nsight Compute full counter set

compact-report

repo-committable markdown summary

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Memory system

achieved memory throughput, load/store efficiency, cache behavior

Execution

occupancy, registers per thread, shared memory per block

Compute path

Tensor Core utilization, instruction mix, eligible warps

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run nsight-summary --input profiling/nsight_compute/<capture>.csv --output profiling/reports/<capture>.md
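For readers who want a feel for what a compact render step might do, here is a minimal sketch that filters a long-format counter CSV down to a small markdown table. It is not the `nsight-summary` implementation: the `render_summary` function, the assumed column names (`Kernel Name`, `Metric Name`, `Metric Unit`, `Metric Value`), and the metric labels are assumptions for illustration, and the sample rows simply reuse the vector_add figures from this report.

```python
import csv
import io


def render_summary(csv_text, wanted):
    """Render the selected metrics from a long-format counter CSV as a markdown table."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = ["| Kernel | Metric | Value |", "| --- | --- | --- |"]
    for row in rows:
        if row["Metric Name"] in wanted:
            unit = row.get("Metric Unit", "")
            value = f'{row["Metric Value"]} {unit}'.strip()
            lines.append(f'| {row["Kernel Name"]} | {row["Metric Name"]} | {value} |')
    return "\n".join(lines)


# Assumed CSV layout; values are the vector_add counters quoted in this report.
SAMPLE_CSV = "\n".join([
    '"Kernel Name","Metric Name","Metric Unit","Metric Value"',
    '"vector_add","DRAM Throughput","%","91.58"',
    '"vector_add","Achieved Occupancy","%","81.80"',
    '"vector_add","Registers Per Thread","register/thread","26"',
])

print(render_summary(SAMPLE_CSV, {"DRAM Throughput", "Registers Per Thread"}))
```

Keeping only a handful of named metrics is what makes the resulting summary small enough to commit alongside the code.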

Example live command

./scripts/benchmark --run-id <run-id> --with-profiling