Profiler Validation
Do Nsight Compute counters confirm or challenge the benchmark interpretation for memory, fusion, reduction, and matmul kernels?
Project
CUDA Kernel Lab
Focus
Profiler evidence
Run shape
4 cases / 2 profiles
Evidence
Selected report: Nsight counters
Why it matters
This experiment tracks compact profiler summaries that explain why the memory primitives trail PyTorch, why the RMSNorm speedup is credible, and why matmul still needs tuning.
Benchmark numbers show what moved; profiler counters are what make the optimization explanation credible.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Profiler counters turn benchmark results into optimization calls
Compact Nsight summaries now cover memory, reduction, RMSNorm, and matmul targets from the strategy run.
RMSNorm fp16
90.91% DRAM
93.12% occupancy, 40 registers/thread
Vector add fp32
91.58% DRAM
81.80% occupancy, 26 registers/thread
Matmul tile
45.19% Tensor Core
22.49% occupancy, 80 registers/thread
Evidence boundary
These counters explain the current claims; they do not make every faster p50 stable, and noisy rows still need reruns.
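One way to read the three summaries above is as a rough bottleneck triage. The sketch below is only an illustration, not the project's logic: the `KernelCounters` type, the `triage` function, and the 80% and 50% thresholds are all assumptions made for this example. Only the counter values come from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelCounters:
    """Hypothetical container for one compact profiler summary."""
    name: str
    occupancy_pct: float
    registers_per_thread: int
    dram_pct: Optional[float] = None         # DRAM throughput, % of peak
    tensor_core_pct: Optional[float] = None  # Tensor Core utilization, %


# Assumed cut-offs for illustration only; they are not taken from the report.
SATURATED_PCT = 80.0
LOW_OCCUPANCY_PCT = 50.0


def triage(k: KernelCounters) -> str:
    """Return a coarse label for where the kernel appears to be limited."""
    if k.dram_pct is not None and k.dram_pct >= SATURATED_PCT:
        return "memory-bound: near the DRAM bandwidth ceiling"
    if k.tensor_core_pct is not None and k.tensor_core_pct < SATURATED_PCT:
        if k.occupancy_pct < LOW_OCCUPANCY_PCT:
            return "compute path underused: low occupancy, check register pressure"
        return "compute path underused"
    return "inconclusive: collect more counters"


# Counter values as listed in the report summaries.
SUMMARIES = [
    KernelCounters("RMSNorm fp16", 93.12, 40, dram_pct=90.91),
    KernelCounters("Vector add fp32", 81.80, 26, dram_pct=91.58),
    KernelCounters("Matmul tile", 22.49, 80, tensor_core_pct=45.19),
]

if __name__ == "__main__":
    for k in SUMMARIES:
        print(f"{k.name}: {triage(k)}")
```

Under these assumed thresholds, both DRAM-heavy kernels fall into the memory-bound branch and the matmul tile falls into the low-occupancy branch. That is a prompt for where to look next, not a diagnosis, which matches the evidence boundary stated above.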
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Fused RMSNorm
Use fusion for RMSNorm because it removes enough framework and intermediate-tensor overhead to beat PyTorch across the measured shape sweep.
fp16 reached 5.901x at 4096x8192; the 4096x4096 rerun reached 5.599x.
Decision record
Memory primitives
Do not broaden simple launch sweeps before profiler counters explain the PyTorch lead.
The Triton vector_add profile shows 91.58% DRAM throughput, which points to a kernel already near the bandwidth limit rather than to an obvious block-size fix.
Decision record
Reduction strategy
Keep reduction variants as an active track, but do not claim a PyTorch win yet.
Both Triton reduction variants passed correctness and streamed well, but still trailed the PyTorch baseline.
Decision record
Matmul tile sweep
Treat Triton tiling as an active optimization track, not a portfolio win over cuBLAS yet.
Best Triton tile reached 25.74 TFLOP/s; PyTorch/cuBLAS stayed around 30-31 TFLOP/s.
Decision record
Profiler-backed proof
Use counters to explain benchmark results instead of treating latency alone as the answer.
RMSNorm profile shows 90.91% DRAM throughput and 93.12% occupancy; profiled matmul shows 45.19% Tensor Core utilization.
Decision record
Run shape
Focused profiler targets cover vector_add, reduction_sum, RMSNorm, and tiled matmul captures.
Cases
4 cases
vector-add-float32-profile
memory primitive profiler target
reduction-sum-float32-profile
reduction strategy profiler target
rmsnorm-float16-profile
largest fusion win profiler target
matmul-float16-profile
Tensor Core tiling profiler target
Run profiles
2 profiles
ncu-full
Nsight Compute full counter set
compact-report
repo-committable markdown summary
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Memory system
Execution
Compute path
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run nsight-summary --input profiling/nsight_compute/<capture>.csv --output profiling/reports/<capture>.md
Example live command
./scripts/benchmark --run-id <run-id> --with-profiling
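To illustrate what a local render step of this kind might do, here is a minimal sketch that turns a long-format counter CSV into a markdown table. It is not the `nsight-summary` implementation. The column names (`Kernel Name`, `Metric Name`, `Metric Unit`, `Metric Value`) and the metric labels are assumed to resemble a typical Nsight Compute CSV export, and the `rmsnorm_kernel` name and `render_markdown` helper are hypothetical. The three sample values are the RMSNorm fp16 figures reported earlier.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in long format: one row per kernel/metric pair.
SAMPLE_CSV = '''"Kernel Name","Metric Name","Metric Unit","Metric Value"
"rmsnorm_kernel","DRAM Throughput","%","90.91"
"rmsnorm_kernel","Achieved Occupancy","%","93.12"
"rmsnorm_kernel","Registers Per Thread","register/thread","40"
'''


def render_markdown(csv_text: str) -> str:
    """Group metric rows by kernel and emit one markdown table per kernel."""
    by_kernel = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_kernel[row["Kernel Name"]].append(row)

    lines = []
    for kernel, rows in by_kernel.items():
        lines.append(f"## {kernel}")
        lines.append("")
        lines.append("| Metric | Value | Unit |")
        lines.append("| --- | --- | --- |")
        for r in rows:
            lines.append(
                f"| {r['Metric Name']} | {r['Metric Value']} | {r['Metric Unit']} |"
            )
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_markdown(SAMPLE_CSV))
```

Keeping the summary to a few rows per kernel is what makes the report small enough to commit alongside the benchmark results. A real capture would carry many more metrics, so a filter on the metric names of interest would likely be needed.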