
CUDA Kernel Lab Experiments

Project-linked experiments that turn GPU serving and kernel questions into evidence-backed decisions.

16 experiments · 2 projects · Run-ready · 5 supported · 9 selected · 1 rejected · 0 pending · 1 blocked

Rows show the current proof, focus area, and decisions that still need stronger evidence.

Experiment

Kernel memory traffic

Memory Primitive Bandwidth

Purpose

PyTorch still wins simple memory traffic.

Focus
GB/s · p50 latency · Correctness
Status
Run ready · Selected report · Memory bandwidth
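
As a rough illustration of how the GB/s and p50 latency focus metrics above relate, here is a minimal sketch in plain Python. The function name, the element count, and the latency samples are all hypothetical and are not taken from the lab's harness or results; it assumes GB means 1e9 bytes and that a copy kernel reads and writes each byte once.

```python
import statistics


def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective memory bandwidth in GB/s, assuming 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_read + bytes_written) / seconds / 1e9


# Made-up per-iteration latencies in milliseconds (illustrative only).
latencies_ms = [0.52, 0.50, 0.51, 0.55, 0.50]

# p50 is the median of the timing samples.
p50_ms = statistics.median(latencies_ms)

# Hypothetical copy of 2**28 float32 elements: each is read once, written once.
n_bytes = (2 ** 28) * 4
gbs = effective_bandwidth_gbs(n_bytes, n_bytes, p50_ms / 1e3)

print(f"p50 = {p50_ms} ms, effective bandwidth = {gbs:.1f} GB/s")
```

Reporting bandwidth at the p50 latency rather than the mean keeps a few slow outlier iterations from dragging the figure down.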
Experiment

Kernel reductions

Reduction Strategy Comparison

Purpose

Two Triton reduction strategies trail PyTorch.

Focus
Reduction latency · GB/s · Strategy tradeoff
Status
Run ready · Selected report · Reduction strategy
Experiment

Kernel fusion

Normalization Fusion

Purpose

RMSNorm fusion stays strong across the shape sweep.

Focus
Speedup · DRAM throughput · Occupancy
Status
Run ready · Supported · RMSNorm fusion
Experiment

Kernel fusion

SwiGLU Elementwise Fusion

Purpose

Fused SwiGLU is a clean 3x-class win.

Focus
Speedup · GB/s · p95 latency
Status
Run ready · Supported · SwiGLU fusion
Experiment

Kernel fusion

Row Softmax Fusion

Purpose

Current Triton softmax trails PyTorch.

Focus
p50 latency · Noise · GB/s
Status
Run ready · Rejected · Softmax win claim
Experiment

Kernel tiling

Matmul Tile Sweep

Purpose

Best Triton tile is measured but still below cuBLAS.

Focus
TFLOP/s · Tensor Core use · Occupancy
Status
Run ready · Selected report · Tile sweep
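
For readers unfamiliar with the TFLOP/s metric, the sketch below shows the usual way to derive it for a dense matmul and to express one kernel as a percentage of a baseline such as cuBLAS. It assumes the conventional count of 2·M·N·K floating-point operations for an (M×K)·(K×N) product; the function names and the shape and timing values are hypothetical, not measurements from this lab.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an (m x k) @ (k x n) matmul.

    Uses the conventional 2*m*n*k FLOP count (one multiply and one add
    per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_baseline(candidate_tflops, baseline_tflops):
    """Candidate throughput as a percentage of the baseline throughput."""
    return 100.0 * candidate_tflops / baseline_tflops


# Hypothetical timings for a 4096^3 matmul (illustrative only).
candidate = matmul_tflops(4096, 4096, 4096, seconds=1.25e-3)
baseline = matmul_tflops(4096, 4096, 4096, seconds=1.00e-3)

print(f"candidate: {candidate:.1f} TFLOP/s")
print(f"baseline:  {baseline:.1f} TFLOP/s")
print(f"candidate is {percent_of_baseline(candidate, baseline):.0f}% of baseline")
```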
Experiment

Tensor Core matmul

H200 Matmul Autotune

Purpose

Standard tiled-dot closes most of the H200 gap, but persistent waves are not a win yet.

Focus
TFLOP/s · Triton/Torch % · Persistent waves
Status
Run ready · Selected report · H200 matmul gap
Experiment

Decode scheduling

Decode Step Graph Replay

Purpose

Resident-KV graph replay is now measured and caveated.

Focus
p50 latency · p95 tail · Padding · Correctness
Status
Run ready · Selected report · Round 12 decode
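
The p50 and p95 tail figures named above can be computed from raw per-step timings in several ways. The sketch below uses the nearest-rank percentile definition as one reasonable choice; it is an assumption, not necessarily the method the lab's reports use, and the helper name and sample values are hypothetical.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Made-up decode-step latencies in microseconds (illustrative only).
step_latencies_us = list(range(1, 21))  # 1, 2, ..., 20

p50 = nearest_rank_percentile(step_latencies_us, 50)
p95 = nearest_rank_percentile(step_latencies_us, 95)
print(f"p50 = {p50} us, p95 = {p95} us")
```

Nearest-rank always returns an observed sample, so a reported tail value corresponds to a latency that actually occurred rather than an interpolated one.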
Experiment

Profiler evidence

Profiler Validation

Purpose

Profiler counters now explain the strongest win and active gaps.

Focus
DRAM throughput · Occupancy · Tensor Core use
Status
Run ready · Selected report · Nsight counters