CUDA Kernel Lab Experiments

Project-linked experiments that turn GPU serving and kernel questions into evidence-backed decisions.

16 experiments · 2 projects · Run-ready · 5 supported · 9 selected · 1 rejected · 0 pending · 1 blocked

Catalog ready

Rows show the current proof, focus area, and decisions that still need stronger evidence.

Experiment

Kernel memory traffic

Memory Primitive Bandwidth

Purpose

PyTorch still wins simple memory traffic.

Focus

GB/s · p50 latency · Correctness

Status

Run ready · Selected report · Memory bandwidth

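The GB/s and p50 latency columns can be derived from raw timings. A minimal sketch, assuming effective bandwidth counts every byte read plus every byte written and that p50 is the median over repeated launches. The helper names and the timings below are hypothetical and are not results from this experiment.

```python
import statistics

def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel launch."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_read + bytes_written) / seconds / 1e9

def p50(samples_ms):
    """Median latency; less sensitive to a single slow launch than the mean."""
    return statistics.median(samples_ms)

# Hypothetical copy of 2**26 float32 elements: one read and one write each.
n = 2**26
latencies_ms = [0.92, 0.90, 0.91, 1.40, 0.90]  # made-up timings, illustration only
median_s = p50(latencies_ms) / 1e3
gbs = effective_bandwidth_gbs(4 * n, 4 * n, median_s)
print(f"p50 = {p50(latencies_ms):.2f} ms, effective bandwidth = {gbs:.1f} GB/s")
```

Reporting bandwidth from the median rather than the fastest run keeps the figure comparable across kernels that differ in launch noise.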

Experiment

Kernel reductions

Reduction Strategy Comparison

Purpose

Two Triton reduction strategies trail PyTorch.

Focus

Reduction latency · GB/s · Strategy tradeoff

Status

Run ready · Selected report · Reduction strategy


Experiment

Kernel fusion

Normalization Fusion

Purpose

RMSNorm fusion stays strong across the shape sweep.

Focus

Speedup · DRAM throughput · Occupancy

Status

Run ready · Supported · RMSNorm fusion

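A fused kernel only counts as a supported win if it matches an unfused reference. The sketch below is a plain-Python RMSNorm reference of the kind a correctness check could compare against, assuming the common definition y = x / sqrt(mean(x^2) + eps) * weight. The eps default and the function names are assumptions, not taken from the experiment.

```python
import math

def rmsnorm_reference(row, weight, eps=1e-6):
    """Normalize one row by its root mean square, then scale per element."""
    if len(row) != len(weight):
        raise ValueError("row and weight must have the same length")
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]

def max_abs_error(actual, expected):
    """Largest elementwise difference, compared against a chosen tolerance."""
    return max(abs(a - e) for a, e in zip(actual, expected))

out = rmsnorm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0)
mean_sq_out = sum(y * y for y in out) / len(out)
print(out, mean_sq_out)  # with unit weights and eps=0 the output has unit mean square
```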

Experiment

Kernel fusion

SwiGLU Elementwise Fusion

Purpose

Fused SwiGLU is a clean 3x-class win.

Focus

Speedup · GB/s · p95 latency

Status

Run ready · Supported · SwiGLU fusion

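For readers unfamiliar with the operation, the elementwise part of SwiGLU can be sketched in plain Python. This assumes the usual form silu(gate) * up applied to two equally sized vectors; the operand names, the helper names, and the way speedup is computed (baseline latency divided by fused latency) are assumptions, since the page does not spell them out.

```python
import math

def silu(x):
    """SiLU, x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def swiglu_reference(gate, up):
    """Elementwise silu(gate) * up, usable as a correctness baseline."""
    if len(gate) != len(up):
        raise ValueError("gate and up must have the same length")
    return [silu(g) * u for g, u in zip(gate, up)]

def speedup(baseline_ms, candidate_ms):
    """Ratio above 1.0 means the candidate kernel is faster."""
    return baseline_ms / candidate_ms

print(swiglu_reference([0.0, 1.0, -1.0], [5.0, 2.0, 2.0]))
```

A fused kernel computes the same values in one pass over the inputs, so any difference from this reference beyond floating-point tolerance points to a kernel bug rather than a performance tradeoff.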

Experiment

Kernel fusion

Row Softmax Fusion

Purpose

Current Triton softmax trails PyTorch.

Focus

p50 latency · Noise · GB/s

Status

Run ready · Rejected · Softmax win claim

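As a point of reference for what a row softmax kernel computes, here is a numerically stable version in plain Python. It is a sketch of the standard max-subtraction formulation, not the Triton kernel measured here, and the function name is hypothetical.

```python
import math

def row_softmax_reference(row):
    """Softmax over one row, subtracting the row max so exp() cannot overflow."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = row_softmax_reference([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The row is read for the max, for the sum, and for the final divide, which is why fusing these passes into one kernel is the usual optimization target.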

Experiment

Kernel tiling

Matmul Tile Sweep

Purpose

Best Triton tile is measured but still below cuBLAS.

Focus

TFLOP/s · Tensor Core use · Occupancy

Status

Run ready · Selected report · Tile sweep

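The TFLOP/s figure in a tile sweep follows from the problem shape and the measured time. A minimal sketch, assuming the conventional count of 2*M*N*K floating-point operations for an (M x K) by (K x N) matmul and one program per output tile. The function names, shapes, and tile sizes are illustrative assumptions.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved TFLOP/s, counting one multiply and one add per inner-product term."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2 * m * n * k / seconds / 1e12

def ceil_div(a, b):
    return (a + b - 1) // b

def num_output_tiles(m, n, block_m, block_n):
    """Number of BLOCK_M x BLOCK_N tiles needed to cover the M x N output."""
    return ceil_div(m, block_m) * ceil_div(n, block_n)

# Hypothetical sweep point: 4096^3 matmul with 128 x 256 tiles, 1 ms per launch.
print(matmul_tflops(4096, 4096, 4096, 1e-3))
print(num_output_tiles(4096, 4096, 128, 256))
```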

Experiment

Tensor Core matmul

H200 Matmul Autotune

Purpose

Standard tiled-dot closes most of the H200 gap, but persistent waves are not a win yet.

Focus

TFLOP/s · Triton/Torch % · Persistent waves

Status

Run ready · Selected report · H200 matmul gap

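Two of the focus metrics can be illustrated with simple arithmetic. The sketch below assumes Triton/Torch % means Triton throughput as a percentage of the PyTorch throughput on the same shape, and models persistent scheduling as output tiles distributed over SMs in whole waves. The SM count of 132 is an assumption about the H200, and all names and numbers are hypothetical rather than measured.

```python
import math

H200_SM_COUNT = 132  # assumed SM count; verify against the actual device properties

def triton_vs_torch_pct(triton_tflops, torch_tflops):
    """Triton throughput as a percentage of the PyTorch baseline."""
    return 100.0 * triton_tflops / torch_tflops

def wave_count(num_tiles, num_sms=H200_SM_COUNT):
    """Waves needed when each SM processes one tile per wave."""
    return math.ceil(num_tiles / num_sms)

def wave_efficiency(num_tiles, num_sms=H200_SM_COUNT):
    """Fraction of SM slots doing useful work; below 1.0 when the last wave is partial."""
    return num_tiles / (wave_count(num_tiles, num_sms) * num_sms)

print(wave_count(512), round(wave_efficiency(512), 4))
print(triton_vs_torch_pct(90.0, 100.0))
```

Under this model a partial last wave leaves some SMs idle, which is one reason a persistent kernel may not win on every shape.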

Experiment

Decode scheduling

Decode Step Graph Replay

Purpose

Resident-KV graph replay is now measured and caveated.

Focus

p50 latency · p95 tail · Padding · Correctness

Status

Run ready · Selected report · Round 12 decode

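Graph replay generally requires fixed shapes, so a decode batch is typically padded up to a size that was captured ahead of time. The sketch below shows one way to pick that size and to quantify the padding cost and the p95 tail. It assumes a sorted list of captured batch sizes and a nearest-rank percentile; the bucket list and helper names are hypothetical and do not describe this experiment's actual configuration.

```python
import bisect

def pad_to_bucket(batch_size, buckets):
    """Smallest captured bucket >= batch_size; buckets must be sorted ascending."""
    i = bisect.bisect_left(buckets, batch_size)
    if i == len(buckets):
        raise ValueError("batch_size exceeds the largest captured bucket")
    return buckets[i]

def padding_waste(batch_size, bucket):
    """Fraction of the replayed batch occupied by padding rows."""
    return (bucket - batch_size) / bucket

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

buckets = [1, 2, 4, 8, 16]  # hypothetical captured batch sizes
bucket = pad_to_bucket(5, buckets)
print(bucket, padding_waste(5, bucket))
print(percentile_nearest_rank(range(1, 101), 95))
```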

Experiment

Profiler evidence

Profiler Validation

Purpose

Profiler counters now explain the strongest win and active gaps.

Focus

DRAM throughput · Occupancy · Tensor Core use

Status

Run ready · Selected report · Nsight counters
