Tony Lee

Decode Step Graph Replay

How far can resident-KV same-stream piecewise CUDA Graph replay reduce dynamic decode-step latency before a custom attention kernel is needed?

Project

CUDA Kernel Lab

Focus

Decode scheduling

Run shape

3 cases / 4 profiles

Evidence

Selected report: Round 12 decode

Why it matters

Tracks a synthetic decode step where fused pre/post regions replay as CUDA Graphs around SDPA attention with head-major resident KV views.

Decode replay connects kernel optimization to serving mechanics: the question shifts from one custom kernel to launch overhead, dynamic buckets, resident cache layout, and tail latency.

Selected report

Result evidence

Selected live-cluster runs, readiness state, and evidence boundary are shown together.

Selected report: Round 12 decode

Latest reports: 2026-05-22

Resident-KV graph replay defines the current decode upper bound

Round 12 benchmarks resident head-major KV views, same-stream piecewise CUDA Graph replay, eager post-add, dense active-batch buckets, and hot-loop timing with orchestration probes off.

Fixed-shape replay

0.1375 ms

same-stream piecewise graph p50 at batch=1, seq_len=2048

Dynamic replay

0.155-0.158 ms

dense-bucket same-stream p50 across three tail seeds

Tail

0.228-0.232 ms

p95 with zero padding and 27/27 correctness checks passing

Attention baseline

0.2273 ms

earlier PyTorch contiguous-KV p50 target for a future custom kernel

Decode replay progression

Path: naive eager

Call: baseline

p50: 0.3395 ms

p95: 0.3668 ms

Boundary: decomposed PyTorch

Path: fused graph

Call: supported

p50: 0.1482 ms

p95: 0.1589 ms

Boundary: fixed shape

Path: same-stream piecewise graph

Call: measured

p50: 0.1375 ms

p95: 0.1681 ms

Boundary: fixed shape

Path: dynamic dense buckets

Call: caveated

p50: 0.155-0.158 ms

p95: 0.228-0.232 ms

Boundary: resident-KV upper bound
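As a reading aid, the progression can be reduced to two derived numbers per path: the p50 speedup over naive eager and the p95/p50 tail ratio. The sketch below is only arithmetic on the figures reported here. For the dynamic path it uses the rounded values quoted in the decision summary (about 0.156 ms p50 and 0.230 ms p95) rather than the full ranges, which is a simplifying assumption. The names `PATHS` and `summarize` are illustrative and not part of the benchmark code.

```python
# (p50_ms, p95_ms) per path, copied from the progression above.
# The dynamic row uses the rounded figures from the decision summary
# (about 0.156 / 0.230 ms) instead of the reported ranges: an assumption.
PATHS = {
    "naive eager": (0.3395, 0.3668),
    "fused graph": (0.1482, 0.1589),
    "same-stream piecewise graph": (0.1375, 0.1681),
    "dynamic dense buckets": (0.156, 0.230),
}


def summarize(paths, baseline="naive eager"):
    """Return p50 speedup over the baseline and p95/p50 ratio for each path."""
    base_p50, _ = paths[baseline]
    return {
        name: {"speedup_p50": base_p50 / p50, "tail_ratio": p95 / p50}
        for name, (p50, p95) in paths.items()
    }


if __name__ == "__main__":
    for name, row in summarize(PATHS).items():
        print(f"{name:30s} speedup={row['speedup_p50']:.2f}x  "
              f"p95/p50={row['tail_ratio']:.2f}")
```

Under these inputs the fixed-shape piecewise path is roughly 2.5x faster than naive eager at p50, while the dynamic path has the widest p95/p50 gap (about 1.47), which matches the note that the tail still needs follow-up.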

Evidence boundary

Read this as a synthetic resident-KV upper bound, not an end-to-end serving result or a custom attention-kernel win. The p95 tail still deserves timing-probe and profiler follow-up.

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Caveated: Decode replay caveats

Decode graph replay

Use resident-KV same-stream piecewise CUDA Graph replay as a measured synthetic upper bound, not an end-to-end serving claim.

Round 12 reached 0.1375 ms fixed-shape p50 and about 0.156 ms dynamic p50 / 0.230 ms p95.

Decision record

Run shape

Fixed-shape and dynamic decode-step replay at max_batch_size=8, seq_len=2048, heads=16, head_dim=64, hidden_dim=1024, intermediate_dim=4096.
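To give a sense of scale for the resident KV buffers implied by this run shape, the sketch below computes the footprint and element strides of a head-major cache. It assumes "head-major" means a contiguous `[batch, heads, seq_len, head_dim]` layout, a single layer, separate K and V tensors, and 2-byte float16 elements. The report does not spell out these details, and the function names are hypothetical.

```python
def kv_cache_bytes(batch, heads, seq_len, head_dim, bytes_per_elem=2):
    """Bytes for K plus V of one layer (float16 assumed: 2 bytes per element)."""
    elems_per_tensor = batch * heads * seq_len * head_dim
    return 2 * elems_per_tensor * bytes_per_elem  # factor 2: one K, one V


def head_major_strides(heads, seq_len, head_dim):
    """Element strides of a contiguous [batch, heads, seq_len, head_dim] tensor."""
    return (heads * seq_len * head_dim, seq_len * head_dim, head_dim, 1)


if __name__ == "__main__":
    total = kv_cache_bytes(batch=8, heads=16, seq_len=2048, head_dim=64)
    print(f"K+V per layer: {total / 2**20:.0f} MiB")
    print("strides:", head_major_strides(heads=16, seq_len=2048, head_dim=64))
```

With max_batch_size=8 this comes to 64 MiB per layer under the stated assumptions. In this layout each head's keys for one sequence form a single contiguous `seq_len x head_dim` block, which is the property a head-major view is typically chosen for.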

Cases

3 cases

decode-step-fixed-same-stream

fixed-shape same-stream piecewise CUDA Graph replay

0.1375 ms p50 (batch=1, seq_len=2048, float16)

decode-step-dynamic-dense-buckets

dynamic same-stream graph replay with dense active-batch buckets

0.155-0.158 ms p50, 0.228-0.232 ms p95, zero padding

decode-attention-baseline-context

earlier PyTorch contiguous-KV attention baseline kept as context

0.2273 ms p50 (seq_len=2048, heads=16, head_dim=128)
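The zero-padding result in the dynamic case follows from the bucket policy: with one captured graph per batch size from 1 to 8, every active batch maps to an exact-fit bucket. A minimal sketch of that selection logic is shown below. The power-of-two policy and the example trace are hypothetical, included only for contrast, and `pick_bucket` and `padding_waste` are illustrative names rather than functions from the benchmark.

```python
from bisect import bisect_left

DENSE_BUCKETS = [1, 2, 3, 4, 5, 6, 7, 8]  # one bucket per batch size up to max_batch_size=8
POW2_BUCKETS = [1, 2, 4, 8]               # hypothetical sparser policy, for comparison only


def pick_bucket(active_batch, buckets):
    """Smallest bucket that fits the active batch; buckets must be sorted ascending."""
    if active_batch < 1:
        raise ValueError("active_batch must be at least 1")
    i = bisect_left(buckets, active_batch)
    if i == len(buckets):
        raise ValueError(f"active_batch {active_batch} exceeds largest bucket")
    return buckets[i]


def padding_waste(trace, buckets):
    """Fraction of replayed batch rows that are padding over a trace of batch sizes."""
    replayed = [pick_bucket(b, buckets) for b in trace]
    return (sum(replayed) - sum(trace)) / sum(replayed)


if __name__ == "__main__":
    trace = [1, 3, 5, 8, 6, 2, 7, 4]  # hypothetical active-batch sizes per step
    print("dense waste:", padding_waste(trace, DENSE_BUCKETS))
    print("pow2 waste: ", round(padding_waste(trace, POW2_BUCKETS), 4))
```

The trade-off is that dense buckets require more captured graphs (eight here instead of four), so the zero-padding figure reflects a policy choice as much as a kernel property.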

Run profiles

4 profiles

naive-eager

decomposed PyTorch decode step with eager launches

fused-graph

fused pre/post work captured once and replayed as a CUDA Graph

dynamic-piecewise-graph-same-stream

same-stream dynamic piecewise graph replay around SDPA attention

torch-contiguous-kv-baseline

PyTorch contiguous-KV attention baseline retained as the custom-kernel target

Measurement focus

Metrics to capture

Metric groups show the signal this experiment needs.

Latency

p50 latency, p95 latency, p99 latency, p95/p50 noise

Scheduling

graph hit rate, bucket policy, padding waste, scheduler p95

Replay path

resident KV views, same-stream replay, eager post-add, correctness
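The latency group can be derived from a list of per-step timings. The sketch below uses the nearest-rank percentile definition and treats "p95/p50 noise" as the plain ratio of the two percentiles. Both choices are assumptions, since the report does not state which percentile method or noise definition the benchmark uses, and the function names are illustrative.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """p50/p95/p99 in ms plus the p95/p50 ratio used here as a noise indicator."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p50": p50, "p95": p95, "p99": p99, "p95_over_p50": p95 / p50}


if __name__ == "__main__":
    # Placeholder timings, not measured data.
    demo = [float(i) for i in range(1, 101)]
    print(latency_summary(demo))
```

Nearest-rank always returns an observed sample, which keeps tail values interpretable when the sample count is small. An interpolating method would give slightly different p95 and p99 values on the same data.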

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

uv run benchmark-decode-step --dynamic-trace --mode all --device cuda --dtype float16 --attention-backend sdpa-head-major --dynamic-copy-mode resident --piecewise-post-mode eager --orchestration-timing off

Example live command

./scripts/benchmark --run-id <run-id> --only-decode-step --include-decode-bucket-sweep --include-decode-tail-sweep --decode-attention-backend sdpa-head-major --decode-dynamic-copy-mode resident --decode-piecewise-post-mode eager --decode-orchestration-timing off --decode-tail-buckets '1,2,3,4,5,6,7,8'