Decode Step Graph Replay
How far can resident-KV, same-stream piecewise CUDA Graph replay reduce dynamic decode-step latency before a custom attention kernel is needed?
Project
CUDA Kernel Lab
Focus
Decode scheduling
Run shape
3 cases / 4 profiles
Evidence
Selected report: Round 12 decode
Why it matters
Tracks a synthetic decode step where fused pre/post regions replay as CUDA Graphs around SDPA attention with head-major resident KV views.
Decode replay connects kernel optimization to serving mechanics: the question shifts from one custom kernel to launch overhead, dynamic buckets, resident cache layout, and tail latency.
Selected report
Result evidence
Selected live-cluster runs, readiness state, and evidence boundary are shown together.
Resident-KV graph replay defines the current decode upper bound
Round 12 benchmarks resident head-major KV views, same-stream piecewise CUDA Graph replay, eager post-add, dense active-batch buckets, and hot-loop timing with orchestration probes off.
Fixed-shape replay
0.1375 ms
same-stream piecewise graph p50 at batch=1, seq_len=2048
Dynamic replay
0.155-0.158 ms
dense-bucket same-stream p50 across three tail seeds
Tail
0.228-0.232 ms
p95 with zero padding and 27/27 correctness checks passing
Attention baseline
0.2273 ms
earlier PyTorch contiguous-KV p50 target for a future custom kernel
Decode replay progression
Path: naive eager
Call: baseline
p50: 0.3395 ms
p95: 0.3668 ms
Boundary: decomposed PyTorch
Path: fused graph
Call: supported
p50: 0.1482 ms
p95: 0.1589 ms
Boundary: fixed shape
Path: same-stream piecewise graph
Call: measured
p50: 0.1375 ms
p95: 0.1681 ms
Boundary: fixed shape
Path: dynamic dense buckets
Call: caveated
p50: 0.155-0.158 ms
p95: 0.228-0.232 ms
Boundary: resident-KV upper bound
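The progression above is easier to compare as speedups over the naive eager path. The short sketch below only redoes that arithmetic on the reported p50 values; the midpoint used for the dynamic range is an assumption made for illustration, not a reported number.

```python
# p50 latencies in milliseconds, copied from the decode replay progression.
P50_MS = {
    "naive eager": 0.3395,
    "fused graph": 0.1482,
    "same-stream piecewise graph": 0.1375,
    # Assumption: midpoint of the reported 0.155-0.158 ms range.
    "dynamic dense buckets": (0.155 + 0.158) / 2,
}


def speedups(p50_ms, baseline="naive eager"):
    """Return each path's p50 speedup factor relative to the baseline path."""
    base = p50_ms[baseline]
    return {path: base / ms for path, ms in p50_ms.items()}


for path, factor in speedups(P50_MS).items():
    print(f"{path:30s} {factor:.2f}x")
```

On these numbers the fused graph, the fixed-shape piecewise graph, and the dynamic dense-bucket path land at roughly 2.3x, 2.5x, and 2.2x over naive eager at p50. The same caveat applies as for the raw latencies: these are synthetic decode-step ratios, not serving throughput.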
Evidence boundary
Read this as a synthetic resident-KV upper bound, not an end-to-end serving result or a custom attention-kernel win. The p95 tail still deserves timing-probe and profiler follow-up.
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Decode graph replay
Use resident-KV same-stream piecewise CUDA Graph replay as a measured synthetic upper bound, not an end-to-end serving claim.
Round 12 reached 0.1375 ms fixed-shape p50 and about 0.156 ms dynamic p50 / 0.230 ms p95.
Decision record
Run shape
Fixed-shape and dynamic decode-step replay at max_batch_size=8, seq_len=2048, heads=16, head_dim=64, hidden_dim=1024, intermediate_dim=4096.
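To give a sense of scale for "resident" KV at this shape, the sketch below estimates the K+V footprint. It assumes float16 storage (2 bytes per element, matching the dtype in the example command) and a single attention layer; the layer count is not stated in this report, so `layers` is a hypothetical parameter.

```python
def kv_cache_bytes(batch, seq_len, heads, head_dim, bytes_per_elem=2, layers=1):
    """Bytes held by K and V for a head-major (batch, heads, seq_len, head_dim) cache.

    bytes_per_elem=2 assumes float16; layers=1 is an assumption, not a reported value.
    """
    per_tensor = batch * heads * seq_len * head_dim * bytes_per_elem
    return 2 * per_tensor * layers  # factor 2 covers K and V


MIB = 1024 ** 2
for batch in (1, 8):
    size = kv_cache_bytes(batch, seq_len=2048, heads=16, head_dim=64)
    print(f"batch={batch}: {size / MIB:.0f} MiB of resident K+V per layer")
```

Note that heads * head_dim = 16 * 64 = 1024, which matches hidden_dim in the run shape.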
Cases
3 cases
decode-step-fixed-same-stream
fixed-shape same-stream piecewise CUDA Graph replay
decode-step-dynamic-dense-buckets
dynamic same-stream graph replay with dense active-batch buckets
decode-attention-baseline-context
earlier PyTorch contiguous-KV attention baseline kept as context
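The dense active-batch buckets in the dynamic case can be pictured as one captured graph per batch size from 1 to max_batch_size, which is why the tail numbers are reported with zero padding. The sketch below is an illustration under that assumption; `pick_bucket` and the power-of-two list are hypothetical names added for contrast and are not taken from the project code.

```python
import bisect


def pick_bucket(active_batch, buckets):
    """Return (bucket, padded_rows) for the smallest bucket that fits active_batch.

    `buckets` must be sorted ascending.
    """
    if active_batch < 1:
        raise ValueError("active_batch must be at least 1")
    i = bisect.bisect_left(buckets, active_batch)
    if i == len(buckets):
        raise ValueError("active_batch exceeds the largest captured bucket")
    bucket = buckets[i]
    return bucket, bucket - active_batch


DENSE = list(range(1, 9))   # one bucket per batch size, as in the tail sweep flags
SPARSE = [1, 2, 4, 8]       # hypothetical power-of-two alternative, for contrast only

for active in (3, 5, 8):
    print(active, "dense:", pick_bucket(active, DENSE), "sparse:", pick_bucket(active, SPARSE))
```

With dense buckets every active batch size maps to an exact-fit graph, at the cost of capturing and holding more graphs. A sparser bucket set would trade that memory for padded rows.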
Run profiles
4 profiles
naive-eager
decomposed PyTorch decode step with eager launches
fused-graph
fused pre/post work replayed inside CUDA Graph capture
dynamic-piecewise-graph-same-stream
same-stream dynamic piecewise graph replay around SDPA attention
torch-contiguous-kv-baseline
PyTorch contiguous-KV attention baseline retained as the custom-kernel target
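The piecewise profile can be sketched in PyTorch as two captured regions replayed on the current stream, with SDPA attention and the residual add left eager between and after them. This is a minimal illustration under stated assumptions, not the project's implementation: the projection weights, buffer names, and the reading of "eager post-add" as an uncaptured residual add are all hypothetical. It returns `None` when PyTorch or a CUDA device is unavailable.

```python
try:
    import torch
    import torch.nn.functional as F
except ImportError:  # without PyTorch the sketch degrades to a no-op
    torch = None


def piecewise_replay_demo(batch=1, seq_len=2048, heads=16, head_dim=64):
    """Return max |eager - replay| for one decode step, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    hidden, dev, dt = heads * head_dim, "cuda", torch.float16
    # Hypothetical projection weights, scaled down to keep float16 well behaved.
    w_q = torch.randn(hidden, hidden, device=dev, dtype=dt) * 0.02
    w_o = torch.randn(hidden, hidden, device=dev, dtype=dt) * 0.02
    # Resident head-major KV: (batch, heads, seq_len, head_dim), not copied per step.
    k_cache = torch.randn(batch, heads, seq_len, head_dim, device=dev, dtype=dt)
    v_cache = torch.randn_like(k_cache)
    # Static buffers: captured graphs always read and write these addresses.
    static_x = torch.randn(batch, hidden, device=dev, dtype=dt)
    static_attn = torch.zeros(batch, heads, 1, head_dim, device=dev, dtype=dt)

    def pre():   # region captured before attention
        return (static_x @ w_q).view(batch, heads, 1, head_dim)

    def post():  # region captured after attention
        return static_attn.reshape(batch, hidden) @ w_o

    # Warm up on a side stream before capture, as PyTorch's CUDA Graph notes advise.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            pre()
            post()
    torch.cuda.current_stream().wait_stream(side)

    pre_graph, post_graph = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
    with torch.cuda.graph(pre_graph):
        static_q = pre()
    with torch.cuda.graph(post_graph):
        static_proj = post()

    def step(x):
        static_x.copy_(x)                      # feed the new token's hidden state
        pre_graph.replay()                     # graph piece 1
        attn = F.scaled_dot_product_attention(static_q, k_cache, v_cache)
        static_attn.copy_(attn)                # eager attention between the pieces
        post_graph.replay()                    # graph piece 2
        return static_proj + static_x          # residual add kept eager

    x = torch.randn(batch, hidden, device=dev, dtype=dt)
    y_replay = step(x)
    # Fully eager reference for the same decode step.
    q = (x @ w_q).view(batch, heads, 1, head_dim)
    attn = F.scaled_dot_product_attention(q, k_cache, v_cache)
    y_eager = attn.reshape(batch, hidden) @ w_o + x
    torch.cuda.synchronize()
    return (y_replay - y_eager).abs().max().item()


print(piecewise_replay_demo())
```

Keeping attention outside the captured regions is what leaves room for swapping in a custom attention kernel later without recapturing the pre and post graphs.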
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Latency
Scheduling
Replay path
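For the latency group, p50 and p95 are summary statistics over per-step timings. The sketch below uses the nearest-rank definition as one reasonable choice; this report does not say which percentile method the benchmark uses, so treat the method and the `percentile` helper as assumptions.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-step latencies in milliseconds, for illustration only.
samples = [0.151, 0.154, 0.155, 0.156, 0.157, 0.158, 0.160, 0.171, 0.198, 0.231]
print("p50:", percentile(samples, 50), "ms")
print("p95:", percentile(samples, 95), "ms")
```

With few samples, p95 under nearest-rank is simply one of the largest observations, which is one reason tail figures benefit from repeated seeds.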
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
uv run benchmark-decode-step --dynamic-trace --mode all --device cuda --dtype float16 --attention-backend sdpa-head-major --dynamic-copy-mode resident --piecewise-post-mode eager --orchestration-timing off
Example live command
./scripts/benchmark --run-id <run-id> --only-decode-step --include-decode-bucket-sweep --include-decode-tail-sweep --decode-attention-backend sdpa-head-major --decode-dynamic-copy-mode resident --decode-piecewise-post-mode eager --decode-orchestration-timing off --decode-tail-buckets '1,2,3,4,5,6,7,8'