FP4 Quantization Optimization
Does SmoothQuant improve NVFP4 W4A4 accuracy recovery enough to justify its memory, latency, throughput, and cost tradeoffs?
Project
GPU Inference Decision Lab
Focus
Quantization
Run shape
2 cases / 3 profiles
Evidence
Blocked: Blackwell capacity
Why it matters
Defines a Blackwell FP4 comparison for BF16, plain NVFP4, and SmoothQuant, with latency, throughput, accuracy, memory, serving cost, and build cost tracked separately.
Quantization helps only if memory or cost gains survive accuracy, latency, throughput, and build-cost tradeoffs. This experiment keeps FP4 claims measurable.
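As a rough illustration of the memory side of that tradeoff, the sketch below estimates weight storage for BF16 versus NVFP4. It is not a measurement from this experiment. It assumes a nominal 7e9 parameters for a "7B" model and treats NVFP4 as 4-bit values plus one 8-bit scale per 16-value block; `weight_gib` and the constants are hypothetical names. KV cache, activations, and any layers a real build keeps in higher precision are ignored.

```python
# Rough weight-memory estimate for BF16 versus NVFP4 storage.
# Assumptions (not results from this experiment):
#   - a nominal 7e9 parameters for a "7B" model
#   - NVFP4 stored as 4-bit values plus one 8-bit scale per 16-value block
#   - every weight is quantized (real builds may keep some layers in BF16)

def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Return weight storage in GiB for a given average bit width."""
    return num_params * bits_per_weight / 8 / 2**30

NOMINAL_PARAMS = 7e9           # assumption: nominal parameter count
BF16_BITS = 16.0
NVFP4_BITS = 4.0 + 8.0 / 16    # 4-bit value + amortized block scale = 4.5

bf16_gib = weight_gib(NOMINAL_PARAMS, BF16_BITS)
nvfp4_gib = weight_gib(NOMINAL_PARAMS, NVFP4_BITS)
ratio = bf16_gib / nvfp4_gib

print(f"BF16  weights: {bf16_gib:.2f} GiB")
print(f"NVFP4 weights: {nvfp4_gib:.2f} GiB")
print(f"Reduction:     {ratio:.2f}x")
```

The estimate covers weights only, which is why the experiment still needs measured memory, latency, and accuracy before any claim about FP4 holds.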
Blocked
Result status
The experiment definition is ready; the live attempt stopped before producing a comparable result.
Blackwell capacity blocked
Blackwell capacity blocked the live run. Store generated Markdown, JSON, and client logs in docs/reports/ until a comparable result is selected.
Next runs to curate
Capacity retry
Retry only when p6-b200.48xlarge capacity is available or reserved in the target us-west-2 zones.
./scripts/up
Plain NVFP4 build
Produce the first quantized artifact before comparing BF16, plain NVFP4, and SmoothQuant serving behavior.
./scripts/experiment render-quantization --experiment fp4 --profile nvfp4-plain
SmoothQuant comparison
Measure whether SmoothQuant improves NVFP4 accuracy recovery enough to justify its extra build cost.
./scripts/experiment render-accuracy --experiment fp4 --profile nvfp4-smoothquant
Decision links
Supports decisions
Decision records own the project conclusions; this experiment supplies evidence for the calls below.
Blackwell FP4
Hold the FP4 architecture decision until B200 results exist.
EC2 UnfulfillableCapacity; no quantized artifact produced.
Decision record
Run shape
2 cases spanning 512 to 2,048 prompt tokens, each with 128 output tokens, paired with 3 run profiles.
Cases
2 cases
steady-512-output-128
steady latency, throughput, and serving cost comparison
prefill-2048-output-128
prefill-heavy memory and TTFT comparison
Run profiles
3 profiles
bf16-baseline
Qwen2.5 7B BF16 baseline on a full p6-b200 instance
nvfp4-plain
NVFP4 W4A4 artifact without pre-optimization
nvfp4-smoothquant
SmoothQuant preprocessing followed by NVFP4 W4A4 quantization
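To make the difference between the two NVFP4 profiles concrete, here is a minimal pure-Python sketch of the smoothing step SmoothQuant applies before quantization. It rescales each input channel so that activation outliers move into the weights while the layer output stays mathematically unchanged. The toy tensors, `alpha=0.5`, and the helper names `matmul` and `smooth` are assumptions for illustration; how this project's build applies the step is not specified here.

```python
# Minimal SmoothQuant-style smoothing on a toy linear layer (pure Python).
# Illustrative only: alpha, the toy tensors, and the helper names are
# assumptions, not this project's build pipeline.

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def smooth(x, w, alpha=0.5):
    """Shift activation outliers into the weights, one input channel at a time.

    x: activations, shape [tokens][in_channels]
    w: weights, shape [in_channels][out_channels]
    """
    in_channels = len(w)
    scales = []
    for j in range(in_channels):
        act_max = max(abs(row[j]) for row in x)   # per-channel activation range
        w_max = max(abs(v) for v in w[j])         # per-channel weight range
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(in_channels)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(in_channels)]
    return x_s, w_s, scales

x = [[0.5, 40.0, -0.2], [-0.3, -60.0, 0.1]]       # channel 1 carries outliers
w = [[0.2, -0.1], [0.05, 0.02], [-0.3, 0.4]]
x_s, w_s, scales = smooth(x, w)

ref = matmul(x, w)
out = matmul(x_s, w_s)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
max_before = max(abs(v) for r in x for v in r)
max_after = max(abs(v) for r in x_s for v in r)

print("per-channel scales:", [round(s, 3) for s in scales])
print("max |X| before:", max_before)
print("max |X| after: ", max_after)
print("output drift:  ", max_err)
```

The smoothed activations have a much narrower range, which is what makes a 4-bit activation format easier to fit. Whether that translates into better accuracy recovery for NVFP4 W4A4 is exactly what the nvfp4-smoothquant profile is meant to measure.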
Measurement focus
Metrics to capture
Metric groups show the signal this experiment needs.
Serving
Accuracy and memory
Cost
Usage
How to run
Examples show one local render path and one live-cluster path.
Example local command
./scripts/experiment show fp4
Example live command
./scripts/experiment run --experiment fp4 --case steady-512-output-128 --profile bf16-baseline
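The live command above covers one of the six case and profile combinations. The sketch below enumerates all six by pairing the two cases with the three run profiles. It assumes the `run` subcommand accepts every combination in the same form as the single example shown, which is not confirmed here; `live_commands` is a hypothetical helper, and the script only prints the commands without executing them.

```python
from itertools import product

# Case and profile names as listed in the run shape.
CASES = ["steady-512-output-128", "prefill-2048-output-128"]
PROFILES = ["bf16-baseline", "nvfp4-plain", "nvfp4-smoothquant"]

def live_commands(experiment="fp4"):
    """Build one live-run command line per (case, profile) pair.

    Assumption: every pair uses the same flags as the single example command.
    """
    return [
        f"./scripts/experiment run --experiment {experiment} "
        f"--case {case} --profile {profile}"
        for case, profile in product(CASES, PROFILES)
    ]

commands = live_commands()
for cmd in commands:
    print(cmd)   # print only; nothing is launched
```

Printing the matrix first makes it easy to check that each case is run against the BF16 baseline as well as both quantized profiles before any Blackwell capacity is spent.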