Tony Lee

FP4 Quantization Optimization

Does SmoothQuant improve NVFP4 W4A4 accuracy recovery enough to justify its memory, latency, throughput, and cost tradeoffs?

Project

GPU Inference Decision Lab

Focus

Quantization

Run shape

2 cases / 3 profiles

Evidence

Blocked: Blackwell capacity

Why it matters

Defines a Blackwell FP4 comparison for BF16, plain NVFP4, and SmoothQuant, with latency, throughput, accuracy, memory, serving cost, and build cost tracked separately.

Quantization helps only if memory or cost gains survive accuracy, latency, throughput, and build-cost tradeoffs. This experiment keeps FP4 claims measurable.

Blocked

Result status

The experiment definition is ready; the live attempt stopped before producing a comparable result.

Blackwell capacity blocked

Blackwell capacity blocked the live run. Store generated Markdown, JSON, and client logs in docs/reports/ until a comparable result is selected.

Next runs to curate

Capacity retry

Retry only when p6-b200.48xlarge capacity is available or reserved in the target us-west-2 zones.

./scripts/up

Plain NVFP4 build

Produce the first quantized artifact before comparing BF16, plain NVFP4, and SmoothQuant serving behavior.

./scripts/experiment render-quantization --experiment fp4 --profile nvfp4-plain

SmoothQuant comparison

Measure whether SmoothQuant improves NVFP4 accuracy recovery enough to justify its extra build cost.

./scripts/experiment render-accuracy --experiment fp4 --profile nvfp4-smoothquant

Decision links

Supports decisions

Decision records own the project conclusions; this experiment supplies evidence for the calls below.

Blocked

Quantization + hardware

Blackwell FP4

Hold the FP4 architecture decision until B200 results exist.

EC2 UnfulfillableCapacity; no quantized artifact produced.

Decision record

Run shape

2 cases, with 512 or 2,048 prompt tokens and 128 output tokens each, paired with 3 run profiles.
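As a rough illustration of the run shape, the sketch below enumerates every case and profile pairing and prints a live-run command for each. The case names, profile names, and command format come from this page; treating the pairing as a full cross product (6 runs) is an assumption, as is the idea that every combination is accepted by the runner.

```python
from itertools import product

# Case and profile names are taken from this page. The full cross product
# is an assumption about how the runner pairs them.
CASES = ["steady-512-output-128", "prefill-2048-output-128"]
PROFILES = ["bf16-baseline", "nvfp4-plain", "nvfp4-smoothquant"]


def run_commands(experiment="fp4"):
    """Return one live-run command line per (case, profile) pair."""
    return [
        f"./scripts/experiment run --experiment {experiment} "
        f"--case {case} --profile {profile}"
        for case, profile in product(CASES, PROFILES)
    ]


if __name__ == "__main__":
    for command in run_commands():
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run without a live cluster.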

Cases

2 cases

steady-512-output-128

steady-state latency, throughput, and serving cost comparison

512 prompt tokens, 128 output tokens

prefill-2048-output-128

prefill-heavy memory and TTFT comparison

2,048 prompt tokens, 128 output tokens

Run profiles

3 profiles

bf16-baseline

Qwen2.5 7B BF16 baseline on a full p6-b200 instance

nvfp4-plain

NVFP4 W4A4 artifact without pre-optimization

nvfp4-smoothquant

SmoothQuant preprocessing followed by NVFP4 W4A4 quantization
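For readers unfamiliar with the preprocessing step, here is a minimal pure-Python sketch of the smoothing transform described in the SmoothQuant paper: each input channel j gets a scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), activations are divided by it, and the matching weight row is multiplied by it, so the layer output is unchanged while activation outliers shrink. The function names, the alpha default of 0.5, and the toy matrices are assumptions for illustration; this is not the project's build code, and it does not perform the NVFP4 quantization itself.

```python
def smoothing_scales(act_absmax, weight, alpha=0.5, eps=1e-8):
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    act_absmax[j] is the calibration-time absolute maximum of activation
    channel j; weight is laid out as [input_channel][output_channel].
    """
    scales = []
    for j, row in enumerate(weight):
        a_max = max(act_absmax[j], eps)
        w_max = max(max(abs(w) for w in row), eps)
        scales.append(a_max ** alpha / w_max ** (1 - alpha))
    return scales


def smooth(activations, weight, scales):
    """Divide activation columns by s and multiply weight rows by s."""
    x_s = [[x / s for x, s in zip(row, scales)] for row in activations]
    w_s = [[w * s for w in row] for row, s in zip(weight, scales)]
    return x_s, w_s


def matmul(a, b):
    """Plain list-of-lists matrix product, used only to check equivalence."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


if __name__ == "__main__":
    # Toy values: channel 1 carries an activation outlier.
    x = [[1.0, 100.0], [2.0, -50.0]]
    w = [[0.5, -1.0], [0.25, 0.125]]
    absmax = [max(abs(row[j]) for row in x) for j in range(2)]
    s = smoothing_scales(absmax, w)
    x_s, w_s = smooth(x, w, s)
    print("scales:", s)
    print("original output:", matmul(x, w))
    print("smoothed output:", matmul(x_s, w_s))
```

In a real pipeline the activation scaling is typically folded into the preceding layer offline, so the smoothing adds build-time work rather than serving-time work; whether that holds for this project's artifact is something the build-cost metric is meant to show.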

Measurement focus

Metrics to capture

The metric groups below list the signals this experiment needs to capture.

Serving

p95 request latency, p99 request latency, requests/sec, generated tokens/sec
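The serving metrics above can be derived from per-request client logs roughly as follows. This is a sketch under stated assumptions: the input shape (one latency and one generated-token count per request, plus the wall-clock duration of the run) and the nearest-rank percentile definition are choices made here, not something this page specifies.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def serving_summary(latencies_s, generated_tokens, wall_clock_s):
    """Summarize one run: tail latency, request rate, and token rate."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return {
        "p95_latency_s": nearest_rank_percentile(latencies_s, 95),
        "p99_latency_s": nearest_rank_percentile(latencies_s, 99),
        "requests_per_sec": len(latencies_s) / wall_clock_s,
        "generated_tokens_per_sec": sum(generated_tokens) / wall_clock_s,
    }
```

Nearest-rank always returns an observed latency, which keeps tail values easy to trace back to a specific request in the logs; an interpolating percentile would also be a reasonable choice as long as all three profiles use the same one.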

Accuracy and memory

average accuracy score, FP4 recovery vs BF16, GPU memory used, GPU memory free
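One plausible reading of "FP4 recovery vs BF16" is the quantized profile's average accuracy score divided by the BF16 baseline score. The sketch below uses that definition, which is an assumption; the function names are hypothetical and the page does not state the exact formula.

```python
def fp4_recovery(quantized_score, bf16_score):
    """Fraction of the BF16 score retained by a quantized profile (1.0 = full recovery)."""
    if bf16_score <= 0:
        raise ValueError("bf16_score must be positive")
    return quantized_score / bf16_score


def smoothquant_gain(plain_score, smoothquant_score, bf16_score):
    """Recovery gained by SmoothQuant over plain NVFP4, as a fraction of BF16."""
    return fp4_recovery(smoothquant_score, bf16_score) - fp4_recovery(plain_score, bf16_score)
```

A positive `smoothquant_gain` is the quantity that would then be weighed against the extra build cost tracked in the Cost group.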

Cost

serving cost, cost per 1M generated tokens, quantization build cost
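The cost metrics reduce to simple arithmetic once an hourly instance price and a measured token rate are known. The sketch below assumes cost is attributed purely by instance time; the parameter names are hypothetical and no price is implied for p6-b200.48xlarge.

```python
def cost_per_million_tokens(hourly_price_usd, generated_tokens_per_sec):
    """Serving cost in USD per 1M generated tokens at a sustained token rate."""
    if generated_tokens_per_sec <= 0:
        raise ValueError("generated_tokens_per_sec must be positive")
    tokens_per_hour = generated_tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def build_cost(hourly_price_usd, build_seconds):
    """One-off quantization build cost in USD, billed by instance time."""
    if build_seconds < 0:
        raise ValueError("build_seconds must be non-negative")
    return hourly_price_usd * build_seconds / 3600
```

Keeping build cost separate from serving cost matches the page's intent of tracking the two independently: build cost is paid once per artifact, while cost per 1M tokens scales with traffic.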

Usage

How to run

Examples show one local render path and one live-cluster path.

Example local command

./scripts/experiment show fp4

Example live command

./scripts/experiment run --experiment fp4 --case steady-512-output-128 --profile bf16-baseline