ScatterMoE LoRA — MXFP4 benchmark

Routing mode: dense — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • Shape: E=128, K=2048, N=1024, top_k=8, M=4096, rank=16 (active experts = 128)
  • Iters: 10 warmup + 50 timed, fwd+bwd per iter
  • HBM peak (datasheet): 1792 GB/s
Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 5.25 6244998 1252.8 105.5 5.9
Strategy A (selective dequant) 30.57 1071778 8557.3 18.1 1.0
Strategy B (fused MX) 12.24 2677582 1425.3 13.0 0.7

Routing mode: sparse — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • Shape: E=256, K=2048, N=1024, top_k=8, M=4096, rank=16 (active experts = 10)
  • Iters: 10 warmup + 50 timed, fwd+bwd per iter
  • HBM peak (datasheet): 1792 GB/s
Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 6.55 5006027 1960.8 9.0 0.5
Strategy A (selective dequant) 5.75 5695789 2059.9 10.2 0.6
Strategy B (fused MX) 8.95 3661270 1997.8 3.1 0.2

Routing mode: balanced — M sweep — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • Base shape: E=256, K=2048, N=1024, top_k=8, rank=16
  • M values: 256, 1024, 4096, 16384
  • Iters: 10 warmup + 50 timed, fwd+bwd per iter
  • HBM peak (datasheet): 1792 GB/s

Summary (ms/iter, fwd+bwd)

M active / E bf16 ms Strategy A ms Strategy B ms winner (A vs B)
256 215/256 (0.84) 2.99 OOM 8.24 B
1024 251/256 (0.98) 3.43 OOM 10.74 B
4096 255/256 (1.00) 6.56 OOM 16.50 B
16384 256/256 (1.00) 24.15 OOM 46.56 B

M=256 (active experts = 215 / 256, num_active/E = 0.840)

Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 2.99 685596 1686.0 302.2 16.9
Strategy A (selective dequant) OOM OOM OOM OOM OOM
Strategy B (fused MX) 8.24 248639 1954.9 29.2 1.6

M=1024 (active experts = 251 / 256, num_active/E = 0.980)

Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 3.43 2389143 1744.2 308.3 17.2
Strategy A (selective dequant) OOM OOM OOM OOM OOM
Strategy B (fused MX) 10.74 762567 2058.1 26.4 1.5

M=4096 (active experts = 255 / 256, num_active/E = 0.996)

Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 6.56 4994760 1960.8 165.6 9.2
Strategy A (selective dequant) OOM OOM OOM OOM OOM
Strategy B (fused MX) 16.50 1985884 2280.0 18.2 1.0

M=16384 (active experts = 256 / 256, num_active/E = 1.000)

Config ms/iter tokens/s peak mem (MB) HBM GB/s HBM %
bf16 baseline 24.15 5427073 2827.0 47.2 2.6
Strategy A (selective dequant) OOM OOM OOM OOM OOM
Strategy B (fused MX) 46.56 2814943 3149.0 7.6 0.4

Notes

  • Strategy A OOMs at all M under load-balanced routing at E=256 because the torchao MXTensor dequant path materializes several full-shape fp32/int32 unpack buffers (~12 GiB combined for [256, 1024, 2048] at fp4 → fp32) while vLLM colocated on this workstation pins ~88 GB of HBM, leaving only ~14 GB free. Extrapolating from the dense E=128 case above (Strategy A peak ~8.6 GB at 128 active experts), the E=256 / 256-active dequant peak would be ~17 GB — over the available headroom.
  • Active-expert count is essentially E at every sampled M. Under a load-balance-regularized router (per-token N(0,1) noise + N(0,0.5) per-expert bias), E[active] ≈ E · (1 − (1 − top_k/E)^M). With E=256 / top_k=8 this yields ≥ 215 unique experts even at M=256 and saturates at 256 by M ≈ 16K. Balanced routing therefore does not generate a low-active regime at these token counts — i.e. the A-vs-B crossover does not appear in this sweep; B wins by default because A does not fit.
  • B vs bf16: Strategy B is consistently 1.9–2.9× slower than the bf16 baseline (similar to the dense E=128 ratio of ~2.3×). HBM utilization for both is modest (B 0.4–1.6 %, bf16 2.6–17.2 %), suggesting the kernels are compute- or scheduling-bound for these shapes, not bandwidth-bound.
  • Where the A-vs-B crossover lives, by theory: Strategy A is preferred when num_active / E is small enough that the dequant cost is offset by the cheaper bf16 matmul — the prior sparse row (10/256 active, A=5.75 ms vs B=8.95 ms) sits in that regime. Strategy B is preferred near num_active / E ≈ 1, where dequant of all experts dominates. The threshold between the two — somewhere in the 10/256 to 215/256 band — is not observable from the balanced-router setting; eliciting it would need an M smaller than 256, a synthetic deliberately-sparse router, or freeing the vLLM GPU and rerunning at E=256.