ScatterMoE LoRA — MXFP4 benchmark
Routing mode: dense — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- Shape: E=128, K=2048, N=1024, top_k=8, M=4096, rank=16 (active experts = 128)
- Iters: 10 warmup + 50 timed, fwd+bwd per iter
- HBM peak (datasheet): 1792 GB/s
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 5.25 | 6244998 | 1252.8 | 105.5 | 5.9 |
| Strategy A (selective dequant) | 30.57 | 1071778 | 8557.3 | 18.1 | 1.0 |
| Strategy B (fused MX) | 12.24 | 2677582 | 1425.3 | 13.0 | 0.7 |
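
The derived columns can be reproduced from the raw timings. The sketch below assumes (inferred from the numbers, not stated in the report) that tokens/s counts routed token-expert pairs, i.e. M · top_k per iteration, and that HBM % is the achieved GB/s divided by the 1792 GB/s datasheet peak. The function names are illustrative.

```python
HBM_PEAK_GBPS = 1792.0  # datasheet peak quoted in the report


def tokens_per_second(m: int, top_k: int, ms_per_iter: float) -> float:
    """Routed (token, expert) pairs per second; assumed definition of tokens/s."""
    return m * top_k / (ms_per_iter / 1000.0)


def hbm_percent(gbps: float, peak_gbps: float = HBM_PEAK_GBPS) -> float:
    """Achieved bandwidth as a percentage of the datasheet peak."""
    return 100.0 * gbps / peak_gbps


# bf16 baseline, dense routing: M=4096, top_k=8, 5.25 ms/iter, 105.5 GB/s
print(f"{tokens_per_second(4096, 8, 5.25):,.0f} tokens/s")
print(f"{hbm_percent(105.5):.1f} % of peak")
```

The first figure lands within about 0.1 % of the tabulated 6244998 (the table's ms/iter is rounded to two decimals), and the second rounds to the tabulated 5.9 %.
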
Routing mode: sparse — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- Shape: E=256, K=2048, N=1024, top_k=8, M=4096, rank=16 (active experts = 10)
- Iters: 10 warmup + 50 timed, fwd+bwd per iter
- HBM peak (datasheet): 1792 GB/s
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 6.55 | 5006027 | 1960.8 | 9.0 | 0.5 |
| Strategy A (selective dequant) | 5.75 | 5695789 | 2059.9 | 10.2 | 0.6 |
| Strategy B (fused MX) | 8.95 | 3661270 | 1997.8 | 3.1 | 0.2 |
Routing mode: balanced — M sweep — NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
- Base shape: E=256, K=2048, N=1024, top_k=8, rank=16
- M values: 256, 1024, 4096, 16384
- Iters: 10 warmup + 50 timed, fwd+bwd per iter
- HBM peak (datasheet): 1792 GB/s
Summary (ms/iter, fwd+bwd)
| M | active / E | bf16 ms | Strategy A ms | Strategy B ms | winner (A vs B) |
|---|---|---|---|---|---|
| 256 | 215/256 (0.84) | 2.99 | OOM | 8.24 | B |
| 1024 | 251/256 (0.98) | 3.43 | OOM | 10.74 | B |
| 4096 | 255/256 (1.00) | 6.56 | OOM | 16.50 | B |
| 16384 | 256/256 (1.00) | 24.15 | OOM | 46.56 | B |
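
Since Strategy A has no timings in this sweep, the informative comparison is Strategy B against the bf16 baseline. A minimal sketch that turns the summary rows into slowdown ratios (values copied from the table, variable names illustrative):

```python
# (M, bf16 ms/iter, Strategy B ms/iter) copied from the summary table
rows = [
    (256, 2.99, 8.24),
    (1024, 3.43, 10.74),
    (4096, 6.56, 16.50),
    (16384, 24.15, 46.56),
]


def slowdown(bf16_ms: float, b_ms: float) -> float:
    """How many times longer Strategy B takes than bf16 for one fwd+bwd iter."""
    return b_ms / bf16_ms


ratios = {m: slowdown(bf16, b) for m, bf16, b in rows}
for m, r in ratios.items():
    print(f"M={m:>5}: Strategy B takes {r:.2f}x the bf16 time")
```

The ratios come out at roughly 2.8×, 3.1×, 2.5× and 1.9×, so the relative gap narrows as M grows past 1024.
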
M=256 (active experts = 215 / 256, num_active/E = 0.840)
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 2.99 | 685596 | 1686.0 | 302.2 | 16.9 |
| Strategy A (selective dequant) | OOM | OOM | OOM | OOM | OOM |
| Strategy B (fused MX) | 8.24 | 248639 | 1954.9 | 29.2 | 1.6 |
M=1024 (active experts = 251 / 256, num_active/E = 0.980)
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 3.43 | 2389143 | 1744.2 | 308.3 | 17.2 |
| Strategy A (selective dequant) | OOM | OOM | OOM | OOM | OOM |
| Strategy B (fused MX) | 10.74 | 762567 | 2058.1 | 26.4 | 1.5 |
M=4096 (active experts = 255 / 256, num_active/E = 0.996)
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 6.56 | 4994760 | 1960.8 | 165.6 | 9.2 |
| Strategy A (selective dequant) | OOM | OOM | OOM | OOM | OOM |
| Strategy B (fused MX) | 16.50 | 1985884 | 2280.0 | 18.2 | 1.0 |
M=16384 (active experts = 256 / 256, num_active/E = 1.000)
| Config | ms/iter | tokens/s | peak mem (MB) | HBM GB/s | HBM % |
|---|---|---|---|---|---|
| bf16 baseline | 24.15 | 5427073 | 2827.0 | 47.2 | 2.6 |
| Strategy A (selective dequant) | OOM | OOM | OOM | OOM | OOM |
| Strategy B (fused MX) | 46.56 | 2814943 | 3149.0 | 7.6 | 0.4 |
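
Strategy A reports OOM in every table of this sweep. A back-of-the-envelope check is sketched below, using the figures given in the Notes (full-shape fp32/int32 buffers for a [256, 1024, 2048] weight tensor, ~12 GiB combined, ~14 GB free) and the dense E=128 peak from the first table. Linear scaling of peak memory with the number of active experts is an assumption, and the helper name is illustrative.

```python
GIB = 2**30


def full_shape_buffer_gib(e: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """Size of one dense [E, N, K] buffer at 4 bytes/element (fp32 or int32)."""
    return e * n * k * bytes_per_elem / GIB


one_buffer_gib = full_shape_buffer_gib(256, 1024, 2048)
# "~12 GiB combined" therefore corresponds to about this many live buffers
implied_buffers = 12 / one_buffer_gib

dense_peak_mb = 8557.3  # Strategy A peak at E=128 with 128 active experts
extrapolated_gb = dense_peak_mb * (256 / 128) / 1000  # assumes linear scaling
free_gb = 14.0  # headroom left by the colocated vLLM process

print(f"one full-shape buffer: {one_buffer_gib:.1f} GiB")
print(f"implied live buffers:  {implied_buffers:.0f}")
print(f"extrapolated peak:     {extrapolated_gb:.1f} GB vs {free_gb:.0f} GB free")
```

Under these assumptions a single unpack buffer is 2 GiB, and the extrapolated peak of about 17 GB exceeds the ~14 GB of headroom, which is consistent with the OOM rows.
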
Notes
- Strategy A OOMs at every M under load-balanced routing at E=256. The torchao MXTensor dequant path materializes several full-shape fp32/int32 unpack buffers (~12 GiB combined for [256, 1024, 2048] at fp4 → fp32), while a vLLM instance colocated on this workstation pins ~88 GB of HBM and leaves only ~14 GB free. Extrapolating from the dense E=128 case above (Strategy A peak ~8.6 GB at 128 active experts), the E=256 / 256-active dequant peak would be ~17 GB, which exceeds the available headroom.
- Active-expert count is at or near E at every sampled M. Under a load-balance-regularized router (per-token N(0,1) noise + N(0,0.5) per-expert bias), E[active] ≈ E · (1 − (1 − top_k/E)^M). With E=256 / top_k=8 the sweep sees ≥ 215 unique experts even at M=256 and saturates at 256 by M ≈ 16K. Balanced routing therefore does not generate a low-active regime at these token counts, so the A-vs-B crossover does not appear in this sweep; B wins by default because A does not fit.
- B vs bf16: Strategy B is consistently 1.9–3.1× slower than the bf16 baseline (similar to the dense E=128 ratio of ~2.3×). HBM utilization for both is modest (B 0.4–1.6 %, bf16 2.6–17.2 %), suggesting the kernels are compute- or scheduling-bound for these shapes, not bandwidth-bound.
- Where the A-vs-B crossover lives, by theory: Strategy A is preferred when num_active / E is small enough that the dequant cost is offset by the cheaper bf16 matmul; the prior sparse row (10/256 active, A=5.75 ms vs B=8.95 ms) sits in that regime. Strategy B is preferred near num_active / E ≈ 1, where dequant of all experts dominates. The threshold between the two, somewhere in the 10/256 to 215/256 band, is not observable from the balanced-router setting; eliciting it would need an M smaller than 256, a synthetic deliberately-sparse router, or freeing the vLLM GPU and rerunning at E=256.
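
The closed-form estimate E[active] ≈ E · (1 − (1 − top_k/E)^M) can be evaluated against the measured active-expert counts. The sketch below treats routing as uniform and independent per token, which is an assumption: the benchmark router also adds a per-expert bias that this formula ignores. The function name is illustrative.

```python
def expected_active_uniform(e: int, top_k: int, m: int) -> float:
    """Expected number of distinct experts hit by m tokens, each choosing
    top_k of e experts uniformly at random (no per-expert bias)."""
    return e * (1.0 - (1.0 - top_k / e) ** m)


# Measured active-expert counts from the balanced M sweep (E=256, top_k=8)
measured = {256: 215, 1024: 251, 4096: 255, 16384: 256}

for m, seen in measured.items():
    est = expected_active_uniform(256, 8, m)
    print(f"M={m:>5}: uniform formula {est:6.1f}, measured {seen}")

# A much smaller batch is needed before the formula predicts a sparse regime
print(f"M=    4: uniform formula {expected_active_uniform(256, 8, 4):6.1f}")
```

At M=256 the uniform formula already gives about 255.9, above the measured 215; the shortfall is plausibly due to the per-expert bias concentrating selections on fewer experts, so the closed form is best read as an upper estimate here. Under the same uniform assumption, M=4 would activate only about 30 experts, which illustrates how small M must be before a low-active regime appears.
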