scatter2scatter INT64_INDICES bench
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Median of 10 iters, 3 warmup. top_k=8, dtype=bf16, 128 experts.
auto_int64 is the wrapper’s auto-dispatch verdict from _needs_int64_indices. At overflow shapes the int32 path is silently incorrect (the multiplication wraps mid-buffer), so only the int64 timing is reported.
| Shape | T | L_scattered | out elems | auto_int64 | int32 ms | int64 ms | int64 vs int32 (%) |
|---|---|---|---|---|---|---|---|
| small | 8192 | 65536 | 1.34e+08 | False | 2.699 | 2.704 | +0.2 |
| medium | 128000 | 1024000 | 2.10e+09 | False | 40.126 | 40.790 | +1.7 |
| overflow_524k_s16 | 32768 | 262144 | 4.29e+09 | True | — | 80.105 | — |
Acceptance: ≤5% regression on the int32 fast path at small/medium shapes (the auto-dispatch picks int32 there, so this row characterises the JIT overhead of having an int64 variant available).