scatter2scatter INT64_INDICES bench

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

Median of 10 iters, 3 warmup. top_k=8, dtype=bf16, 128 experts.
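The measurement protocol above (3 untimed warmup calls, then the median of 10 timed calls) can be sketched as follows. This is a minimal CPU-side illustration, not the benchmark's actual harness: median_ms is a hypothetical helper name, and a real GPU measurement would also need to synchronise the device (or use GPU-side events) before reading the clock, which this sketch omits.

```python
import statistics
import time


def median_ms(fn, iters=10, warmup=3):
    """Return the median wall-clock time of fn() in milliseconds.

    Hypothetical helper: runs `warmup` untimed calls first so one-off
    costs (e.g. JIT compilation, cache fills) do not pollute the samples.
    """
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # s -> ms

    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; the real benchmark would launch the kernel here.
    print(f"{median_ms(lambda: sum(range(100_000))):.3f} ms")
```

The median is used rather than the mean so that a single slow iteration does not skew the reported figure.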

auto_int64 is the wrapper’s auto-dispatch verdict from _needs_int64_indices. At overflow shapes the int32 path is silently incorrect (the int32 offset multiplication wraps mid-buffer), so only the int64 timing is reported.
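To make the wrap-around concrete, the sketch below emulates signed 32-bit arithmetic in plain Python. Several things here are assumptions rather than facts from the benchmark: needs_int64 is a hypothetical stand-in and not the real _needs_int64_indices, the flat offset is assumed to be row * row_width, and the row widths (2048 and 16384) are inferred by dividing the reported "out elems" by L_scattered.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(x: int) -> int:
    """Value a signed 32-bit integer would hold after overflow."""
    return (x + 2**31) % 2**32 - 2**31


def needs_int64(rows: int, row_width: int) -> bool:
    """Hypothetical check: does the largest flat offset exceed int32?"""
    return rows * row_width - 1 > INT32_MAX


# (L_scattered, row width); widths are inferred, not taken from the source.
shapes = {
    "small": (65536, 2048),
    "medium": (1024000, 2048),
    "overflow_524k_s16": (262144, 16384),
}

for name, (rows, width) in shapes.items():
    print(name, rows * width, needs_int64(rows, width))

# First row whose starting offset no longer fits in int32.
rows, width = shapes["overflow_524k_s16"]
first_bad_row = INT32_MAX // width + 1
print(first_bad_row, wrap_int32(first_bad_row * width))
```

Under these assumptions the first corrupted row in the overflow shape is exactly halfway through the buffer, and its offset wraps to a negative value instead of raising an error, which is why the failure is silent.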

Shape              T       L_scattered  out elems  auto_int64  int32 ms  int64 ms  int64 vs int32 (%)
small              8192    65536        1.34e+08   False       2.699     2.704     +0.2
medium             128000  1024000      2.10e+09   False       40.126    40.790    +1.7
overflow_524k_s16  32768   262144       4.29e+09   True        -         80.105    -

Acceptance: ≤5% regression on the int32 fast path at small/medium shapes (the auto-dispatch picks int32 there, so these rows characterise the JIT overhead of having an int64 variant available).
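The percentage column and the acceptance check can be recomputed from the two timing columns. The sketch below assumes the column is defined as (int64 - int32) / int32 and that the 5% limit is applied to that figure; slowdown_pct and THRESHOLD_PCT are illustrative names, not part of the benchmark code.

```python
THRESHOLD_PCT = 5.0  # acceptance limit stated above

# (int32 ms, int64 ms) copied from the results table.
timings_ms = {
    "small": (2.699, 2.704),
    "medium": (40.126, 40.790),
}


def slowdown_pct(int32_ms: float, int64_ms: float) -> float:
    """Relative slowdown of int64 over int32, in percent."""
    return (int64_ms - int32_ms) / int32_ms * 100.0


results = {name: slowdown_pct(*pair) for name, pair in timings_ms.items()}

for name, pct in results.items():
    print(f"{name}: {pct:+.1f}% within_limit={pct <= THRESHOLD_PCT}")
```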