monkeypatch.attention.fp8_attn
monkeypatch.attention.fp8_attn
FP8 low-precision attention via torchao.
Requires
- PyTorch >= 2.11.0
- SM90+ (Hopper/Blackwell) GPU
- flash-attn package with FA3 support
- torchao >= 0.17.0
Uses per-head FP8 quantized attention with automatic RoPE fusion under torch.compile. The torchao patch replaces F.scaled_dot_product_attention, so the model must use HF’s “sdpa” attention implementation for the patch to intercept attention calls.
Functions
| Name | Description |
|---|---|
| patch_fp8_attention | Apply FP8 low-precision attention to a model. |
patch_fp8_attention
monkeypatch.attention.fp8_attn.patch_fp8_attention(model)Apply FP8 low-precision attention to a model.
Must be called after model loading and before torch.compile. KV caching should be disabled (config.use_cache = False).