Attention
SDP Attention
This uses the default scaled dot-product attention (SDPA) built into PyTorch.
```yaml
sdp_attention: true
```
For more details: PyTorch docs
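For reference, a minimal sketch of the underlying PyTorch call (not Axolotl code; the tensor shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, seq_len, head_dim); sizes are illustrative only.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# PyTorch dispatches to a fused backend (flash, memory-efficient, or math)
# based on the hardware and inputs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```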
Flash Attention 2
Uses efficient fused kernels to compute exact attention.
```yaml
flash_attention: true
```
For more details: Flash Attention
Nvidia
Requirements: Ampere, Ada, or Hopper GPUs
Note: For Turing or older GPUs, please use one of the other attention methods.
```bash
pip install flash-attn --no-build-isolation
```
If you get an `undefined symbol` error while training, make sure you installed PyTorch before Axolotl. Alternatively, try reinstalling flash-attn or downgrading to an earlier version.
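A quick way to sanity-check the install is to import the package in the same environment (a sketch; the printed version will vary):

```python
# If this raises an undefined-symbol error, flash-attn was built against a
# different PyTorch than the one currently installed.
import flash_attn

print(flash_attn.__version__)
```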
Flash Attention 3
Requirements: Hopper GPUs only; CUDA 12.8 (recommended)
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
```
AMD
Requirements: ROCm 6.0 and above.
Flex Attention
A flexible PyTorch API for attention used in combination with torch.compile.
```yaml
flex_attention: true
# recommended
torch_compile: true
```
We recommend using the latest stable version of PyTorch for best performance.
For more details: PyTorch docs
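For context, a minimal sketch of the underlying PyTorch API (not Axolotl code; the causal `score_mod` follows the PyTorch docs and the shapes are arbitrary):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Toy tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# A score_mod modifies attention scores element-wise; this one applies a causal mask.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# Compiling flex_attention fuses the score_mod into an efficient kernel.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
```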
SageAttention
Attention kernels that use INT8 quantization for QK and an FP16 accumulator for PV.
```yaml
sage_attention: true
```
Requirements: Ampere, Ada, or Hopper GPUs
```bash
pip install sageattention==2.2.0 --no-build-isolation
```
Only LoRA/QLoRA is recommended at the moment; we found the loss drops to 0 with full fine-tuning. See the GitHub Issue.
For more details: Sage Attention
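For context, a rough sketch of calling the kernel directly, following the upstream README (the layout and shapes here are assumptions for illustration; setting `sage_attention: true` handles this inside Axolotl):

```python
import torch
from sageattention import sageattn

# Toy half-precision tensors laid out as (batch, heads, seq_len, head_dim) ("HND").
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# The INT8 QK quantization and FP16 PV accumulation happen inside the kernel.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=True)
```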
We do not support SageAttention 3 at the moment. If you are interested in adding this or improving the SageAttention implementation, please open an Issue.
xFormers
```yaml
xformers_attention: true
```
We recommend using this with Turing or older GPUs (such as on Colab).
For more details: xFormers
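For context, a minimal sketch of the memory-efficient attention op that xFormers provides (not Axolotl code; the shapes and causal bias are illustrative):

```python
import torch
from xformers.ops import LowerTriangularMask, memory_efficient_attention

# xFormers expects inputs shaped (batch, seq_len, heads, head_dim).
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)

# Causal masking via the built-in lower-triangular attention bias.
out = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())
```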
Shifted Sparse Attention
We plan to deprecate this! If you use this feature, we recommend switching to one of the methods above.
Requirements: LLaMA model architecture
```yaml
flash_attention: true
s2_attention: true
```
No sample packing support!