# Optimizers

## Overview
Axolotl supports all optimizers exposed through transformers' `OptimizerNames`.

Here is the list of optimizers supported by transformers as of v4.54.0 (a minimal configuration example follows the list):
- adamw_torch
- adamw_torch_fused
- adamw_torch_xla
- adamw_torch_npu_fused
- adamw_apex_fused
- adafactor
- adamw_anyprecision
- adamw_torch_4bit
- adamw_torch_8bit
- ademamix
- sgd
- adagrad
- adamw_bnb_8bit
- adamw_8bit (alias for adamw_bnb_8bit)
- ademamix_8bit
- lion_8bit
- lion_32bit
- paged_adamw_32bit
- paged_adamw_8bit
- paged_ademamix_32bit
- paged_ademamix_8bit
- paged_lion_32bit
- paged_lion_8bit
- rmsprop
- rmsprop_bnb
- rmsprop_bnb_8bit
- rmsprop_bnb_32bit
- galore_adamw
- galore_adamw_8bit
- galore_adafactor
- galore_adamw_layerwise
- galore_adamw_8bit_layerwise
- galore_adafactor_layerwise
- lomo
- adalomo
- grokadamw
- schedule_free_radam
- schedule_free_adamw
- schedule_free_sgd
- apollo_adamw
- apollo_adamw_layerwise
- stable_adamw
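Any of these can be selected by name. A minimal sketch, assuming the standard Axolotl `optimizer` and `learning_rate` keys (the value shown is illustrative only, not a recommendation):

```yaml
# pick any optimizer name from the list above
optimizer: adamw_torch

# illustrative value only; tune for your model and dataset
learning_rate: 2e-5
```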
## Custom Optimizers
Enable a custom optimizer by passing its name as a string to the `optimizer` argument. Each optimizer receives the beta and epsilon args; some accept additional args, which are detailed below.
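As a sketch of how the shared args are passed, assuming the `adam_beta1`/`adam_beta2`/`adam_epsilon` keys shown later on this page apply to custom optimizers generally (values are illustrative, not tuned recommendations):

```yaml
optimizer: optimi_adamw

# shared args passed through to the custom optimizer (illustrative values)
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-8
```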
### optimi_adamw

```yaml
optimizer: optimi_adamw
```
### ao_adamw_4bit

Deprecated: Please use `adamw_torch_4bit`.
### ao_adamw_8bit

Deprecated: Please use `adamw_torch_8bit`.
### ao_adamw_fp8

```yaml
optimizer: ao_adamw_fp8
```
### adopt_adamw

- GitHub: https://github.com/iShohei220/adopt
- Paper: https://arxiv.org/abs/2411.02853

```yaml
optimizer: adopt_adamw
```
### came_pytorch

- GitHub: https://github.com/yangluo7/CAME/tree/master
- Paper: https://arxiv.org/abs/2307.02047

```yaml
optimizer: came_pytorch

# optional args (defaults below)
adam_beta1: 0.9
adam_beta2: 0.999
adam_beta3: 0.9999
adam_epsilon: 1e-30
adam_epsilon2: 1e-16
```
### muon

- Blog: https://kellerjordan.github.io/posts/muon/
- Paper: https://arxiv.org/abs/2502.16982v1

```yaml
optimizer: muon
```
### dion

Microsoft’s Dion (DIstributed OrthoNormalization) optimizer is a scalable, communication-efficient orthonormalizing optimizer that uses low-rank approximations to reduce gradient communication.

- GitHub: https://github.com/microsoft/dion
- Paper: https://arxiv.org/pdf/2504.05295
- Note: The implementation targets PyTorch 2.7+ for DTensor support.

```yaml
optimizer: dion
dion_lr: 0.01
dion_momentum: 0.95
lr: 0.00001  # learning rate for embeddings and parameters that fall back to AdamW
```