Optimizers

Configuring optimizers

Overview

Axolotl supports every optimizer exposed through transformers' OptimizerNames. Select one with the optimizer option in your config (a minimal example follows the list below).

Here is the list of optimizers supported by transformers as of v4.54.0:

  • adamw_torch
  • adamw_torch_fused
  • adamw_torch_xla
  • adamw_torch_npu_fused
  • adamw_apex_fused
  • adafactor
  • adamw_anyprecision
  • adamw_torch_4bit
  • adamw_torch_8bit
  • ademamix
  • sgd
  • adagrad
  • adamw_bnb_8bit
  • adamw_8bit (alias for adamw_bnb_8bit)
  • ademamix_8bit
  • lion_8bit
  • lion_32bit
  • paged_adamw_32bit
  • paged_adamw_8bit
  • paged_ademamix_32bit
  • paged_ademamix_8bit
  • paged_lion_32bit
  • paged_lion_8bit
  • rmsprop
  • rmsprop_bnb
  • rmsprop_bnb_8bit
  • rmsprop_bnb_32bit
  • galore_adamw
  • galore_adamw_8bit
  • galore_adafactor
  • galore_adamw_layerwise
  • galore_adamw_8bit_layerwise
  • galore_adafactor_layerwise
  • lomo
  • adalomo
  • grokadamw
  • schedule_free_radam
  • schedule_free_adamw
  • schedule_free_sgd
  • apollo_adamw
  • apollo_adamw_layerwise
  • stable_adamw
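
For reference, here is a minimal sketch of selecting one of these built-in optimizers in an Axolotl config. The learning_rate and lr_scheduler values are illustrative, not recommendations:

optimizer: adamw_torch_fused
learning_rate: 0.0002  # illustrative value; tune for your model and dataset
lr_scheduler: cosine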

Custom Optimizers

Enable a custom optimizer by passing its name as a string to the optimizer option. Each of these optimizers receives the beta and epsilon args; some also accept additional args, which are detailed below.
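
As a sketch, a custom optimizer with the beta and epsilon args spelled out might look like the following, assuming the same adam_beta1 / adam_beta2 / adam_epsilon keys shown in the came_pytorch section below; the values here are illustrative, not the optimizer's actual defaults:

optimizer: optimi_adamw
adam_beta1: 0.9       # illustrative values; actual defaults depend on the optimizer
adam_beta2: 0.99
adam_epsilon: 1e-8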

optimi_adamw

optimizer: optimi_adamw

ao_adamw_4bit

Deprecated: Please use adamw_torch_4bit.

ao_adamw_8bit

Deprecated: Please use adamw_torch_8bit.

ao_adamw_fp8

optimizer: ao_adamw_fp8

adopt_adamw

GitHub: https://github.com/iShohei220/adopt
Paper: https://arxiv.org/abs/2411.02853

optimizer: adopt_adamw

came_pytorch

GitHub: https://github.com/yangluo7/CAME/tree/master
Paper: https://arxiv.org/abs/2307.02047

optimizer: came_pytorch

# optional args (defaults shown)
adam_beta1: 0.9
adam_beta2: 0.999
adam_beta3: 0.9999
adam_epsilon: 1e-30
adam_epsilon2: 1e-16

muon

Blog: https://kellerjordan.github.io/posts/muon/
Paper: https://arxiv.org/abs/2502.16982v1

optimizer: muon

dion

Microsoft’s Dion (DIstributed OrthoNormalization) is a scalable, communication-efficient orthonormalizing optimizer that uses low-rank approximations to reduce gradient communication.

GitHub: https://github.com/microsoft/dion
Paper: https://arxiv.org/pdf/2504.05295
Note: the implementation targets PyTorch 2.7+ for its DTensor support.

optimizer: dion
dion_lr: 0.01
dion_momentum: 0.95
lr: 0.00001  # learning rate for embeddings and parameters that fall back to AdamW