Multi-GPU

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

Overview

When training on multiple GPUs, Axolotl supports three sharding/parallelism strategies, and you can layer additional optimization features on top of whichever strategy you choose.

You generally cannot combine these strategies; they are mutually exclusive.

  1. DeepSpeed: Powerful optimization library, supports ZeRO stages 1-3.
  2. FSDP (Fully Sharded Data Parallel): PyTorch’s native sharding implementation (Recommended).
  3. DDP (Distributed Data Parallel): PyTorch’s native data parallelism implementation (the default when neither of the above is selected).
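
In config terms, you pick exactly one of these per run: set deepspeed: for DeepSpeed, set fsdp_version and fsdp_config for FSDP, or set neither to fall back to DDP. A minimal sketch, using only the fields covered in the sections below:

# Option A: DeepSpeed
deepspeed: deepspeed_configs/zero1.json

# Option B: FSDP (see the FSDP section below for a full fsdp_config)
# fsdp_version: 2
# fsdp_config:
#   ...

# Option C: DDP: set neither of the above; no extra config is required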

These features can often be combined with the strategies above:

  • Sequence Parallelism: Splits long sequences across GPUs (Compatible with DDP, DeepSpeed, and FSDP).
  • FSDP + QLoRA: Combines 4-bit quantization with FSDP (Specific to FSDP).

DeepSpeed

Configuration

Add to your YAML config:

deepspeed: deepspeed_configs/zero1.json

Usage

# Fetch deepspeed configs (if not already present)
axolotl fetch deepspeed_configs

# Pass the DeepSpeed config via your YAML (deepspeed: field)
axolotl train config.yml

# Or pass the DeepSpeed config path via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
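
axolotl train typically launches one process per visible GPU. To restrict a run to specific GPUs, one common approach is to set CUDA_VISIBLE_DEVICES (an illustrative sketch, not specific to Axolotl):

# Restrict training to GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 axolotl train config.yml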

ZeRO Stages

We provide default configurations for:

  • ZeRO Stage 1 (zero1.json)
  • ZeRO Stage 1 with torch compile (zero1_torch_compile.json)
  • ZeRO Stage 2 (zero2.json)
  • ZeRO Stage 3 (zero3.json)
  • ZeRO Stage 3 with bf16 (zero3_bf16.json)
  • ZeRO Stage 3 with bf16 and CPU offload params (zero3_bf16_cpuoffload_params.json)
  • ZeRO Stage 3 with bf16 and CPU offload params and optimizer (zero3_bf16_cpuoffload_all.json)

Tip

For best performance, choose the configuration that offloads the least while still fitting within your GPUs' VRAM.

Start with Stage 1 and move to Stage 2, then Stage 3, only if you run out of memory.
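
One way to apply this tip in practice, using only the bundled configs listed above (a sketch of the escalation order, not a prescription):

# Start here
deepspeed: deepspeed_configs/zero1.json

# If that OOMs, step up:
# deepspeed: deepspeed_configs/zero2.json
# deepspeed: deepspeed_configs/zero3_bf16.json

# Only reach for CPU offload if plain Stage 3 still does not fit:
# deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
# deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json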

Fully Sharded Data Parallel (FSDP)

FSDP allows you to shard model parameters, gradients, and optimizer states across data parallel workers.

Note

FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.

FSDP + QLoRA

For combining FSDP with QLoRA, see our dedicated guide.
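
As a rough sketch (assuming a Llama-style model; the dedicated guide has the complete, authoritative config), combining the two usually involves a QLoRA adapter plus an FSDP config like the one shown later on this page:

# QLoRA: 4-bit base model with a LoRA adapter
adapter: qlora
load_in_4bit: true

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true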

Migrating from FSDP1 to FSDP2

To migrate your config from FSDP1 to FSDP2, set the top-level fsdp_version config field to 2 and rename the fields in fsdp_config according to the mapping below.

Config mapping

FSDP1                            FSDP2
fsdp_sharding_strategy           reshard_after_forward
fsdp_backward_prefetch_policy    REMOVED
fsdp_backward_prefetch           REMOVED
fsdp_forward_prefetch            REMOVED
fsdp_sync_module_states          REMOVED
fsdp_cpu_ram_efficient_loading   cpu_ram_efficient_loading
fsdp_state_dict_type             state_dict_type
fsdp_use_orig_params             REMOVED
fsdp_activation_checkpointing    activation_checkpointing

For more details, please see the migration guide in the torchtitan repo.

In Axolotl, if you were using the following FSDP1 config:

fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

You can migrate to the following FSDP2 config:

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
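
The fsdp_sharding_strategy row in the mapping above collapses into a boolean: as far as we understand the FSDP2 API, FULL_SHARD corresponds to reshard_after_forward: true (as in the example), while a SHARD_GRAD_OP-style (ZeRO-2-like) setup would instead use:

fsdp_version: 2
fsdp_config:
  # other fields as in the example above
  # Keep gathered parameters after the forward pass (ZeRO-2-like): more memory, fewer all-gathers
  reshard_after_forward: false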

FSDP1 (deprecated)

Note

Using fsdp to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use fsdp_config as above instead.

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

Sequence parallelism

We support sequence parallelism (SP) via the ring-flash-attention project. SP splits long sequences across GPUs, which is useful when even a single sequence is long enough to cause OOM errors during training.
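
As an illustration (the field name below is an assumption; the guide linked below has the authoritative configuration), a degree-4 setup splits a 32,768-token sequence into 8,192-token shards, so each GPU only holds the activations for its shard:

# Split every sequence across 4 GPUs (assumed field name; see the SP guide)
sequence_parallel_degree: 4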

See our dedicated guide for more information.

Performance Optimization

Liger Kernel Integration

Please see the Liger kernel docs for more information.
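
As a sketch of what enabling Liger looks like in an Axolotl config (field names reflect our understanding of the Liger plugin; confirm against the linked docs):

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true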

Troubleshooting

NCCL Issues

For NCCL-related problems, see our NCCL troubleshooting guide.

Common Problems

  • Out-of-memory (OOM) errors: reduce micro_batch_size, reduce eval_batch_size, adjust gradient_accumulation_steps, or move to a higher ZeRO stage (see the sketch below).
  • Slow training: DeepSpeed ZeRO-2 is usually a good starting point, balancing speed against memory savings.
  • Unstable training: monitor loss values and check learning rates.
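
For the out-of-memory case, note that the effective batch size is micro_batch_size × gradient_accumulation_steps × number of GPUs, so you can shrink the micro-batch while raising accumulation to keep it constant. A sketch for 4 GPUs:

# Effective batch size: 1 x 16 x 4 GPUs = 64
micro_batch_size: 1
gradient_accumulation_steps: 16
eval_batch_size: 1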

For more detailed troubleshooting, see our debugging guide.