Multi-GPU

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

1 Overview

Axolotl supports several methods for multi-GPU training:

  • DeepSpeed (recommended)
  • FSDP (Fully Sharded Data Parallel)
  • Sequence parallelism
  • FSDP + QLoRA

2 DeepSpeed

2.1 Configuration

Add to your YAML config:

deepspeed: deepspeed_configs/zero1.json

2.2 Usage

# Fetch deepspeed configs (if not already present)
axolotl fetch deepspeed_configs

# Pass the DeepSpeed config via your YAML config (as shown above)
axolotl train config.yml

# Or pass the DeepSpeed config via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
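
If you need to restrict which GPUs participate, the standard CUDA_VISIBLE_DEVICES environment variable applies as usual (a minimal sketch reusing the command above):

# Train on only the first two GPUs
CUDA_VISIBLE_DEVICES=0,1 axolotl train config.yml --deepspeed deepspeed_configs/zero1.json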

2.3 ZeRO Stages

We provide default configurations for:

  • ZeRO Stage 1 (zero1.json)
  • ZeRO Stage 1 with torch compile (zero1_torch_compile.json)
  • ZeRO Stage 2 (zero2.json)
  • ZeRO Stage 3 (zero3.json)
  • ZeRO Stage 3 with bf16 (zero3_bf16.json)
  • ZeRO Stage 3 with bf16 and CPU offload params (zero3_bf16_cpuoffload_params.json)
  • ZeRO Stage 3 with bf16 and CPU offload params and optimizer (zero3_bf16_cpuoffload_all.json)

Tip

For best performance, choose the configuration that offloads the least while still allowing the model to fit in VRAM.

Start from Stage 1 -> Stage 2 -> Stage 3.
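
For example, moving from Stage 1 to Stage 2 only requires pointing your YAML at the corresponding bundled config:

deepspeed: deepspeed_configs/zero2.json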

Tip

Using ZeRO Stage 3 with Single-GPU training

ZeRO Stage 3 can be used for training on a single GPU by manually setting the following environment variables: WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500
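
For example, combining those environment variables with the CLI invocation from the Usage section above:

WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500 \
  axolotl train config.yml --deepspeed deepspeed_configs/zero3_bf16.json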

3 Fully Sharded Data Parallel (FSDP)

Note

FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.

3.1 Migrating from FSDP1 to FSDP2

To migrate your config from FSDP1 to FSDP2, set the top-level fsdp_version config field to 2 and rename the fields inside fsdp_config according to the mapping below.

3.1.1 Config mapping

FSDP1                            FSDP2
fsdp_sharding_strategy           reshard_after_forward
fsdp_backward_prefetch_policy    REMOVED
fsdp_backward_prefetch           REMOVED
fsdp_forward_prefetch            REMOVED
fsdp_sync_module_states          REMOVED
fsdp_cpu_ram_efficient_loading   cpu_ram_efficient_loading
fsdp_state_dict_type             state_dict_type
fsdp_use_orig_params             REMOVED

For example, if you were using the following FSDP1 config:

fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

You can migrate to the following FSDP2 config (note that the FULL_SHARD sharding strategy corresponds to reshard_after_forward: true):

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true

3.2 FSDP1 (deprecated)

Note

Using fsdp to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use fsdp_config as above instead.

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

4 Sequence parallelism

We support sequence parallelism (SP) via the ring-flash-attention project. SP splits each sequence across GPUs, which is useful when a single long sequence would otherwise cause OOM errors during training.
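
As a rough sketch of what enabling SP looks like in the YAML config (the field names below reflect Axolotl's SP integration but should be verified against the dedicated guide for your version):

# Split each sequence across 4 GPUs; the degree should evenly divide the GPU count
sequence_parallel_degree: 4
# the ring-flash-attention backend builds on flash attention
flash_attention: true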

See our dedicated guide for more information.

5 FSDP + QLoRA

For combining FSDP with QLoRA, see our dedicated guide.
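
As a minimal sketch only (the dedicated guide has a complete, tested configuration), this means enabling a QLoRA adapter alongside an FSDP config; the fsdp_config fields below mirror the FSDP2 example earlier in this guide, while adapter and load_in_4bit are the standard Axolotl QLoRA fields:

adapter: qlora
load_in_4bit: true

fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true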

6 Performance Optimization

6.1 Liger Kernel Integration

Please see the Liger Kernel docs for more information.
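
As an illustrative sketch (the plugin path and kernel flags below are based on Axolotl's Liger integration and should be checked against the docs for your version):

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true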

7 Troubleshooting

7.1 NCCL Issues

For NCCL-related problems, see our NCCL troubleshooting guide.
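
A quick first step is to turn on NCCL's own logging via the standard NCCL_DEBUG environment variable before re-running your training command, for example:

NCCL_DEBUG=INFO axolotl train config.yml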

7.2 Common Problems

If you encounter out-of-memory (OOM) errors (a config sketch follows these lists):

  • Reduce micro_batch_size
  • Reduce eval_batch_size
  • Adjust gradient_accumulation_steps
  • Consider using a higher ZeRO stage

For training instability:

  • Start with DeepSpeed ZeRO-2
  • Monitor loss values
  • Check learning rates
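
The memory-related settings above are plain YAML config fields; a sketch with conservative, illustrative values:

micro_batch_size: 1              # smaller per-GPU batches reduce activation memory
eval_batch_size: 1
gradient_accumulation_steps: 4   # preserve effective batch size without extra memory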

For more detailed troubleshooting, see our debugging guide.