Multi-GPU
This guide covers advanced training configurations for multi-GPU setups using Axolotl.
1 Overview
Axolotl supports several methods for multi-GPU training:
- DeepSpeed (recommended)
- FSDP (Fully Sharded Data Parallel)
- Sequence parallelism
- FSDP + QLoRA
2 DeepSpeed
2.1 Configuration
Add to your YAML config:
```yaml
deepspeed: deepspeed_configs/zero1.json
```
2.2 Usage
```bash
# Fetch the default DeepSpeed configs (if not already present)
axolotl fetch deepspeed_configs

# Pass the DeepSpeed config via the YAML config
axolotl train config.yml

# Or pass it via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
```
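Training uses all visible GPUs by default. To restrict a run to a subset of devices, the standard `CUDA_VISIBLE_DEVICES` environment variable applies (a general CUDA convention, not an Axolotl-specific option):

```bash
# Example: run on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 axolotl train config.yml
```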
2.3 ZeRO Stages
We provide default configurations for:
- ZeRO Stage 1 (`zero1.json`)
- ZeRO Stage 1 with torch compile (`zero1_torch_compile.json`)
- ZeRO Stage 2 (`zero2.json`)
- ZeRO Stage 3 (`zero3.json`)
- ZeRO Stage 3 with bf16 (`zero3_bf16.json`)
- ZeRO Stage 3 with bf16 and CPU offload of parameters (`zero3_bf16_cpuoffload_params.json`)
- ZeRO Stage 3 with bf16 and CPU offload of parameters and optimizer (`zero3_bf16_cpuoffload_all.json`)
For best performance, choose the configuration that offloads the least while still fitting your model in VRAM. Start with Stage 1 and move to Stage 2, then Stage 3, only if needed.
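For example, if `zero1.json` runs out of memory, step up to the next stage by pointing the config at a different file:

```yaml
# Move from zero1.json to zero2.json (and later zero3.json) only if the previous stage OOMs
deepspeed: deepspeed_configs/zero2.json
```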
Using ZeRO Stage 3 with Single-GPU training
ZeRO Stage 3 can be used for training on a single GPU by manually setting the environment variables:
```bash
WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500
```
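Putting this together, a single-GPU Stage 3 run looks roughly like the following (a sketch, reusing the `config.yml` and DeepSpeed configs from above):

```bash
# Sketch of a single-GPU ZeRO Stage 3 run; the env vars emulate a one-process "cluster"
WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500 \
  axolotl train config.yml --deepspeed deepspeed_configs/zero3_bf16.json
```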
3 Fully Sharded Data Parallel (FSDP)
FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.
3.1 Migrating from FSDP1 to FSDP2
To migrate a config from FSDP1 to FSDP2, set the top-level `fsdp_version` field to specify the FSDP version, and rename fields according to the mapping below.
3.1.1 Config mapping
| FSDP1 | FSDP2 |
|---|---|
| `fsdp_sharding_strategy` | `reshard_after_forward` |
| `fsdp_backward_prefetch_policy` | REMOVED |
| `fsdp_backward_prefetch` | REMOVED |
| `fsdp_forward_prefetch` | REMOVED |
| `fsdp_sync_module_states` | REMOVED |
| `fsdp_cpu_ram_efficient_loading` | `cpu_ram_efficient_loading` |
| `fsdp_state_dict_type` | `state_dict_type` |
| `fsdp_use_orig_params` | REMOVED |
For example, if you were using the following FSDP1 config:
```yaml
fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
```
You can migrate to the following FSDP2 config:
```yaml
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```
3.2 FSDP1 (deprecated)
Using `fsdp` to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use `fsdp_config` as above instead.
```yaml
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```
4 Sequence parallelism
We support sequence parallelism (SP) via the ring-flash-attention project. SP splits individual sequences across multiple GPUs, which is useful when a single long sequence would otherwise cause OOM errors during training.
See our dedicated guide for more information.
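As a minimal sketch of what this looks like in the YAML config (assuming the `sequence_parallel_degree` option covered in the dedicated guide, with flash attention enabled):

```yaml
# Sketch only: split each sequence across 4 GPUs
# (assumes sequence_parallel_degree must evenly divide the number of GPUs)
sequence_parallel_degree: 4
flash_attention: true
```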
4.1 FSDP + QLoRA
For combining FSDP with QLoRA, see our dedicated guide.
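As a rough sketch (the dedicated guide is authoritative), the combination amounts to enabling a 4-bit QLoRA adapter, plus your usual LoRA hyperparameters, alongside an FSDP2 config like the one shown earlier:

```yaml
# Sketch only: QLoRA adapter trained under FSDP2
adapter: qlora
load_in_4bit: true

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```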
5 Performance Optimization
5.1 Liger Kernel Integration
See the Liger kernel integration docs for more information.
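As a sketch of what enabling the integration typically looks like (the plugin path and individual kernel flags below are assumptions; the linked docs list the supported options):

```yaml
# Assumed plugin path and flags; check the Liger docs for the exact supported set
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true
```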
6 Troubleshooting
6.1 NCCL Issues
For NCCL-related problems, see our NCCL troubleshooting guide.
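A useful first step when diagnosing hangs or crashes is to rerun with verbose NCCL logging (a generic NCCL environment variable, not an Axolotl option):

```bash
# NCCL_DEBUG=INFO prints NCCL's transport/topology choices and error details
NCCL_DEBUG=INFO axolotl train config.yml
```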
6.2 Common Problems
For out-of-memory errors (see the example values after this list):

- Reduce `micro_batch_size`
- Reduce `eval_batch_size`
- Adjust `gradient_accumulation_steps`
- Consider using a higher ZeRO stage

For unstable training:

- Start with DeepSpeed ZeRO-2
- Monitor loss values
- Check learning rates
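For example, a memory-constrained multi-GPU run might start from conservative values like these (illustrative numbers only; tune for your hardware and model size):

```yaml
# Illustrative starting point for tight VRAM; all values are hypothetical
micro_batch_size: 1
eval_batch_size: 1
gradient_accumulation_steps: 4
deepspeed: deepspeed_configs/zero2.json
```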
For more detailed troubleshooting, see our debugging guide.