Multi-GPU

This guide covers advanced training configurations for multi-GPU setups using Axolotl.

1 Overview

Axolotl supports several methods for multi-GPU training:

  • DeepSpeed (recommended)
  • FSDP (Fully Sharded Data Parallel)
  • Sequence parallelism
  • FSDP + QLoRA

2 DeepSpeed

2.1 Configuration

Add to your YAML config:

deepspeed: deepspeed_configs/zero1.json

2.2 Usage

# Fetch deepspeed configs (if not already present)
axolotl fetch deepspeed_configs

# Pass the DeepSpeed config via your YAML config (as shown above)
axolotl train config.yml

# Or pass the DeepSpeed config via the CLI
axolotl train config.yml --deepspeed deepspeed_configs/zero1.json
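
If you need to restrict which GPUs participate, the standard CUDA_VISIBLE_DEVICES environment variable applies as usual (a minimal sketch reusing the command above):

# Train on only the first two GPUs
CUDA_VISIBLE_DEVICES=0,1 axolotl train config.yml --deepspeed deepspeed_configs/zero1.json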

2.3 ZeRO Stages

We provide default configurations for:

  • ZeRO Stage 1 (zero1.json)
  • ZeRO Stage 1 with torch compile (zero1_torch_compile.json)
  • ZeRO Stage 2 (zero2.json)
  • ZeRO Stage 3 (zero3.json)
  • ZeRO Stage 3 with bf16 (zero3_bf16.json)
  • ZeRO Stage 3 with bf16 and CPU offload params (zero3_bf16_cpuoffload_params.json)
  • ZeRO Stage 3 with bf16 and CPU offload params and optimizer (zero3_bf16_cpuoffload_all.json)

Tip

For best performance, choose the configuration that offloads the least while still allowing the model to fit in VRAM.

Start from Stage 1 -> Stage 2 -> Stage 3.
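
For example, moving from Stage 1 to Stage 2 only requires pointing your YAML at the corresponding bundled config:

deepspeed: deepspeed_configs/zero2.json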

Tip

Using ZeRO Stage 3 with Single-GPU training

ZeRO Stage 3 can be used for training on a single GPU by manually setting the following environment variables: WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500
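
For example, combining those environment variables with the CLI invocation from the Usage section above:

WORLD_SIZE=1 LOCAL_RANK=0 MASTER_ADDR=0.0.0.0 MASTER_PORT=29500 \
  axolotl train config.yml --deepspeed deepspeed_configs/zero3_bf16.json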

3 Fully Sharded Data Parallel (FSDP)

Note

FSDP2 is recommended for new users. FSDP1 is deprecated and will be removed in an upcoming release of Axolotl.

3.1 Migrating from FSDP1 to FSDP2

To migrate your config from FSDP1 to FSDP2, set the top-level fsdp_version config field to 2 and rename the fields inside fsdp_config according to the mapping below.

3.1.1 Config mapping

FSDP1                            FSDP2
fsdp_sharding_strategy           reshard_after_forward
fsdp_backward_prefetch_policy    REMOVED
fsdp_backward_prefetch           REMOVED
fsdp_forward_prefetch            REMOVED
fsdp_sync_module_states          REMOVED
fsdp_cpu_ram_efficient_loading   cpu_ram_efficient_loading
fsdp_state_dict_type             state_dict_type
fsdp_use_orig_params             REMOVED

For example, if you were using the following FSDP1 config:

fsdp_version: 1
fsdp_config:
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

You can migrate to the following FSDP2 config (note that the FULL_SHARD sharding strategy corresponds to reshard_after_forward: true):

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true

3.2 FSDP1 (deprecated)

Note

Using fsdp to configure FSDP is deprecated and will be removed in an upcoming release of Axolotl. Please use fsdp_config as above instead.

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

4 Sequence parallelism

We support sequence parallelism (SP) via the ring-flash-attention project. SP splits each sequence across GPUs, which is useful when a single long sequence would otherwise cause OOM errors during training.
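
As a rough sketch of what enabling SP looks like in the YAML config (the field names below reflect Axolotl's SP integration but should be verified against the dedicated guide for your version):

# Split each sequence across 4 GPUs; the degree should evenly divide the GPU count
sequence_parallel_degree: 4
# the ring-flash-attention backend builds on flash attention
flash_attention: true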

See our dedicated guide for more information.

5 FSDP + QLoRA

For combining FSDP with QLoRA, see our dedicated guide.
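
As a minimal sketch only (the dedicated guide has a complete, tested configuration), this means enabling a QLoRA adapter alongside an FSDP config; the fsdp_config fields below mirror the FSDP2 example earlier in this guide, while adapter and load_in_4bit are the standard Axolotl QLoRA fields:

adapter: qlora
load_in_4bit: true

fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true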

6 Performance Optimization

6.1 Liger Kernel Integration

Please see the Liger Kernel docs for more information.
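
As an illustrative sketch (the plugin path and kernel flags below are based on Axolotl's Liger integration and should be checked against the docs for your version):

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true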

7 Troubleshooting

7.1 NCCL Issues

For NCCL-related problems, see our NCCL troubleshooting guide.
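
A quick first step is to turn on NCCL's own logging via the standard NCCL_DEBUG environment variable before re-running your training command, for example:

NCCL_DEBUG=INFO axolotl train config.yml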

7.2 Common Problems

If you encounter out-of-memory (OOM) errors (a config sketch follows these lists):

  • Reduce micro_batch_size
  • Reduce eval_batch_size
  • Adjust gradient_accumulation_steps
  • Consider using a higher ZeRO stage

For training instability:

  • Start with DeepSpeed ZeRO-2
  • Monitor loss values
  • Check learning rates
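
The memory-related settings above are plain YAML config fields; a sketch with conservative, illustrative values:

micro_batch_size: 1              # smaller per-GPU batches reduce activation memory
eval_batch_size: 1
gradient_accumulation_steps: 4   # preserve effective batch size without extra memory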

For more detailed troubleshooting, see our debugging guide.