Checkpoint Saving

1 Overview

Axolotl supports on-demand checkpoint saving during training. You can trigger checkpoints via file-based triggers (for programmatic control) or Control+C (for interactive use).

2 File-Based Checkpoint Trigger

2.1 Configuration

Enable in your config:

dynamic_checkpoint:
  enabled: true
  check_interval: 100  # Optional: check every N steps (default: 100)
  trigger_file_path: "axolotl_checkpoint.save"  # Optional: custom filename

Options:

  • enabled: Set to true to enable the trigger (required).
  • check_interval: Steps between file checks. Default: 100. Lower values respond faster but add I/O overhead.
  • trigger_file_path: Custom trigger filename. Default: axolotl_checkpoint.save

2.2 How It Works

  1. Every check_interval steps, rank 0 checks output_dir for the trigger file.
  2. When the file is detected, it is deleted and a checkpoint is saved.
  3. In distributed training, rank 0 broadcasts the decision to synchronize all ranks.
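
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not Axolotl's actual implementation: the function name should_save_checkpoint is hypothetical, and it assumes the check fires on steps that are exact multiples of check_interval. In a distributed run, only rank 0 would call it and then broadcast the boolean result to the other ranks.

```python
from pathlib import Path


def should_save_checkpoint(
    step: int,
    output_dir: str,
    check_interval: int = 100,
    trigger_name: str = "axolotl_checkpoint.save",
) -> bool:
    """Return True if a trigger file was found (and consumed) at this step.

    Hypothetical helper for illustration only.
    """
    # Only touch the filesystem every `check_interval` steps.
    if step % check_interval != 0:
        return False

    trigger = Path(output_dir) / trigger_name
    if not trigger.exists():
        return False

    # Delete the trigger so the same request is not honored twice.
    trigger.unlink(missing_ok=True)
    return True
```

Deleting the file at detection time is what makes the trigger reusable: each new file is one new request.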

2.3 Usage

Command line:

touch /path/to/output_dir/axolotl_checkpoint.save

Programmatic:

from pathlib import Path
Path("/path/to/output_dir/axolotl_checkpoint.save").touch()

The checkpoint is saved within the next check_interval steps. The trigger file is deleted automatically after detection, so you can create it again whenever you want another checkpoint.
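
Because the trigger file disappears once it has been detected, a script can use that as an acknowledgement. The sketch below is a hypothetical helper (request_checkpoint, timeout_s and poll_s are illustrative names, not part of Axolotl). Note that the file vanishing means the request was picked up, not necessarily that the checkpoint has finished writing.

```python
import time
from pathlib import Path


def request_checkpoint(
    output_dir: str,
    trigger_name: str = "axolotl_checkpoint.save",
    timeout_s: float = 600.0,
    poll_s: float = 1.0,
) -> bool:
    """Create the trigger file and wait until the trainer consumes it.

    Returns True if the file was deleted before the timeout, else False.
    """
    trigger = Path(output_dir) / trigger_name
    trigger.touch()

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not trigger.exists():
            return True  # the request was detected
        time.sleep(poll_s)
    return False  # still pending; the file is left in place
```

Pick timeout_s to cover at least check_interval steps at your current step time.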

Custom filename:

dynamic_checkpoint:
  enabled: true
  trigger_file_path: "my_trigger.save"

touch /path/to/output_dir/my_trigger.save

3 Control+C (SIGINT) Checkpoint

Pressing Ctrl+C during training saves the model weights and exits gracefully. Note: This save does not include optimizer state. For resumable checkpoints, use the file-based trigger.
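
One common way to get this kind of behavior in Python is to catch KeyboardInterrupt, which the interpreter raises in the main thread on Ctrl+C by default. The sketch below is a generic illustration under that assumption, not Axolotl's code; train_step and save_weights are hypothetical callbacks supplied by the caller.

```python
from typing import Callable


def train_with_interrupt_save(
    num_steps: int,
    train_step: Callable[[int], None],
    save_weights: Callable[[], None],
) -> int:
    """Run training steps; on Ctrl+C, save weights once and stop.

    Returns the number of steps that completed.
    """
    completed = 0
    try:
        for step in range(num_steps):
            train_step(step)
            completed += 1
    except KeyboardInterrupt:
        # Save weights only; optimizer state is not written here.
        save_weights()
    return completed
```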

4 Best Practices

  • Check interval: Use lower values (10-50) for fast training and keep the default of 100 for slower training
  • Distributed training: Create trigger file once; rank 0 handles synchronization
  • Resume: Dynamic checkpoints can be resumed like regular checkpoints via resume_from_checkpoint
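
To resume from the most recent checkpoint, you first need its path. The sketch below assumes checkpoints are written as checkpoint-<step> directories inside output_dir, which is the Hugging Face Trainer naming convention; whether dynamic checkpoints follow exactly this naming is an assumption to verify in your own output_dir. The function name latest_checkpoint is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed directory naming: checkpoint-<global_step>
_CKPT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    best: Optional[Path] = None
    best_step = -1
    for child in Path(output_dir).iterdir():
        match = _CKPT_PATTERN.match(child.name)
        if match is None or not child.is_dir():
            continue
        step = int(match.group(1))  # compare numerically, not as text
        if step > best_step:
            best_step, best = step, child
    return best
```

The returned path can then be used as the value of resume_from_checkpoint in your config.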

5 Example

output_dir: ./outputs/lora-out
save_steps: 500  # Scheduled checkpoints

dynamic_checkpoint:
  enabled: true
  check_interval: 50

This enables scheduled checkpoints every 500 steps plus on-demand saves via file trigger (checked every 50 steps).
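
As a rough worked example of what "checked every 50 steps" means for response time, the sketch below estimates how long a trigger may wait before it is noticed. It assumes checks happen on steps that are exact multiples of check_interval, and the step time of 2.0 seconds is a made-up figure; steps_until_next_check is a hypothetical helper.

```python
def steps_until_next_check(current_step: int, check_interval: int) -> int:
    """Steps remaining until the next trigger-file check.

    Assumes checks occur when the step is a multiple of check_interval.
    If the current step is itself a multiple, that check has already run,
    so a full interval remains.
    """
    return check_interval - (current_step % check_interval)


# Trigger created at step 120 with check_interval: 50.
remaining = steps_until_next_check(120, 50)  # next check at step 150
seconds_per_step = 2.0                       # hypothetical step time
delay_s = remaining * seconds_per_step       # estimated wait in seconds
```

The worst case is one full interval: 50 steps, or 100 seconds at this assumed step time.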