Checkpoint Saving
1 Overview
Axolotl supports on-demand checkpoint saving during training. You can trigger checkpoints via file-based triggers (for programmatic control) or Control+C (for interactive use).
2 File-Based Checkpoint Trigger
2.1 Configuration
Enable in your config:
dynamic_checkpoint:
  enabled: true
  check_interval: 100  # Optional: check every N steps (default: 100)
  trigger_file_path: "axolotl_checkpoint.save"  # Optional: custom filename

Options:
- enabled: Set to true to turn the feature on (required).
- check_interval: Number of steps between file checks. Default: 100. Lower values respond faster but add I/O overhead.
- trigger_file_path: Custom trigger filename. Default: axolotl_checkpoint.save
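To make the check_interval trade-off concrete, here is a rough estimate of the longest wait between creating the trigger file and its detection. The function name and the seconds_per_step figure are illustrative assumptions, not part of Axolotl; the only fact taken from this page is that the file is checked every check_interval steps.

```python
def worst_case_delay_s(check_interval: int, seconds_per_step: float) -> float:
    """Upper bound on detection latency: one full interval of training steps."""
    return check_interval * seconds_per_step


# Hypothetical run that takes 2 seconds per step:
print(worst_case_delay_s(100, 2.0))  # default interval
print(worst_case_delay_s(50, 2.0))   # halving the interval halves the wait
```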
2.2 How It Works
- Rank 0 checks for the trigger file in output_dir every check_interval steps
- When detected, the file is deleted and a checkpoint is saved
- In distributed training, rank 0 broadcasts to synchronize all ranks
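The steps above can be sketched roughly as follows. This is a minimal illustration of the polling pattern and not Axolotl's actual implementation: the function name trigger_detected and its signature are hypothetical, and the broadcast from rank 0 to the other ranks is left out.

```python
from pathlib import Path


def trigger_detected(step: int, output_dir: str, check_interval: int = 100,
                     trigger_name: str = "axolotl_checkpoint.save") -> bool:
    """Return True if a save was requested; consume the trigger file if so."""
    # Only touch the filesystem every `check_interval` steps to limit I/O.
    if step % check_interval != 0:
        return False
    trigger = Path(output_dir) / trigger_name
    try:
        # Deleting the file ensures the same request is not handled twice.
        trigger.unlink()
    except FileNotFoundError:
        return False
    return True
```

In a distributed run, only rank 0 would call a function like this; the other ranks would learn the result through the broadcast step described above.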
2.3 Usage
Command line:
touch /path/to/output_dir/axolotl_checkpoint.save

Programmatic:

from pathlib import Path
Path("/path/to/output_dir/axolotl_checkpoint.save").touch()

The checkpoint is saved within the next check_interval steps. The trigger file is auto-deleted after detection, so you can create it multiple times.
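Because the trigger file is deleted once it is detected, a script can use its disappearance as a signal that the request was picked up. The helper below is a sketch under that assumption; request_checkpoint, timeout_s, and poll_s are hypothetical names, not part of Axolotl.

```python
import time
from pathlib import Path


def request_checkpoint(output_dir: str,
                       trigger_name: str = "axolotl_checkpoint.save",
                       timeout_s: float = 600.0,
                       poll_s: float = 1.0) -> bool:
    """Create the trigger file, then wait until the trainer deletes it."""
    trigger = Path(output_dir) / trigger_name
    trigger.touch()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not trigger.exists():
            return True  # Trigger was consumed, so the request was detected.
        time.sleep(poll_s)
    return False  # Still present: training may not be running or checking yet.
```

A True result only means the trigger was detected; writing the checkpoint to disk may take additional time after that.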
Custom filename:
dynamic_checkpoint:
  enabled: true
  trigger_file_path: "my_trigger.save"

touch /path/to/output_dir/my_trigger.save

3 Control+C (SIGINT) Checkpoint
Pressing Ctrl+C during training saves the model state and exits gracefully. Note: This saves only the model weights, not optimizer state. For resumable checkpoints, use the file-based trigger.
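For readers unfamiliar with how a program can exit gracefully on Ctrl+C, the following generic Python pattern records the SIGINT instead of aborting mid-step. It is an illustration only and makes no claim about how Axolotl handles the signal internally; InterruptFlag and train are hypothetical names.

```python
import signal


class InterruptFlag:
    """Record Ctrl+C (SIGINT) instead of raising KeyboardInterrupt immediately."""

    def __init__(self) -> None:
        self.requested = False
        self._previous = None

    def __enter__(self) -> "InterruptFlag":
        # Install our handler and remember the old one so it can be restored.
        self._previous = signal.signal(signal.SIGINT, self._handle)
        return self

    def _handle(self, signum, frame) -> None:
        self.requested = True

    def __exit__(self, exc_type, exc, tb) -> bool:
        if self._previous is not None:
            signal.signal(signal.SIGINT, self._previous)
        return False


def train(total_steps: int, flag: InterruptFlag, step_fn) -> int:
    """Run steps until done or interrupted; return the number of completed steps."""
    completed = 0
    for step in range(total_steps):
        step_fn(step)
        completed += 1
        if flag.requested:
            break  # A real trainer would save model weights here, then exit.
    return completed
```

The current step always finishes before the loop stops, which is what lets a save happen from a consistent state.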
4 Best Practices
- Check interval: Lower values (10-50) for fast training, default 100 for slower training
- Distributed training: Create trigger file once; rank 0 handles synchronization
- Resume: Dynamic checkpoints can be resumed like regular checkpoints via resume_from_checkpoint
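When resuming, you need the path of the checkpoint to pass to resume_from_checkpoint. The sketch below picks the most recent one, assuming checkpoints are written as checkpoint-<step> directories inside output_dir (the usual Hugging Face Trainer layout; verify this against your own output directory). The function latest_checkpoint is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, if any."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by numeric step so checkpoint-1000 sorts after checkpoint-500.
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can then be set as the resume_from_checkpoint value in your config.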
5 Example
output_dir: ./outputs/lora-out
save_steps: 500  # Scheduled checkpoints
dynamic_checkpoint:
  enabled: true
  check_interval: 50

This enables scheduled checkpoints every 500 steps plus on-demand saves via file trigger (checked every 50 steps).