Preference Learning (RLHF) — Agent Reference
Reference for DPO, IPO, KTO, ORPO, and SimPO. For config templates and dataset format examples, see rlhf.qmd. For GRPO, see grpo.qmd. For EBFT, see ebft.qmd.
Method Overview
| Method | Data Requirement | Key Idea | Best For |
|---|---|---|---|
| DPO | Paired (chosen + rejected) | Implicit reward via preference pairs | General alignment, most common |
| IPO | Paired (chosen + rejected) | DPO with different loss (avoids overfitting) | When DPO overfits |
| KTO | Unpaired (completion + binary label) | Kahneman-Tversky loss, no pairs needed | When you only have thumbs-up/down |
| ORPO | Paired (chosen + rejected) | Combined SFT + preference, no ref model | Single-stage alignment, saves VRAM |
| SimPO | Paired (chosen + rejected) | Length-normalized, no ref model | Simple setup, length-robust |
Default: start with DPO. All methods require sample_packing: false.
Architecture
┌──────────────┐ ┌───────────────┐ ┌───────────────┐
│ Policy Model │ │ Reference │ │ Preference │
│ (trainable) │ │ Model (frozen)│ │ Dataset │
└──────┬───────┘ └──────┬────────┘ └──────┬────────┘
└──────────┬───────┘ │
v │
Forward pass on chosen + rejected <─────┘
│
Preference Loss (DPO/IPO/KTO/...)
│
Backprop + Update
Exception: ORPO and SimPO do NOT use a reference model (~50% less VRAM).
No vLLM server needed (unlike GRPO). Offline RL with pre-collected preference data.
Method Selection
- Paired preference data (chosen + rejected)?
- Default →
rl: dpo - Overfitting →
rl: ipo - VRAM-limited →
rl: orpo(no ref model) - Length-sensitive →
rl: simpo(no ref model)
- Default →
- Only binary labels (good/bad)? →
rl: kto - Single-stage training (no separate SFT)? →
rl: orpo
| DPO | IPO | KTO | ORPO | SimPO | |
|---|---|---|---|---|---|
| Reference model | Yes | Yes | Yes | No | No |
| VRAM overhead | ~2x model | ~2x model | ~2x model | ~1x model | ~1x model |
| TRL trainer class | DPOTrainer | DPOTrainer | KTOTrainer | ORPOTrainer | CPOTrainer |
Prompt Strategy Resolution
The type field resolves to a Python function:
type: "chatml.intel"
→ axolotl.prompt_strategies.dpo.chatml.intel(cfg, **kwargs)
→ returns transform_fn(sample) → {"prompt", "chosen", "rejected"}
type: "chat_template.default"
→ axolotl.prompt_strategies.dpo.chat_template.default(cfg, dataset_idx, **kwargs)
type: {"field_prompt": "prompt", ...} (dict)
→ axolotl.prompt_strategies.dpo.user_defined.default(...)
Module base: axolotl.prompt_strategies.{rl_method} — replace dpo with kto or orpo.
Healthy Training Indicators
| Metric | Healthy Range | Problem |
|---|---|---|
train/loss |
Decreasing, 0.3-0.7 | Flat or increasing = broken data or too high LR |
rewards/chosen |
Increasing | Flat = model not learning preferences |
rewards/rejected |
Decreasing | Increasing = model prefers wrong responses |
rewards/margins |
Positive and increasing | Negative = prefers rejected over chosen |
rewards/accuracies |
> 0.5, toward 0.7+ | < 0.5 = worse than random |
logps/rejected |
Decreasing | Increasing = reward hacking |
grad_norm |
0.01 - 10.0 | > 100 = exploding gradients |
Method-specific: DPO/IPO watch rewards/margins; KTO loss is noisier; ORPO monitor SFT + odds ratio components; SimPO check length-normalized reward separation.
Known Issues
| Issue | Fix |
|---|---|
| Sample packing crash | Set sample_packing: false (required for all preference methods) |
KTO KeyError: 'label' |
Ensure dataset has boolean label column |
ORPO/KTO KeyError during tokenization |
Add remove_unused_columns: false |
| ORPO template not applied | ORPO requires explicit chat_template setting |
| OOM with ref model (DPO/IPO/KTO) | Use LoRA/QLoRA, or switch to ORPO/SimPO (no ref model) |
| IPO + label_smoothing | Do not set dpo_label_smoothing when rl: ipo |
Full troubleshooting: training_stability.qmd
File Map
src/axolotl/
core/trainers/dpo/ # DPO trainer, args, strategy
core/builders/rl.py # HFRLTrainerBuilder — routes rl type → trainer class
core/training_args.py # AxolotlKTOConfig, AxolotlORPOConfig, AxolotlCPOConfig
prompt_strategies/
dpo/ # DPO/IPO/SimPO dataset strategies
chat_template.py # chat_template.default, chat_template.argilla_chat
chatml.py # chatml.default/intel/icr/argilla_chat/prompt_pairs/ultra
llama3.py # llama3 variants (same subtypes as chatml)
user_defined.py # Custom field mapping
passthrough.py # No transform
kto/ # KTO dataset strategies (chatml, llama3, user_defined)
orpo/ # ORPO dataset strategies (chat_template.argilla)
utils/schemas/enums.py # RLType enum (dpo, ipo, kto, orpo, simpo, grpo, gdpo, ebft)
utils/schemas/config.py # All rl/dpo/kto/orpo/simpo config fields
docs/rlhf.qmd # Full user docs: all dataset formats, config templates
docs/choosing_method.qmd # SFT vs DPO vs GRPO decision guide
examples/qwen2/dpo.yaml # DPO example
examples/llama-3/qlora-1b-kto.yaml # KTO example