Which Fine-Tuning Method Should I Use?

A decision guide for choosing the right fine-tuning method, adapter, and hardware configuration in Axolotl.

1 Overview

Axolotl supports four broad categories of fine-tuning, each suited to different data types, objectives, and resource constraints.

| Method | What It Does | Data You Need |
|---|---|---|
| Supervised Fine-Tuning (SFT) | Teaches the model to produce specific outputs given inputs | Input-output pairs (instructions, conversations, completions) |
| Preference Learning (DPO/KTO/ORPO) | Steers the model toward preferred outputs and away from dispreferred ones | Chosen/rejected response pairs (DPO, ORPO) or binary labels (KTO) |
| Reinforcement Learning (GRPO) | Optimizes the model against a reward signal through online generation | A reward function (code- or model-based) and a prompt dataset |
| Reward Modeling | Trains a model to score responses, for use as a reward signal in RL | Preference pairs ranked by quality |

Each method is configured through a YAML file with rl: <method> (omit the key for SFT). All methods support LoRA, QLoRA, and full fine-tuning unless otherwise noted.
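As a minimal sketch of this selection, the method is a one-line change in the config. The model and dataset paths below are placeholders, not real files:

```yaml
base_model: ./models/my-base-model   # placeholder
rl: dpo                              # one of dpo, kto, orpo, grpo; omit for SFT
datasets:
  - path: ./data/train.jsonl         # placeholder
```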

2 Decision Tree

Use the following flowchart to choose your method. Start at the top and follow the path that matches your situation.

Do you have a reward function (code-based or model-based)?
├── YES
│   └── Use GRPO (rl: grpo)
│       The model generates its own completions and learns from reward scores.
│       Best for: math, code, reasoning, tasks with verifiable answers.
│       See: rlhf.qmd#grpo
│
└── NO
    │
    Do you have preference pairs (chosen vs. rejected responses)?
    ├── YES
    │   │
    │   Are they paired (same prompt, one chosen, one rejected)?
    │   ├── YES → Use DPO (rl: dpo)
    │   │         Direct optimization without a separate reward model.
    │   │         See: rlhf.qmd#dpo
    │   │
    │   └── NO (only binary good/bad labels)
    │       └── Use KTO (rl: kto)
    │           Works with unpaired preference data.
    │           See: rlhf.qmd#kto
    │
    └── NO
        │
        Do you have input-output examples?
        ├── YES → Use SFT
        │         The simplest and most common method.
        │         See: getting-started.qmd
        │
        └── NO
            └── You need to create training data first.
                Consider generating preference pairs with an LLM judge,
                or writing a reward function for GRPO.
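To make the paired/unpaired distinction in the tree concrete, a single training record for each branch might look like the sketch below. Field names vary by dataset format; these are illustrative, not a fixed schema:

```yaml
# DPO: paired — one prompt with one chosen and one rejected response
dpo_record:
  prompt: "Explain photosynthesis."
  chosen: "Photosynthesis converts light energy into chemical energy..."
  rejected: "Plants eat sunlight."

# KTO: unpaired — each record is a single response with a binary label
kto_record:
  prompt: "Explain photosynthesis."
  completion: "Photosynthesis converts light energy into chemical energy..."
  label: true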
Tip: When in doubt, start with SFT. It is the most straightforward method and works well for most tasks. You can always move to preference learning or RL later to further refine behavior.
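A minimal SFT starting point might look like the following sketch. The model and data paths are placeholders, and the alpaca dataset type assumes instruction-style records:

```yaml
base_model: ./models/my-base-model   # placeholder
datasets:
  - path: ./data/instructions.jsonl  # placeholder
    type: alpaca
adapter: qlora
load_in_4bit: true
micro_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
```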

2.1 Method Comparison at a Glance

| Criterion | SFT | DPO | KTO | GRPO |
|---|---|---|---|---|
| Data complexity | Low (input-output pairs) | Medium (preference pairs) | Medium (binary labels) | Low (prompts + reward code) |
| Compute cost | Low | Medium | Medium | High (requires vLLM server) |
| Learning signal | Supervised | Contrastive | Contrastive | Online reward |
| Online generation | No | No | No | Yes |
| Reward model needed | No | No | No | No (uses reward functions) |
| Best for | Task adaptation, instruction following | Safety, style alignment | Unpaired preference data | Reasoning, math, code |
Note: ORPO is an alternative to DPO that combines SFT and preference optimization in a single training stage, removing the need for a separate SFT step. Configure with rl: orpo. See rlhf.qmd for details.
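A hedged sketch of an ORPO run, reusing the same paired chosen/rejected data a DPO run would consume (the dataset path is a placeholder):

```yaml
rl: orpo
datasets:
  - path: ./data/preferences.jsonl   # placeholder: chosen/rejected pairs
adapter: lora
lora_r: 32
lora_alpha: 64
```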

3 Adapter Selection

Once you have chosen a method, decide how to apply the parameter updates. The three main options trade off VRAM usage against model quality.

3.1 QLoRA

  • How it works: The base model is loaded in 4-bit (NF4) quantization. Small low-rank adapter matrices are trained in higher precision on top.
  • VRAM savings: Roughly 4x reduction in model memory compared to full fine-tuning.
  • Quality: Slight degradation due to quantization noise, but often negligible for task-specific fine-tuning.
  • When to use: When your GPU cannot fit the model in full precision, or when you want fast experimentation.
```yaml
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64
lora_target_linear: true
```

3.2 LoRA

  • How it works: The base model is loaded at full precision (or 8-bit). Low-rank adapter matrices are trained alongside.
  • VRAM savings: Roughly 2-3x reduction compared to full fine-tuning (model weights are frozen, only adapters + optimizer states for adapters are stored).
  • Quality: Very close to full fine-tuning for most tasks, especially with higher rank values.
  • When to use: When you have enough VRAM for the base model but not for full optimizer states.
```yaml
adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_linear: true
```
Tip: For GRPO training, LoRA is strongly recommended. The vLLM server needs to sync weights from the trainer, and LoRA sync (trl.vllm_lora_sync: true) is far more efficient than syncing full merged weights. See vLLM Serving for details.
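Putting the tip together with the LoRA settings above, a GRPO run might include settings along these lines. The exact nesting of the sync flag may differ by Axolotl version; treat this as a sketch:

```yaml
rl: grpo
adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_linear: true
trl:
  vllm_lora_sync: true   # sync only LoRA weights to the vLLM server, not merged weights
```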

3.3 Full Fine-Tuning

  • How it works: All model parameters are updated during training. No adapters.
  • VRAM savings: None. Requires memory for model weights, gradients, and optimizer states (roughly 4x model size in bf16 with AdamW).
  • Quality: Highest potential quality, especially for large distribution shifts.
  • When to use: When you have ample GPU memory or multi-GPU setups, and need maximum performance. Also required for pre-training.
```yaml
# No adapter or load_in_* lines needed
micro_batch_size: 1
gradient_accumulation_steps: 16
```

3.4 Quick Comparison

| Criterion | QLoRA | LoRA | Full |
|---|---|---|---|
| Trainable params | ~0.1-1% | ~0.1-1% | 100% |
| Model memory | ~25% of full | ~50-100% of full | 100% |
| Optimizer memory | Tiny (adapters only) | Tiny (adapters only) | 2x model size (AdamW) |
| Training speed | Slower (dequantization overhead) | Baseline | Faster per step (no adapter overhead) |
| Inference | Merge or serve with adapter | Merge or serve with adapter | Direct |
| Multi-GPU required? | Rarely | For 13B+ models | For 7B+ models |

4 Hardware Mapping

The tables below provide approximate GPU memory requirements. Actual usage depends on context length, batch size, and optimizer choice.

4.1 SFT / Preference Learning

| Model Size | QLoRA (4-bit) | LoRA (bf16) | Full (bf16 + AdamW) |
|---|---|---|---|
| 1-3B | 6-8 GB | 8-12 GB | 24-32 GB |
| 7-8B | 10-14 GB | 16-24 GB | 60-80 GB |
| 13-14B | 16-20 GB | 28-40 GB | 120+ GB |
| 30-34B | 24-32 GB | 64-80 GB | 2-4x 80 GB |
| 70-72B | 40-48 GB | 2x 80 GB | 4-8x 80 GB |
Important: These estimates assume a short context length (512-2048 tokens) and micro_batch_size of 1-2. Longer sequences and larger batches increase memory significantly due to activations. Use gradient checkpointing to reduce activation memory at the cost of ~30% slower training.
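The memory-saving knobs mentioned above map to config keys like these (values are illustrative):

```yaml
sequence_len: 2048            # shorter sequences mean less activation memory
micro_batch_size: 1
gradient_checkpointing: true  # ~30% slower training, much lower activation memory
```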

4.2 GRPO (RL Training)

GRPO requires additional GPU(s) for the vLLM generation server. Plan for at least two GPUs: one for training, one for vLLM.

| Model Size | Training GPU (LoRA, bf16) | vLLM GPU | Total GPUs |
|---|---|---|---|
| 0.5-3B | 1x 24 GB | 1x 24 GB | 2x 24 GB |
| 7-8B | 1x 80 GB | 1x 80 GB | 2x 80 GB |
| 13-14B | 1-2x 80 GB | 1-2x 80 GB | 2-4x 80 GB |
| 30-72B | 2-4x 80 GB (FSDP/DeepSpeed) | 2-4x 80 GB (tensor parallel) | 4-8x 80 GB |
Tip: For single-GPU GRPO, use vllm_mode: colocate with vllm_enable_sleep_mode: true. The vLLM engine shares the GPU and offloads VRAM when not generating. This works for smaller models (up to ~3B on a 24 GB GPU) but is slower than the two-GPU server mode.
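The single-GPU setup described in the tip corresponds to settings along these lines (key placement may vary by Axolotl version):

```yaml
rl: grpo
vllm_mode: colocate            # share one GPU between the trainer and vLLM
vllm_enable_sleep_mode: true   # offload vLLM VRAM while training steps run
```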

4.3 Multi-GPU Threshold

You need multi-GPU training when:

  • Full fine-tuning of models 7B+ (use FSDP or DeepSpeed ZeRO)
  • LoRA of models 30B+ (or 13B+ with long contexts)
  • GRPO almost always (separate vLLM server), unless using colocate mode

See Multi-GPU Training for FSDP and DeepSpeed configuration.
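As a hedged example of enabling one of the multi-GPU strategies above, a DeepSpeed ZeRO run is typically a single extra line pointing at a ZeRO JSON config (the path below is a placeholder; Axolotl ships sample configs):

```yaml
deepspeed: deepspeed_configs/zero3.json   # placeholder path to a ZeRO stage-3 config
```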