vLLM Serving for GRPO Training

How to configure and run vLLM as a generation backend for GRPO reinforcement learning in Axolotl.

1 Overview

GRPO (Group Relative Policy Optimization) trains a language model by generating completions, scoring them with reward functions, and updating the policy to favor higher-reward outputs. The generation step is the bottleneck: producing thousands of tokens per training step with the policy model is slow using standard HuggingFace generation.

Axolotl uses vLLM as a high-throughput generation backend. vLLM runs as a separate process (either on a dedicated GPU or colocated on the training GPU) and serves completions via an HTTP API. The trainer sends prompts to vLLM, receives completions, scores them, and performs gradient updates.

┌──────────────────────┐       HTTP        ┌──────────────────────┐
│   Trainer (GPU 1)    │ ────────────────► │   vLLM Server (GPU 0)│
│                      │  prompts/compls   │                      │
│  - Policy model      │ ◄──────────────── │  - Same base model   │
│  - Reward scoring    │                   │  - Fast generation   │
│  - Gradient updates  │  weight sync      │  - LoRA adapter      │
│  - LoRA adapter      │ ─────────────────►│    (periodically     │
│                      │  (every N steps)  │     updated)         │
└──────────────────────┘                   └──────────────────────┘
Important

vLLM must serve the same base model specified in your training config. If the models do not match, weight synchronization will silently produce incorrect results.

2 Server Mode

Server mode runs vLLM as an external process on dedicated GPU(s). This is the recommended configuration for most setups.

2.1 Starting the Server

Use the axolotl vllm-serve command with your training config:

# Terminal 1: Start vLLM on GPU 0
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve grpo_config.yaml

# Terminal 2: Start training on GPU 1
CUDA_VISIBLE_DEVICES=1 axolotl train grpo_config.yaml

The server reads vLLM settings from the vllm: section of your config and starts an HTTP server (default: http://0.0.0.0:8000).

Tip

Use tmux or screen to manage the vLLM server process. Typical startup time is 30-90 seconds depending on model size and whether CUDA graphs are captured.

2.2 Minimal Server Config

base_model: Qwen/Qwen2.5-1.5B-Instruct

vllm:
  host: 0.0.0.0
  port: 8000
  gpu_memory_utilization: 0.85
  dtype: auto
  max_model_len: 4096

rl: grpo
trl:
  use_vllm: true
  vllm_server_host: 0.0.0.0
  vllm_server_port: 8000
  vllm_server_timeout: 300

2.3 Multi-GPU vLLM

For larger models, use tensor parallelism across multiple GPUs:

vllm:
  tensor_parallel_size: 2
  gpu_memory_utilization: 0.85
# vLLM on GPUs 2,3; training on GPUs 0,1
CUDA_VISIBLE_DEVICES=2,3 axolotl vllm-serve grpo_config.yaml
CUDA_VISIBLE_DEVICES=0,1 axolotl train grpo_config.yaml --num-processes 2
Note

Due to how TRL maps vLLM device indices, the vLLM instance should use the last N GPUs (highest device indices), while training uses the first N.

3 Colocate Mode

Colocate mode runs vLLM on the same GPU as the trainer. This is useful when you only have a single GPU.

trl:
  use_vllm: true
  vllm_mode: colocate
  vllm_enable_sleep_mode: true

With vllm_enable_sleep_mode: true, vLLM offloads its VRAM allocation when not actively generating, freeing memory for training. When the trainer needs new completions, vLLM wakes up and reclaims VRAM.

Warning

Colocate mode is significantly slower than server mode because generation and training cannot overlap. The GPU alternates between the two workloads. This mode is practical only for smaller models (up to ~3B on a 24 GB GPU).

When to use colocate mode:

  • You have exactly one GPU
  • The model fits in memory with both vLLM and training active (with sleep mode), or is small enough to time-share
  • You accept the performance tradeoff for simpler setup (no separate vLLM process to manage)

When to use server mode:

  • You have two or more GPUs
  • You want maximum throughput (generation overlaps with training via async prefetch)
  • You are running larger models (7B+)

4 LoRA Sync

LoRA sync is the recommended weight synchronization method when training with LoRA adapters. Instead of merging adapter weights into the base model and broadcasting the full merged weights over NCCL, it saves only the LoRA adapter files to the filesystem and tells vLLM to load them natively.

4.1 How It Works

  1. The trainer calls model.save_pretrained() to write the LoRA adapter weights to a temporary directory
  2. The trainer sends an HTTP POST to /set_lora_adapter/ on the vLLM server
  3. vLLM loads the adapter using its native LoRA support (Punica kernels)
  4. Generation uses the updated adapter on the next request
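The save-then-notify flow in steps 1-2 can be sketched with the standard library. The /set_lora_adapter/ endpoint is the one described above; the JSON payload shape and the function names are illustrative assumptions, not Axolotl's actual client code:

```python
import json
import urllib.request

def build_lora_sync_request(adapter_dir, server_url="http://localhost:8000"):
    """Build the HTTP POST for step 2. The /set_lora_adapter/ path is from
    this guide; the JSON payload shape is an assumption for illustration."""
    body = json.dumps({"lora_path": adapter_dir}).encode()
    return urllib.request.Request(
        f"{server_url}/set_lora_adapter/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def sync_lora_adapter(model, adapter_dir, server_url="http://localhost:8000"):
    model.save_pretrained(adapter_dir)            # step 1: write adapter to disk
    req = build_lora_sync_request(adapter_dir, server_url)
    with urllib.request.urlopen(req) as resp:     # step 2: notify vLLM
        return resp.status                        # steps 3-4 happen server-side
```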

4.2 Benefits

  • Smaller sync payload: Transfers ~40 MB of LoRA weights instead of ~1.4 GB+ of merged model weights (for a typical 0.5-3B model)
  • No NCCL communicator: Eliminates the need for a cross-GPU NCCL communication channel, removing GPU contention between vLLM generation and weight sync
  • Faster sync: ~200 ms per sync vs. 350 ms to 5+ seconds for NCCL merge sync
  • Simpler multi-GPU: No need to set up NCCL groups between trainer and vLLM processes
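The payload figures follow directly from how LoRA is parameterized: each targeted (out × in) linear layer adds only an r × in matrix A and an out × r matrix B. A back-of-envelope estimate, using assumed dimensions loosely resembling a 1.5B GQA transformer (illustrative numbers, not an exact architecture):

```python
def lora_param_count(shapes, r):
    """Parameters in a LoRA adapter: each targeted (out, in) linear layer
    gains an A matrix (r x in) and a B matrix (out x r)."""
    return sum(r * n_in + n_out * r for n_out, n_in in shapes)

# Assumed dimensions loosely resembling a 1.5B GQA transformer:
# hidden 1536, kv dim 256, MLP 8960, 28 layers, all linears targeted.
h, kv, m = 1536, 256, 8960
block = [(h, h), (kv, h), (kv, h), (h, h),   # q, k, v, o projections
         (m, h), (m, h), (h, m)]             # gate, up, down projections
params = lora_param_count(block * 28, r=16)
print(f"{params / 1e6:.1f}M params -> ~{params * 2 / 1e6:.0f} MB in bf16")
# 18.5M params -> ~37 MB in bf16
```

At rank 16 this lands near the ~40 MB figure above; doubling the rank (e.g. the r: 32 used in the config below) roughly doubles the payload, still far below full merged weights.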

4.3 Configuration

adapter: lora
lora_r: 32
lora_alpha: 64
lora_target_linear: true

trl:
  vllm_lora_sync: true    # Enables LoRA sync mode
  vllm_sync_interval: 5   # Sync every 5 training steps

Setting vllm_lora_sync: true automatically selects the LoRA-aware vLLM serve script (axolotl.scripts.vllm_serve_lora). You do not need to set vllm.serve_module manually.

Important

LoRA sync requires that you are training with a LoRA adapter (adapter: lora or adapter: qlora). It is not applicable to full fine-tuning.

5 Weight Synchronization

During GRPO training, the policy model on the trainer is continuously updated via gradient steps. The vLLM server, however, still holds the old weights. Periodically, the trainer must push updated weights to vLLM so that future generations reflect the improved policy.

5.1 Sync Interval

The vllm_sync_interval parameter controls how often weights are synced:

trl:
  vllm_sync_interval: 5   # Sync every 5 optimizer steps

Tradeoffs:

  • Lower interval (e.g., 1-3): Fresher generations, better on-policy data, but more sync overhead per step
  • Higher interval (e.g., 5-10): Less overhead, but generations become increasingly off-policy between syncs
  • Recommended: 3-5 for most setups. Axolotl includes importance sampling correction (vllm_importance_sampling_correction: true) to handle mild distribution mismatch from stale vLLM weights.
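To see why higher intervals mean staler generations: if syncs fire on steps divisible by the interval (an assumption about the exact scheduling), weight staleness cycles like this:

```python
def weight_staleness(step, sync_interval):
    """Optimizer steps elapsed since vLLM last received weights, assuming a
    sync fires on every step divisible by sync_interval (an assumption
    about Axolotl's exact scheduling)."""
    return step % sync_interval

# With vllm_sync_interval: 5, staleness over ten steps:
print([weight_staleness(t, 5) for t in range(10)])
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```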

5.2 Sync Methods

| Method          | Config                 | Payload                        | Mechanism                     | Typical Time |
|-----------------|------------------------|--------------------------------|-------------------------------|--------------|
| LoRA sync       | vllm_lora_sync: true   | LoRA adapter only (~40 MB)     | Filesystem + HTTP             | ~200 ms      |
| NCCL merge sync | Default (no lora_sync) | Full merged weights (~1.4 GB+) | HTTP trigger + NCCL broadcast | 350 ms - 5 s |
Tip

If you are training with LoRA (which is recommended for GRPO), always enable vllm_lora_sync: true. The performance difference is substantial, especially as training progresses and NCCL contention increases.

5.3 Importance Sampling Correction

When vLLM weights are stale (between syncs), the generated data is slightly off-policy. Axolotl can correct for this:

trl:
  vllm_importance_sampling_correction: true
  importance_sampling_level: token          # 'token' or 'sequence'
  off_policy_mask_threshold: 0.5            # KL threshold for masking stale sequences

  • Token-level IS is recommended when using Liger kernel (sequence-level has numerical issues with chunked computation)
  • Off-policy sequence masking (OPSM) drops sequences that have diverged too far from the current policy, providing a safety net against stale data
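Conceptually, token-level IS reweights each token by the probability ratio between the current policy and the stale vLLM policy, while OPSM drops whole sequences whose divergence exceeds the threshold. A minimal plain-Python sketch (illustrative only; the trainer operates on log-prob tensors, and the exact threshold semantics may differ):

```python
import math

def is_weights_and_mask(logp_new, logp_old, threshold=0.5):
    """Token-level importance weights plus an off-policy sequence mask.
    logp_new / logp_old: per-token log-probs under the current policy and
    the stale vLLM policy. Illustrative sketch, not Axolotl's exact code."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    # Sequence-level divergence estimate: mean per-token log-ratio
    # (a simple KL proxy; the real masking criterion may differ).
    divergence = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    keep = abs(divergence) <= threshold   # OPSM: drop if too far off-policy
    return ratios, keep

ratios, keep = is_weights_and_mask([-1.0, -2.0], [-1.1, -2.3], threshold=0.5)
print([round(r, 2) for r in ratios], keep)  # [1.11, 1.35] True
```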

6 Restart Requirements

Warning

vLLM must be restarted between training runs. Weight syncs from a previous run leave the server in a corrupted state. If you start a new training run against a stale vLLM server, the model may fail to learn.

6.1 When to Restart

  • Before every new training experiment
  • After a training run crashes or is interrupted
  • If you change the base model in your config

6.2 How to Restart

Killing vLLM reliably requires terminating both the main process and its background EngineCore subprocess:

# Kill all vLLM-related processes
pkill -9 -f "vllm|EngineCore"

# Verify GPU memory is freed
nvidia-smi

# Restart the server
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve grpo_config.yaml
Tip

A single kill often does not fully stop vLLM. Always use kill -9 and verify with nvidia-smi that GPU memory has been released before restarting.

6.3 Health Check

The vLLM server exposes a health endpoint. Wait for it to return 200 before starting training:

# For the LoRA serve script (trailing slash required)
curl http://localhost:8000/health/

# For the default TRL serve script
curl http://localhost:8000/health
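The wait can be scripted so training starts only after the server reports healthy. A small standard-library helper (the endpoint path matches the LoRA serve script above; the polling policy is an assumption):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(url="http://localhost:8000/health/", timeout_s=300, poll_s=5):
    """Block until the vLLM health endpoint returns HTTP 200, or raise.
    Startup can take 30-90+ seconds, so a generous timeout is sensible."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    raise TimeoutError(f"vLLM server at {url} not healthy after {timeout_s}s")
```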

7 Configuration Reference

7.1 vLLM Server Options (vllm: section)

These control the vLLM server process started by axolotl vllm-serve.

| Option | Type | Default | Description |
|---|---|---|---|
| host | str | 0.0.0.0 | Host address for the vLLM server |
| port | int | 8000 | Port for the vLLM server |
| device | str | auto | Device to use for vLLM |
| tensor_parallel_size | int | None | Number of GPUs for tensor parallelism |
| data_parallel_size | int | None | Number of data parallel replicas |
| gpu_memory_utilization | float | 0.9 | Fraction of GPU memory for vLLM (0.0-1.0) |
| dtype | str | auto | Data type (auto, float16, bfloat16) |
| max_model_len | int | None | Maximum model context length. Set explicitly if the default is too large for your GPU |
| enable_prefix_caching | bool | None | Enable prefix caching for repeated prompt prefixes |
| enable_reasoning | bool | None | Enable reasoning mode for models with thinking tokens |
| reasoning_parser | str | None | Parser for reasoning output |
| enforce_eager | bool | None | Disable CUDA graph capture (required for some architectures like Qwen3.5 hybrid attention) |
| serve_module | str | None | Python module for vLLM serve script. Auto-set when vllm_lora_sync: true |
| worker_extension_cls | str | None | vLLM worker extension class for weight sync |

7.2 Trainer vLLM Options (trl: section)

These control how the trainer interacts with vLLM.

| Option | Type | Default | Description |
|---|---|---|---|
| use_vllm | bool | false | Enable vLLM for generation |
| vllm_mode | str | None | server (external process) or colocate (same GPU) |
| vllm_server_host | str | 0.0.0.0 | Host of the vLLM server to connect to |
| vllm_server_port | int | 8000 | Port of the vLLM server to connect to |
| vllm_server_timeout | int | None | Timeout in seconds for vLLM requests |
| vllm_lora_sync | bool | false | Sync LoRA adapters via filesystem instead of NCCL merge |
| vllm_sync_interval | int | None | Sync weights every N optimizer steps |
| vllm_enable_sleep_mode | bool | None | Offload vLLM VRAM when idle (colocate mode) |
| vllm_guided_decoding_regex | str | None | Regex constraint for guided decoding |

For async pipeline and off-policy correction options, see the GRPO Configuration Reference.

8 Complete Example

For a full working GRPO config including vLLM, LoRA sync, async generation, rewards, and dataset setup, see the GRPO Quick Start. That config includes all the vLLM settings covered in this guide.

# Terminal 1: Start vLLM
CUDA_VISIBLE_DEVICES=0 axolotl vllm-serve grpo_config.yaml

# Wait for health check to pass
curl http://localhost:8000/health/

# Terminal 2: Start training
CUDA_VISIBLE_DEVICES=1 axolotl train grpo_config.yaml

9 Troubleshooting

| Problem | Likely Cause | Solution |
|---|---|---|
| Training hangs waiting for vLLM | Server not started or wrong port | Check curl http://localhost:8000/health/ and verify vllm_server_host/vllm_server_port match |
| OOM on vLLM GPU | gpu_memory_utilization too high or max_model_len too large | Reduce gpu_memory_utilization to 0.7 or set max_model_len explicitly |
| OOM on training GPU | Batch too large for policy logprobs | Reduce micro_batch_size or num_generations |
| Accuracy stays at zero | Stale vLLM from previous run | Restart vLLM: pkill -9 -f "vllm\|EngineCore", verify with nvidia-smi, restart |
| ResponseValidationError from vLLM | Missing logprobs in response | Ensure you are using the correct serve module (auto-selected with vllm_lora_sync: true) |
| Weight sync takes 5+ seconds | NCCL contention with vLLM generation | Switch to vllm_lora_sync: true to eliminate NCCL |
| async_prefetch deadlocks with FSDP | Background threads run unsynchronized FSDP collectives | Set async_prefetch: false when using FSDP or DeepSpeed multi-GPU |