core.trainers.grpo.fast_async_trainer
Experimental GRPO extensions: parallel reward workers, replay buffer, deferred re-roll, and zero-advantage skipping.
These features are built as subclasses of GRPOTrainer and GRPODataProducer, using the hook system (_compute_rewards_for_batch, _post_advantage_hook, _pre_produce_hook) defined in the base classes.
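The hook pattern described above can be illustrated with a minimal, self-contained sketch. The two classes here are hypothetical stand-ins, not the real GRPOTrainer or FastAsyncGRPOTrainer. Only the hook names are taken from this page; their signatures, the length-based reward, and the mean-centred advantage are assumptions made for illustration.

```python
class BaseTrainerSketch:
    """Hypothetical stand-in for a base trainer that exposes hook methods."""

    def _compute_rewards_for_batch(self, completions):
        # Placeholder reward (assumption): the length of each completion.
        return [float(len(c)) for c in completions]

    def _post_advantage_hook(self, rewards, advantages):
        # Default hook does nothing; subclasses override it.
        return None

    def step(self, completions):
        rewards = self._compute_rewards_for_batch(completions)
        mean = sum(rewards) / len(rewards)
        advantages = [r - mean for r in rewards]
        # The base class calls the hook; it never needs to know what it does.
        self._post_advantage_hook(rewards, advantages)
        return advantages


class ExtendedTrainerSketch(BaseTrainerSketch):
    """Adds behaviour purely by overriding a hook."""

    def __init__(self):
        self.zero_signal_groups = []

    def _post_advantage_hook(self, rewards, advantages):
        # Record groups that carry no learning signal (all advantages zero).
        if all(a == 0.0 for a in advantages):
            self.zero_signal_groups.append(list(rewards))


trainer = ExtendedTrainerSketch()
flat = trainer.step(["ab", "cd"])    # equal rewards, so zero advantages
mixed = trainer.step(["a", "abc"])   # unequal rewards, so non-zero advantages
```

The point of the design is that the training step stays in the base class, so an extension only has to override the hook it cares about.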
Classes
| Name | Description |
|---|---|
| FastAsyncGRPOConfig | GRPOConfig with additional experimental parameters. |
| FastAsyncGRPOTrainer | GRPOTrainer with experimental extensions. |
| RerollDataProducer | GRPODataProducer that injects re-roll candidates into prompt batches. |
FastAsyncGRPOConfig
core.trainers.grpo.fast_async_trainer.FastAsyncGRPOConfig(
use_data_producer=False,
async_prefetch=False,
prefetch_depth=1,
vllm_sync_interval=1,
batch_flattening=False,
streaming_partial_batch=False,
streaming_min_groups=1,
vllm_importance_sampling_correction=True,
vllm_importance_sampling_mode='token_truncate',
vllm_importance_sampling_cap=3.0,
off_policy_mask_threshold=None,
use_bias_correction_kl=False,
reward_num_workers=1,
replay_buffer_size=0,
replay_recompute_logps=True,
reroll_start_fraction=0.5,
reroll_max_groups=1,
skip_zero_advantage_batches=True,
vllm_lora_sync=False,
)
GRPOConfig with additional experimental parameters.
FastAsyncGRPOTrainer
core.trainers.grpo.fast_async_trainer.FastAsyncGRPOTrainer(*args, **kwargs)
GRPOTrainer with experimental extensions.
Adds:
- Parallel reward subprocess workers (reward_num_workers)
- Replay buffer for high-signal group reuse (replay_buffer_size)
- Deferred re-roll of failed prompts (reroll_start_fraction)
- Zero-advantage micro-batch skipping (skip_zero_advantage_batches)
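To see why zero-advantage skipping is safe, consider a minimal sketch of group-relative advantages. It assumes advantages are the group-normalised rewards, with a population standard deviation and a small epsilon; the trainer's exact normalisation may differ. The function names and the `tol` parameter are hypothetical. When every completion in a group receives the same reward, every advantage is zero, so the micro-batch contributes no policy gradient and can be skipped.

```python
def group_advantages(rewards, num_generations, eps=1e-4):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = sum(group) / len(group)
        std = (sum((r - mean) ** 2 for r in group) / len(group)) ** 0.5
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


def should_skip(micro_batch_advantages, tol=0.0):
    """True when no sample in the micro-batch carries a learning signal."""
    return all(abs(a) <= tol for a in micro_batch_advantages)


# Two prompts, two generations each: the first group is uniformly rewarded.
adv = group_advantages([1.0, 1.0, 0.0, 1.0], num_generations=2)
skip_first = should_skip(adv[:2])
skip_second = should_skip(adv[2:])
```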
Methods
| Name | Description |
|---|---|
| compute_liger_loss | Liger loss with zero-advantage skipping and off-policy sequence masking (OPSM). |
compute_liger_loss
core.trainers.grpo.fast_async_trainer.FastAsyncGRPOTrainer.compute_liger_loss(
unwrapped_model,
inputs,
)
Liger loss with zero-advantage skipping and off-policy sequence masking (OPSM).
The base class Liger path doesn’t support OPSM because the fused kernel doesn’t expose the per-token logprobs needed for the KL computation. This override computes them with a chunked lm_head matmul (no gradients, low memory) and applies the OPSM mask to the loss mask before calling the kernel.
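The chunking idea can be sketched with NumPy. This is an illustration under stated assumptions, not the trainer's implementation: the function names, argument shapes, and `chunk_size` are hypothetical, and the masking rule in `apply_sequence_mask` (drop a sequence when its mean log-ratio exceeds a threshold) is an assumed criterion, since this page does not spell out the exact OPSM rule.

```python
import numpy as np


def chunked_token_logprobs(hidden, lm_head_weight, token_ids, chunk_size=2):
    """Log-probability of each target token, computed chunk by chunk.

    hidden:         (T, H) final hidden states
    lm_head_weight: (V, H) output projection
    token_ids:      (T,)   target token id at each position
    Only a (chunk_size, V) block of logits exists at any one time.
    """
    out = np.empty(hidden.shape[0], dtype=np.float64)
    for start in range(0, hidden.shape[0], chunk_size):
        end = start + chunk_size
        logits = hidden[start:end] @ lm_head_weight.T          # (c, V)
        logits = logits - logits.max(axis=-1, keepdims=True)   # stabilise
        log_z = np.log(np.exp(logits).sum(axis=-1))            # (c,)
        ids = token_ids[start:end]
        out[start:end] = logits[np.arange(len(ids)), ids] - log_z
    return out


def apply_sequence_mask(loss_mask, old_logps, new_logps, threshold):
    """Zero the loss mask of sequences that drifted too far off-policy.

    ASSUMPTION: a sequence is dropped when its mean (old - new) log-ratio
    over unmasked tokens exceeds `threshold`.
    """
    diff = (old_logps - new_logps) * loss_mask                 # (B, T)
    per_seq = diff.sum(axis=1) / np.maximum(loss_mask.sum(axis=1), 1)
    keep = (per_seq <= threshold).astype(loss_mask.dtype)
    return loss_mask * keep[:, None]


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4))
weight = rng.normal(size=(7, 4))
ids = np.array([0, 3, 6, 2, 1])
logps = chunked_token_logprobs(hidden, weight, ids, chunk_size=2)

mask = apply_sequence_mask(
    np.ones((2, 3)),
    old_logps=np.zeros((2, 3)),
    new_logps=np.array([[0.0, 0.0, 0.0], [-1.0, -1.0, -1.0]]),
    threshold=0.5,
)
```

Because the chunked pass needs no gradients, peak memory scales with `chunk_size * V` rather than with the full sequence length times the vocabulary size.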
RerollDataProducer
core.trainers.grpo.fast_async_trainer.RerollDataProducer(
config,
prompt_dataset,
*,
num_generations,
generation_batch_size,
train_batch_size,
steps_per_generation,
shuffle_dataset,
seed,
)
GRPODataProducer that injects re-roll candidates into prompt batches.
Reads from the trainer’s _reroll_buffer (populated by FastAsyncGRPOTrainer._post_advantage_hook) and replaces the last N prompt groups with previously-failed prompts.
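The replacement step can be sketched as follows. This is a minimal illustration, not the producer's code: `inject_rerolls` and its arguments are hypothetical, the buffer is modelled as a FIFO `deque`, and `max_groups` is assumed to play the role of reroll_max_groups.

```python
from collections import deque


def inject_rerolls(prompt_groups, reroll_buffer, max_groups):
    """Replace the last N groups of a batch with buffered re-roll prompts.

    N is capped by max_groups, by the buffer size, and by the batch size,
    so the batch always keeps its original length.
    """
    n = min(max_groups, len(reroll_buffer), len(prompt_groups))
    if n == 0:
        # Nothing to inject; also avoids the prompt_groups[:-0] pitfall.
        return list(prompt_groups)
    rerolls = [reroll_buffer.popleft() for _ in range(n)]
    return list(prompt_groups[:-n]) + rerolls


buffer = deque(["failed-1", "failed-2"])
batch = ["p1", "p2", "p3"]
first = inject_rerolls(batch, buffer, max_groups=1)
second = inject_rerolls(batch, deque(), max_groups=1)
```

Replacing groups at the tail rather than appending keeps the generation batch size constant, so downstream batching is unaffected.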