core.trainers.grpo.fast_async_trainer

Experimental GRPO extensions: parallel reward workers, replay buffer, deferred re-roll, and zero-advantage skipping.

These features are built as subclasses of GRPOTrainer and GRPODataProducer, using the hook system (_compute_rewards_for_batch, _post_advantage_hook, _pre_produce_hook) defined in the base classes.

Classes

Name                  Description
FastAsyncGRPOConfig   GRPOConfig with additional experimental parameters.
FastAsyncGRPOTrainer  GRPOTrainer with experimental extensions.
RerollDataProducer    GRPODataProducer that injects re-roll candidates into prompt batches.

FastAsyncGRPOConfig

core.trainers.grpo.fast_async_trainer.FastAsyncGRPOConfig(
    use_data_producer=False,
    async_prefetch=False,
    prefetch_depth=1,
    vllm_sync_interval=1,
    batch_flattening=False,
    streaming_partial_batch=False,
    streaming_min_groups=1,
    vllm_importance_sampling_correction=True,
    vllm_importance_sampling_mode='token_truncate',
    vllm_importance_sampling_cap=3.0,
    off_policy_mask_threshold=None,
    use_bias_correction_kl=False,
    reward_num_workers=1,
    replay_buffer_size=0,
    replay_recompute_logps=True,
    reroll_start_fraction=0.5,
    reroll_max_groups=1,
    skip_zero_advantage_batches=True,
    vllm_lora_sync=False,
)

GRPOConfig with additional experimental parameters.

FastAsyncGRPOTrainer

core.trainers.grpo.fast_async_trainer.FastAsyncGRPOTrainer(*args, **kwargs)

GRPOTrainer with experimental extensions.

Adds:

- Parallel reward subprocess workers (reward_num_workers)
- Replay buffer for high-signal group reuse (replay_buffer_size)
- Deferred re-roll of failed prompts (reroll_start_fraction)
- Zero-advantage micro-batch skipping
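
To illustrate the zero-advantage skipping listed above, here is a minimal sketch. It assumes the usual GRPO group-relative advantage (reward minus group mean, divided by group standard deviation). The helper names `group_advantages` and `should_skip_micro_batch`, the `eps` and `tol` values, and the use of the population standard deviation are all hypothetical choices for illustration, not the trainer's actual API.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages for one prompt group (assumed formulation)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def should_skip_micro_batch(advantages, tol=1e-8):
    """True when every advantage is ~0, so the advantage term gives no signal."""
    return all(abs(a) <= tol for a in advantages)


# All completions scored the same: advantages collapse to zero.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
# Mixed rewards: advantages are non-zero and sum to ~0 within the group.
mixed = group_advantages([0.0, 1.0, 0.0, 1.0])

skip_uniform = should_skip_micro_batch(uniform)
skip_mixed = should_skip_micro_batch(mixed)
```

When every completion in a group receives the same reward, the advantage-weighted part of the loss is zero, so skipping the forward and backward pass for such a micro-batch saves compute without changing that term.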

Methods

Name                Description
compute_liger_loss  Liger loss with zero-adv skipping and off-policy sequence masking (OPSM).

compute_liger_loss

core.trainers.grpo.fast_async_trainer.FastAsyncGRPOTrainer.compute_liger_loss(
    unwrapped_model,
    inputs,
)

Liger loss with zero-adv skipping and off-policy sequence masking (OPSM).

The base class Liger path doesn’t support OPSM because the fused kernel doesn’t expose the per-token logprobs needed for the KL computation. This override computes them with a chunked lm_head matmul (no gradients, low memory) and applies the OPSM to the loss mask before calling the kernel.
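
The two steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the override's code: the function names are hypothetical, lists stand in for tensors, and the masking rule (drop a sequence when its advantage is negative and its mean per-token log-ratio between the sampling policy and the current policy exceeds a threshold) is one common formulation of off-policy sequence masking that the real implementation may not match exactly.

```python
import math


def chunked_token_logps(hidden, lm_head, targets, chunk_size=2):
    """Log-prob of each target token, projecting `chunk_size` positions at a time.

    hidden:  T hidden vectors of length D
    lm_head: V weight rows of length D
    Only one chunk of logits (chunk_size x V) exists at any moment.
    """
    logps = []
    for start in range(0, len(hidden), chunk_size):
        chunk = hidden[start:start + chunk_size]
        logits = [[sum(w * x for w, x in zip(row, h)) for row in lm_head]
                  for h in chunk]
        for row_logits, tok in zip(logits, targets[start:start + chunk_size]):
            m = max(row_logits)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(v - m) for v in row_logits))
            logps.append(row_logits[tok] - lse)
    return logps


def off_policy_keep(advantage, cur_logps, old_logps, mask, threshold):
    """1.0 keeps the sequence in the loss, 0.0 masks it out (assumed rule)."""
    n = max(sum(mask), 1)
    avg_kl = sum((o - c) * m for o, c, m in zip(old_logps, cur_logps, mask)) / n
    return 1.0 if (advantage >= 0 or avg_kl <= threshold) else 0.0


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
targets = [0, 1, 2]
mask = [1, 1, 1]

cur = chunked_token_logps(hidden, lm_head, targets, chunk_size=2)
old = [lp + 1.0 for lp in cur]  # pretend the sampler was far more confident

keep = off_policy_keep(-1.0, cur, old, mask, threshold=0.5)
loss_mask = [m * keep for m in mask]  # what would be handed to the fused kernel
```

Chunking bounds peak memory by the chunk's logits rather than the full sequence-by-vocabulary matrix, which is why the extra pass can stay cheap when it runs without gradients.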

RerollDataProducer

core.trainers.grpo.fast_async_trainer.RerollDataProducer(
    config,
    prompt_dataset,
    *,
    num_generations,
    generation_batch_size,
    train_batch_size,
    steps_per_generation,
    shuffle_dataset,
    seed,
)

GRPODataProducer that injects re-roll candidates into prompt batches.

Reads from the trainer’s _reroll_buffer (populated by FastAsyncGRPOTrainer._post_advantage_hook) and replaces the last N prompt groups with previously failed prompts.
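
The replacement step can be sketched as follows. This is a minimal illustration, assuming the buffer behaves like a FIFO queue and that N is capped by a `max_groups` argument (plausibly corresponding to reroll_max_groups, though that mapping is an assumption). The function name `inject_rerolls` is hypothetical, and the gating implied by reroll_start_fraction is not modelled.

```python
from collections import deque


def inject_rerolls(prompt_groups, reroll_buffer, max_groups=1):
    """Replace the last N prompt groups with buffered, previously failed prompts.

    N is the smallest of `max_groups`, the buffer length, and the batch length.
    """
    n = min(max_groups, len(reroll_buffer), len(prompt_groups))
    if n == 0:
        return list(prompt_groups)  # nothing to inject
    rerolls = [reroll_buffer.popleft() for _ in range(n)]
    return list(prompt_groups[:-n]) + rerolls


buffer = deque(["failed-a", "failed-b"])
batch = inject_rerolls(["p1", "p2", "p3", "p4"], buffer, max_groups=1)
```

Replacing the tail of the batch rather than appending keeps the batch size constant, so downstream shapes such as generation_batch_size are unaffected.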