utils.samplers.multipack

Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences into fixed-capacity batches to optimize memory usage and training throughput.

Classes

Name Description
MultipackBatchSampler Batch sampler class for efficient packing of variable-length sequences

MultipackBatchSampler

utils.samplers.multipack.MultipackBatchSampler(
    self,
    sampler,
    batch_size,
    batch_max_len,
    lengths,
    packing_efficiency_estimate=1.0,
    drop_last=False,
    num_count_samples=16,
    sequential=False,
    group_size=100000,
    bin_size=200,
    num_processes=None,
    safe_mode=True,
    **kwargs,
)

Batch sampler class for efficient packing of variable-length sequences

This sampler packs sequences into fixed-capacity bins (batches) to maximize GPU memory utilization and training throughput by reducing padding.

It supports both parallel packing (using the first-fit-decreasing, or FFD, algorithm) and sequential packing (which preserves the original sequence order).
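To make the padding argument concrete, the following minimal sketch compares the token slots consumed by conventional pad-to-longest batching with those consumed by packing into fixed-capacity bins. The helpers `padded_slots` and `packed_slots` are hypothetical names used only for this illustration and are not part of the module; the packing loop is a plain first-fit pass, not the library's implementation.

```python
def padded_slots(lengths, batch_size):
    """Token slots used when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def packed_slots(lengths, bin_capacity):
    """Token slots used by a simple first-fit packing into fixed-capacity bins."""
    free = []  # remaining free space in each open bin
    for length in lengths:
        for i, space in enumerate(free):
            if length <= space:
                free[i] -= length
                break
        else:
            # No open bin had room, so open a new one.
            free.append(bin_capacity - length)
    return len(free) * bin_capacity


lengths = [900, 100, 500, 300, 200, 24]          # 2024 real tokens
print(padded_slots(lengths, batch_size=2))       # 3200
print(packed_slots(lengths, bin_capacity=1024))  # 2048
```

In this toy case the same 2024 tokens occupy 3200 slots when padded but only 2048 slots when packed.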

Methods

Name Description
efficiency Calculate the packing efficiency (ratio of tokens used to total token slots)
gather_efficiency Gather and synchronize packing efficiency estimates across all distributed ranks
gather_len_batches Gather and synchronize batch counts across all distributed ranks
generate_batches Generate packed batches for training
set_epoch Set the epoch number, used for reproducible shuffling across epochs

efficiency
utils.samplers.multipack.MultipackBatchSampler.efficiency()

Calculate the packing efficiency (ratio of tokens used to total token slots). Higher is better: 1.0 means perfect packing with no wasted space.
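The ratio described above can be sketched as follows. The function name `packing_efficiency` and its arguments are assumptions made for this illustration; the actual method takes no arguments and reads statistics stored on the sampler.

```python
def packing_efficiency(total_tokens_used, total_token_slots):
    """Ratio of real tokens to available token slots; 1.0 means no wasted space."""
    if total_token_slots == 0:
        return 0.0  # nothing packed yet, so avoid dividing by zero
    return total_tokens_used / total_token_slots


# Three bins of capacity 2048 holding 1900, 2048 and 1500 tokens.
used = 1900 + 2048 + 1500
slots = 3 * 2048
print(round(packing_efficiency(used, slots), 3))  # 0.887
```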

gather_efficiency
utils.samplers.multipack.MultipackBatchSampler.gather_efficiency()

Gather and synchronize packing efficiency estimates across all distributed ranks. Returns a conservative efficiency estimate based on the measurements.

gather_len_batches
utils.samplers.multipack.MultipackBatchSampler.gather_len_batches(num)

Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
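Taking the minimum matters because every rank in a distributed job has to run the same number of steps. The sketch below simulates the idea with plain lists instead of real inter-process communication; `truncate_to_common_length` is a hypothetical helper, and the way the library actually exchanges counts between ranks is not shown here.

```python
def truncate_to_common_length(batches_per_rank):
    """Trim every rank's batch list to the length of the shortest one."""
    common = min(len(batches) for batches in batches_per_rank)
    return common, [batches[:common] for batches in batches_per_rank]


per_rank = [
    [[0, 1], [2], [3, 4]],  # rank 0 produced 3 batches
    [[5], [6, 7]],          # rank 1 produced 2 batches
]
common, trimmed = truncate_to_common_length(per_rank)
print(common)   # 2
print(trimmed)  # [[[0, 1], [2]], [[5], [6, 7]]]
```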

generate_batches
utils.samplers.multipack.MultipackBatchSampler.generate_batches(set_stats=False)

Generate packed batches for training

Parameters

Name Type Description Default
set_stats bool Whether to update efficiency statistics False

Returns

Name Type Description
List of batches, where each batch contains multiple bins, and each bin contains multiple sequence indices
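The nesting of the return value (batches, then bins, then sequence indices) can be illustrated with hand-written data. The `batches` and `lengths` values and the helper `bin_token_counts` below are made up for this example and are not output from the sampler.

```python
def bin_token_counts(batches, lengths):
    """Sum the sequence lengths in every bin, keeping the batch/bin nesting."""
    return [[sum(lengths[i] for i in bin_) for bin_ in batch] for batch in batches]


# Outer list: batches. Middle list: bins in a batch. Inner list: sequence indices.
batches = [
    [[0, 3], [1, 2, 5]],  # batch 0 holds two bins
    [[4], [6, 7]],        # batch 1 holds two bins
]
lengths = [700, 300, 400, 300, 1000, 200, 512, 512]

print(bin_token_counts(batches, lengths))  # [[1000, 900], [1000, 1024]]
```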

set_epoch
utils.samplers.multipack.MultipackBatchSampler.set_epoch(epoch)

Set the epoch number, used for reproducible shuffling across epochs
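One common way to get reproducible per-epoch shuffling is to derive the random seed from the epoch number. The sketch below shows that pattern with a hypothetical `EpochShuffler` class; it is an assumption about the general technique, not a description of how this sampler seeds its shuffling.

```python
import random


class EpochShuffler:
    """Toy sampler whose shuffle order depends only on (seed, epoch)."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def order(self):
        # A fresh generator per call keeps the order a pure function of the epoch.
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.num_items))
        rng.shuffle(indices)
        return indices


shuffler = EpochShuffler(num_items=10)
shuffler.set_epoch(3)
assert shuffler.order() == shuffler.order()  # same epoch, same order
```

Calling `set_epoch` with the same value on every rank, before iterating, gives all ranks an identical permutation to work from.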

Functions

Name Description
allocate_sequentially Sequential allocator that preserves example order
ffd_check First-fit-decreasing bin packing algorithm check
pack_group Pack a group of sequences into bins using the first-fit-decreasing algorithm
pack_parallel Pack sequences into bins using parallel processing

allocate_sequentially

utils.samplers.multipack.allocate_sequentially(
    sequence_lengths,
    rank,
    bin_capacity,
    num_ranks,
)

Sequential allocator that preserves example order

Parameters

Name Type Description Default
sequence_lengths np.ndarray The lengths of all examples required
rank int The current rank (for distributed training) required
bin_capacity int The capacity of each bin (maximum sequence length) required
num_ranks int Number of ranks (processes/GPUs) required

Returns

Name Type Description
rank_batches List of batches for the current rank
total_tokens_used Number of actual example tokens
total_token_slots Maximum theoretical number of example tokens (number of bins * bin capacity)
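A minimal pure-Python sketch of order-preserving allocation follows. It makes several assumptions that the reference above does not confirm: bins are dealt to ranks round-robin, the token totals cover all bins rather than only the current rank's, and every sequence fits in one bin. The name `allocate_sequentially_sketch` is hypothetical, and plain lists stand in for `np.ndarray`.

```python
def allocate_sequentially_sketch(sequence_lengths, rank, bin_capacity, num_ranks):
    bins = []              # closed bins, each a list of sequence indices
    current, used = [], 0  # the bin currently being filled
    for idx, length in enumerate(sequence_lengths):
        if length > bin_capacity:
            raise ValueError(f"sequence {idx} exceeds the bin capacity")
        if used + length > bin_capacity:
            # The next sequence does not fit: close this bin and start a new one.
            bins.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        bins.append(current)

    rank_batches = bins[rank::num_ranks]  # assumed round-robin split across ranks
    total_tokens_used = sum(sequence_lengths)
    total_token_slots = len(bins) * bin_capacity
    return rank_batches, total_tokens_used, total_token_slots


lengths = [400, 500, 300, 900, 100, 200]
print(allocate_sequentially_sketch(lengths, rank=0, bin_capacity=1000, num_ranks=2))
# ([[0, 1], [3, 4]], 2400, 4000)
```

Because indices are never reordered, each bin holds a contiguous run of examples, which costs some packing density compared with a sorted approach.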

ffd_check

utils.samplers.multipack.ffd_check(sequence_lengths, bin_capacity, num_bins)

First-fit-decreasing bin packing algorithm check

Checks if sequences with the given lengths could fit in the specified number of bins

Parameters

Name Type Description Default
sequence_lengths np.ndarray Array of sequence lengths required
bin_capacity int Maximum capacity of each bin required
num_bins int Number of bins available required

Returns

Name Type Description
True if all sequences can be packed, False otherwise
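The check can be sketched in plain Python as below. This is a textbook first-fit-decreasing feasibility test written for illustration under the name `ffd_check_sketch` (hypothetical); the library's own version may differ in implementation details.

```python
def ffd_check_sketch(sequence_lengths, bin_capacity, num_bins):
    """Return True if first-fit-decreasing fits every sequence into num_bins bins."""
    free = [bin_capacity] * num_bins
    # Place the longest sequences first; each goes into the first bin with room.
    for length in sorted(sequence_lengths, reverse=True):
        for i in range(num_bins):
            if free[i] >= length:
                free[i] -= length
                break
        else:
            return False  # no bin could take this sequence
    return True


print(ffd_check_sketch([600, 500, 400, 300, 200], bin_capacity=1000, num_bins=2))  # True
print(ffd_check_sketch([600, 600, 600], bin_capacity=1000, num_bins=2))            # False
```

First-fit-decreasing is a heuristic, so a `False` result means this greedy strategy failed, not necessarily that no valid packing exists.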

pack_group

utils.samplers.multipack.pack_group(
    sequence_lengths,
    group_offset,
    bin_capacity,
    max_bins,
    bin_size,
    safe_mode=True,
)

Pack a group of sequences into bins using the first-fit-decreasing algorithm

Parameters

Name Type Description Default
sequence_lengths np.ndarray Array of sequence lengths required
group_offset int Offset to apply to indices when returning results required
bin_capacity int Maximum capacity of each bin required
max_bins int Maximum number of bins to use required
bin_size int Maximum number of sequences per bin required
safe_mode bool If True, use a more conservative packing approach True

Returns

Name Type Description
List of bins, where each bin contains indices of sequences assigned to it
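The sketch below shows how `group_offset`, `bin_capacity` and `bin_size` could interact in a first-fit-decreasing pass. It is an illustration only: `pack_group_sketch` is a hypothetical name, it leaves out `max_bins` and `safe_mode` because their exact behavior is not described here, and it assumes every sequence is no longer than `bin_capacity`.

```python
def pack_group_sketch(sequence_lengths, group_offset, bin_capacity, bin_size):
    # Visit local indices from longest to shortest sequence.
    order = sorted(range(len(sequence_lengths)),
                   key=lambda i: sequence_lengths[i], reverse=True)
    bins = []  # each bin is a list of offset sequence indices
    free = []  # remaining token capacity of each bin
    for i in order:
        length = sequence_lengths[i]
        for b in range(len(bins)):
            # A bin accepts the sequence only if both limits hold.
            if free[b] >= length and len(bins[b]) < bin_size:
                bins[b].append(i + group_offset)
                free[b] -= length
                break
        else:
            bins.append([i + group_offset])
            free.append(bin_capacity - length)
    return bins


print(pack_group_sketch([300, 700, 200, 500, 300], group_offset=100,
                        bin_capacity=1000, bin_size=2))
# [[101, 100], [103, 104], [102]]
```

In the example, the 200-token sequence would fit into the second bin by capacity, but that bin already holds `bin_size` sequences, so a third bin is opened.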

pack_parallel

utils.samplers.multipack.pack_parallel(
    sequence_lengths,
    bin_capacity,
    group_size,
    bin_size,
    num_processes=None,
    safe_mode=True,
    mp_start_method='spawn',
)

Pack sequences into bins using parallel processing

Parameters

Name Type Description Default
sequence_lengths np.ndarray Array of sequence lengths required
bin_capacity int Maximum capacity of each bin as total number of tokens required
group_size int Number of sequences to process in each group required
bin_size int Maximum number of sequences per bin required
num_processes int | None Number of parallel processes to use None
safe_mode bool If True, use a more conservative packing approach True
mp_start_method str | None Multiprocessing start method ('fork', 'spawn', 'forkserver'). 'spawn' is often safer with Numba/PyTorch. Set to None to use system default. 'spawn'

Returns

Name Type Description
List of bins, where each bin contains indices of sequences assigned to it
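The grouping step can be sketched serially, as below. Sequences are split into chunks of `group_size`, each chunk is packed on its own with its starting index as the offset, and the resulting bins are concatenated. The helpers `pack_one_group` and `pack_in_groups` are hypothetical, the loop runs in a single process for clarity, and the real function is described as handing the groups to worker processes instead.

```python
def pack_one_group(lengths, offset, bin_capacity, bin_size):
    """First-fit-decreasing packing of one group; returns bins of global indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        for b in range(len(bins)):
            if free[b] >= lengths[i] and len(bins[b]) < bin_size:
                bins[b].append(i + offset)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i + offset])
            free.append(bin_capacity - lengths[i])
    return bins


def pack_in_groups(sequence_lengths, bin_capacity, group_size, bin_size):
    all_bins = []
    for start in range(0, len(sequence_lengths), group_size):
        group = sequence_lengths[start:start + group_size]
        # Each group is independent, which is what makes parallel execution possible.
        all_bins.extend(pack_one_group(group, start, bin_capacity, bin_size))
    return all_bins


print(pack_in_groups([600, 400, 700, 300, 500, 500],
                     bin_capacity=1000, group_size=3, bin_size=4))
# [[2], [0, 1], [4, 5], [3]]
```

In this sketch, sequences 2 (700 tokens) and 3 (300 tokens) would fill a bin exactly, but they fall into different groups and so never share one. Smaller groups trade some packing density for independent units of work.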