utils.samplers.multipack

Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences into fixed-capacity batches to optimize memory usage and training throughput.

Classes

Name	Description
MultipackBatchSampler	Batch sampler class for efficient packing of variable-length sequences.

MultipackBatchSampler

utils.samplers.multipack.MultipackBatchSampler(
    sampler,
    batch_size,
    batch_max_len,
    lengths,
    packing_efficiency_estimate=1.0,
    drop_last=True,
    num_count_samples=4,
    sequential=False,
    group_size=100000,
    bin_size=200,
    num_processes=None,
    safe_mode=True,
    mp_start_method='fork',
    **kwargs,
)

Batch sampler class for efficient packing of variable-length sequences.

This sampler packs sequences into fixed-capacity bins (batches) to maximize GPU memory utilization and training throughput by reducing padding.

It supports both parallel packing (using the First-Fit Decreasing, or FFD, algorithm) and sequential packing (which preserves the original sequence order).

Methods

Name	Description
efficiency	Calculate the packing efficiency (ratio of tokens used to total token slots).
gather_efficiency	Gather and synchronize packing efficiency estimates across all distributed ranks.
gather_len_batches	Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
generate_batches	Generate packed batches for training.
set_epoch	Set the epoch number, used for reproducible shuffling across epochs.

efficiency

utils.samplers.multipack.MultipackBatchSampler.efficiency()

Calculate the packing efficiency (ratio of tokens used to total token slots). Higher is better; a value of 1.0 means perfect packing with no wasted slots.
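The ratio described above can be illustrated with a small standalone sketch. The helper name `packing_efficiency` and its arguments are hypothetical and are not part of this module; the sketch only assumes that every bin offers `bin_capacity` token slots.

```python
def packing_efficiency(bins, lengths, bin_capacity):
    """Ratio of tokens actually placed in bins to the slots those bins offer."""
    tokens_used = sum(lengths[i] for bin_ in bins for i in bin_)
    token_slots = len(bins) * bin_capacity
    return tokens_used / token_slots if token_slots else 0.0


lengths = [700, 300, 512, 400]   # tokens per sequence
bins = [[0, 1], [2, 3]]          # sequence indices assigned to each bin
eff = packing_efficiency(bins, lengths, bin_capacity=1024)
print(f"{eff:.3f}")
```

Here 1912 of 2048 slots hold real tokens, so the remaining slots would be padding.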

gather_efficiency

utils.samplers.multipack.MultipackBatchSampler.gather_efficiency()

Gather and synchronize packing efficiency estimates across all distributed ranks.

Returns

Name	Type	Description
	float	A conservative efficiency estimate based on the measurements gathered from all ranks.

gather_len_batches

utils.samplers.multipack.MultipackBatchSampler.gather_len_batches(num)

Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
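Taking the minimum matters because ranks that run different numbers of steps typically stall at the next collective operation. The sketch below simulates this with plain lists standing in for ranks; the helper `truncate_to_common_length` is hypothetical, and a real run would obtain the minimum through a distributed collective rather than a local `min`.

```python
def truncate_to_common_length(per_rank_batches):
    """Trim every rank's batch list to the shortest one so all ranks take the same number of steps."""
    min_batches = min(len(batches) for batches in per_rank_batches)
    return min_batches, [batches[:min_batches] for batches in per_rank_batches]


# Simulated ranks: rank 0 produced 3 batches, rank 1 only 2.
per_rank = [
    [[[0, 1]], [[2]], [[3, 4]]],
    [[[5]], [[6, 7]]],
]
min_batches, trimmed = truncate_to_common_length(per_rank)
print(min_batches, [len(b) for b in trimmed])
```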

generate_batches

utils.samplers.multipack.MultipackBatchSampler.generate_batches(set_stats=False)

Generate packed batches for training.

Parameters

Name	Type	Description	Default
set_stats	bool	Whether to update efficiency statistics.	`False`

Returns

Name	Type	Description
	list[list[list[int]]]	List of batches, where each batch contains multiple bins, and each bin contains multiple sequence indices.
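The nested return shape can be pictured as bins grouped into batches. The sketch below is an illustration only: `group_bins_into_batches` is a hypothetical helper, and it assumes that a batch holds `batch_size` bins and that `drop_last` discards a trailing incomplete batch.

```python
def group_bins_into_batches(bins, batch_size, drop_last=True):
    """Chunk a flat list of bins into batches of `batch_size` bins each."""
    batches = [bins[i:i + batch_size] for i in range(0, len(bins), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # assumption: an incomplete final batch is dropped
    return batches


bins = [[0, 3], [1], [2, 4, 5], [6], [7, 8]]  # each bin lists sequence indices
batches = group_bins_into_batches(bins, batch_size=2)
print(batches)
```

Each element of `batches` is one batch, each element of a batch is one bin, and each element of a bin is a sequence index.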

set_epoch

utils.samplers.multipack.MultipackBatchSampler.set_epoch(epoch)

Set the epoch number, used for reproducible shuffling across epochs.
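One common way to get reproducible per-epoch shuffling is to derive the random seed from the epoch number. The class below is a hypothetical sketch of that pattern using the standard library; it is not this sampler's implementation, and the `seed + epoch` scheme is an assumption.

```python
import random


class EpochShuffler:
    """Toy sampler whose shuffle order depends only on (seed, epoch)."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def order(self):
        # Assumption: seeding with seed + epoch gives each epoch its own
        # order while keeping reruns of the same epoch identical.
        rng = random.Random(self.seed + self.epoch)
        indices = list(range(self.num_items))
        rng.shuffle(indices)
        return indices


shuffler = EpochShuffler(num_items=10, seed=42)
shuffler.set_epoch(3)
first = shuffler.order()
second = shuffler.order()  # same epoch, so the same order
```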

Functions

Name	Description
allocate_sequentially	Sequential allocator that preserves example order.
ffd_check	First-fit-decreasing bin packing algorithm check.
pack_group	Pack a group of sequences into bins using the First-Fit Decreasing algorithm.
pack_parallel	Pack sequences into bins using parallel processing.

allocate_sequentially

utils.samplers.multipack.allocate_sequentially(
    sequence_lengths,
    rank,
    bin_capacity,
    num_ranks,
)

Sequential allocator that preserves example order.

Parameters

Name	Type	Description	Default
sequence_lengths	np.ndarray	The lengths of all examples.	required
rank	int	The current rank (for distributed training).	required
bin_capacity	int	The capacity of each bin (maximum sequence length).	required
num_ranks	int	Number of ranks (processes / GPUs).	required

Returns

Name	Type	Description
rank_batches	list[list[int]]	List of batches for the current rank.
total_tokens_used	int	Number of actual example tokens.
total_token_slots	int	Maximum theoretical number of example tokens (number of bins * bin capacity).
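A minimal sketch of order-preserving allocation follows. It is written in pure Python under stated assumptions rather than copied from the library: bins are filled greedily in input order, completed bins are dealt to ranks round-robin, sequences longer than `bin_capacity` are skipped, and the token counts cover only the current rank's bins.

```python
def allocate_sequentially_sketch(sequence_lengths, rank, bin_capacity, num_ranks):
    bins = []                 # all bins in order, before splitting across ranks
    current, used = [], 0
    for idx, length in enumerate(sequence_lengths):
        if length > bin_capacity:
            continue          # assumption: skip sequences that can never fit
        if used + length > bin_capacity:
            bins.append(current)      # close the full bin, start a new one
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        bins.append(current)

    rank_batches = bins[rank::num_ranks]   # assumption: round-robin over ranks
    total_tokens_used = sum(sequence_lengths[i] for b in rank_batches for i in b)
    total_token_slots = len(rank_batches) * bin_capacity
    return rank_batches, total_tokens_used, total_token_slots


lengths = [600, 500, 300, 900, 100, 200]
r0 = allocate_sequentially_sketch(lengths, rank=0, bin_capacity=1024, num_ranks=2)
r1 = allocate_sequentially_sketch(lengths, rank=1, bin_capacity=1024, num_ranks=2)
print(r0)
print(r1)
```

Because no sorting happens, indices inside each bin and across bins stay in their original order, which usually costs some packing efficiency compared with FFD.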

ffd_check

utils.samplers.multipack.ffd_check(sequence_lengths, bin_capacity, num_bins)

First-fit-decreasing bin packing algorithm check.

Checks if sequences with the given lengths could fit in the specified number of bins.

Parameters

Name	Type	Description	Default
sequence_lengths	np.ndarray	Array of sequence lengths.	required
bin_capacity	int	Maximum capacity of each bin.	required
num_bins	int	Number of bins available.	required

Returns

Name	Type	Description
	bool	`True` if all sequences can be packed, `False` otherwise.
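The check can be sketched in a few lines of pure Python. This is an illustrative version of first-fit decreasing with the hypothetical name `ffd_check_sketch`, not the library's code: sort lengths from largest to smallest, then place each one in the first bin that still has room.

```python
def ffd_check_sketch(sequence_lengths, bin_capacity, num_bins):
    """Return True if first-fit decreasing fits every sequence into num_bins bins."""
    remaining = [bin_capacity] * num_bins
    for length in sorted(sequence_lengths, reverse=True):
        for i, space in enumerate(remaining):
            if space >= length:
                remaining[i] -= length   # place in the first bin with room
                break
        else:
            return False                 # no bin could hold this sequence
    return True


fits_two = ffd_check_sketch([5, 4, 3, 2, 2], bin_capacity=8, num_bins=2)
fits_one = ffd_check_sketch([5, 4, 3, 2, 2], bin_capacity=8, num_bins=1)
print(fits_two, fits_one)
```

FFD is a heuristic, so a `False` result means this strategy failed, not that no valid packing exists.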

pack_group

utils.samplers.multipack.pack_group(
    sequence_lengths,
    group_offset,
    bin_capacity,
    max_bins,
    bin_size,
    safe_mode=True,
)

Pack a group of sequences into bins using the First-Fit Decreasing algorithm.

Parameters

Name	Type	Description	Default
sequence_lengths	np.ndarray	Array of sequence lengths.	required
group_offset	int	Offset to apply to indices when returning results.	required
bin_capacity	int	Maximum capacity of each bin.	required
max_bins	int	Maximum number of bins to use.	required
bin_size	int	Maximum number of sequences per bin.	required
safe_mode	bool	If `True`, use a more conservative packing approach.	`True`

Returns

Name	Type	Description
	list[list[int]]	List of bins, where each bin contains indices of sequences assigned to it.
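The interaction of `group_offset`, `bin_capacity`, `max_bins`, and `bin_size` can be sketched as follows. This is a simplified pure-Python illustration with the hypothetical name `pack_group_sketch`; it omits `safe_mode`, assumes every length is at most `bin_capacity`, and assumes that sequences are left unpacked once `max_bins` bins exist and none has room.

```python
def pack_group_sketch(sequence_lengths, group_offset, bin_capacity, max_bins, bin_size):
    # Visit local indices from longest to shortest sequence (the "decreasing" part).
    order = sorted(range(len(sequence_lengths)),
                   key=lambda i: sequence_lengths[i], reverse=True)
    remaining = []   # free token slots per bin
    bins = []        # global sequence indices per bin
    for local_idx in order:
        length = sequence_lengths[local_idx]
        for b, space in enumerate(remaining):
            # First fit: enough room and fewer than bin_size sequences so far.
            if space >= length and len(bins[b]) < bin_size:
                bins[b].append(local_idx + group_offset)
                remaining[b] -= length
                break
        else:
            if len(bins) >= max_bins:
                continue  # assumption: leave the sequence unpacked
            bins.append([local_idx + group_offset])
            remaining.append(bin_capacity - length)
    return bins


packed = pack_group_sketch([5, 4, 3, 2, 2], group_offset=100,
                           bin_capacity=8, max_bins=10, bin_size=2)
print(packed)
```

Adding `group_offset` turns positions within the group into indices into the full dataset, so results from several groups can be concatenated directly.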

pack_parallel

utils.samplers.multipack.pack_parallel(
    sequence_lengths,
    bin_capacity,
    group_size,
    bin_size,
    num_processes=None,
    safe_mode=True,
    mp_start_method='fork',
)

Pack sequences into bins using parallel processing.

Parameters

Name	Type	Description	Default
sequence_lengths	np.ndarray	Array of sequence lengths.	required
bin_capacity	int	Maximum capacity of each bin as total number of tokens.	required
group_size	int	Number of sequences to process in each group.	required
bin_size	int	Maximum number of sequences per bin.	required
num_processes	int | None	Number of parallel processes to use.	`None`
safe_mode	bool	If `True`, use a more conservative packing approach.	`True`
mp_start_method	str | None	Multiprocessing start method (`'fork'`, `'spawn'`, `'forkserver'`). `'spawn'` is often safer with Numba/PyTorch. Set to `None` to use the system default.	`'fork'`

Returns

Name	Type	Description
	list[list[int]]	List of bins, where each bin contains indices of sequences assigned to it.
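The group-wise strategy can be sketched as: split the lengths into chunks of `group_size`, pack each chunk independently, and concatenate the resulting bins. The code below is an illustration under assumptions, not the library's implementation. It swaps in a thread pool from `concurrent.futures` where the real function uses processes, and both `first_fit_group` and `pack_parallel_sketch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def first_fit_group(lengths, offset, bin_capacity, bin_size):
    """Hypothetical per-group packer: first-fit over lengths sorted descending."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    remaining, bins = [], []
    for i in order:
        for b, space in enumerate(remaining):
            if space >= lengths[i] and len(bins[b]) < bin_size:
                bins[b].append(i + offset)
                remaining[b] -= lengths[i]
                break
        else:
            bins.append([i + offset])
            remaining.append(bin_capacity - lengths[i])
    return bins


def pack_parallel_sketch(sequence_lengths, bin_capacity, group_size, bin_size, num_workers=2):
    # Each group carries its start offset so bins hold dataset-wide indices.
    groups = [(sequence_lengths[s:s + group_size], s)
              for s in range(0, len(sequence_lengths), group_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in submission order, so group order is preserved.
        results = list(pool.map(
            lambda g: first_fit_group(g[0], g[1], bin_capacity, bin_size), groups))
    return [bin_ for group_bins in results for bin_ in group_bins]


all_bins = pack_parallel_sketch([5, 4, 3, 2, 2, 6, 1], bin_capacity=8,
                                group_size=4, bin_size=3)
print(all_bins)
```

Packing groups independently trades a little efficiency (sequences cannot share a bin across groups) for work that can be spread over several workers.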