utils.samplers.multipack
Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences into fixed-capacity batches to optimize memory usage and training throughput.
Classes
| Name | Description | 
|---|---|
| MultipackBatchSampler | Batch sampler class for efficient packing of variable-length sequences | 
MultipackBatchSampler
utils.samplers.multipack.MultipackBatchSampler(
    sampler,
    batch_size,
    batch_max_len,
    lengths,
    packing_efficiency_estimate=1.0,
    drop_last=True,
    num_count_samples=4,
    sequential=False,
    group_size=100000,
    bin_size=200,
    num_processes=None,
    safe_mode=True,
    mp_start_method='fork',
    **kwargs,
)
Batch sampler class for efficient packing of variable-length sequences.
This sampler packs sequences into fixed-capacity bins (batches) to maximize GPU memory utilization and training throughput by reducing padding.
It supports both parallel packing (using the first-fit-decreasing, or FFD, algorithm) and sequential packing (which preserves the original sequence order).
Methods
| Name | Description | 
|---|---|
| efficiency | Calculate the packing efficiency (ratio of tokens used to total token slots). | 
| gather_efficiency | Gather and synchronize packing efficiency estimates across all distributed ranks. | 
| gather_len_batches | Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank. | 
| generate_batches | Generate packed batches for training. | 
| set_epoch | Set the epoch number, used for reproducible shuffling across epochs | 
efficiency
utils.samplers.multipack.MultipackBatchSampler.efficiency()
Calculate the packing efficiency (ratio of tokens used to total token slots). Higher is better: 1.0 would mean perfect packing with no wasted space.
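The ratio described above can be illustrated with a minimal sketch. The helper `packing_efficiency` and its arguments are hypothetical and are not part of this module; the sketch only shows the arithmetic of tokens used divided by total token slots.

```python
def packing_efficiency(bins, lengths, bin_capacity):
    """Hypothetical helper: tokens actually used / total token slots."""
    # Sum the lengths of every sequence index that was placed in a bin.
    tokens_used = sum(lengths[i] for b in bins for i in b)
    # Every bin reserves bin_capacity slots, whether or not they are filled.
    token_slots = len(bins) * bin_capacity
    return tokens_used / token_slots if token_slots else 0.0


lengths = [6, 4, 5, 3]
bins = [[0, 1], [2, 3]]  # bin 0 holds 10 tokens, bin 1 holds 8
print(packing_efficiency(bins, lengths, bin_capacity=10))  # 0.9
```

Here 18 of 20 slots are filled, so 10% of the reserved capacity would be padding.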
gather_efficiency
utils.samplers.multipack.MultipackBatchSampler.gather_efficiency()
Gather and synchronize packing efficiency estimates across all distributed ranks.
Returns
| Name | Type | Description | 
|---|---|---|
| | float | A conservative efficiency estimate based on the measurements. | 
gather_len_batches
utils.samplers.multipack.MultipackBatchSampler.gather_len_batches(num)
Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
generate_batches
utils.samplers.multipack.MultipackBatchSampler.generate_batches(set_stats=False)
Generate packed batches for training.
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| set_stats | bool | Whether to update efficiency statistics. | False | 
Returns
| Name | Type | Description | 
|---|---|---|
| | list[list[list[int]]] | List of batches, where each batch contains multiple bins, and each bin contains multiple sequence indices. | 
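To make the nested return shape concrete, the following sketch groups a flat list of bins into batches. The helper `group_bins_into_batches` is hypothetical, and the assumption that `batch_size` counts bins per batch and that `drop_last` discards a short final batch is an illustration, not a statement about the sampler's internals.

```python
def group_bins_into_batches(bins, batch_size, drop_last=True):
    """Hypothetical helper: chunk bins into batches of batch_size bins each."""
    batches = [bins[i:i + batch_size] for i in range(0, len(bins), batch_size)]
    # Assumption: an incomplete final batch is discarded when drop_last is set.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


bins = [[0, 3], [1], [2, 4], [5], [6]]  # each bin is a list of sequence indices
print(group_bins_into_batches(bins, batch_size=2))
# [[[0, 3], [1]], [[2, 4], [5]]]
```

The outer list holds batches, each batch holds bins, and each bin holds sequence indices, which matches the `list[list[list[int]]]` type.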
set_epoch
utils.samplers.multipack.MultipackBatchSampler.set_epoch(epoch)
Set the epoch number, used for reproducible shuffling across epochs.
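One common way to get reproducible per-epoch shuffling is to derive the random seed from the epoch number. The sketch below shows that general idea with the standard library; `epoch_permutation` and its `seed` argument are hypothetical, and the sampler's own seeding scheme may differ.

```python
import random


def epoch_permutation(num_items, seed, epoch):
    """Hypothetical helper: a shuffle order that depends only on seed and epoch."""
    rng = random.Random(seed + epoch)  # same (seed, epoch) gives the same order
    order = list(range(num_items))
    rng.shuffle(order)
    return order


# Re-running the same epoch reproduces the order exactly.
assert epoch_permutation(8, seed=0, epoch=2) == epoch_permutation(8, seed=0, epoch=2)
print(epoch_permutation(8, seed=0, epoch=2))
```

Because the order is a pure function of the seed and epoch, every rank that is given the same epoch number computes the same permutation without communicating.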
Functions
| Name | Description | 
|---|---|
| allocate_sequentially | Sequential allocator that preserves example order. | 
| ffd_check | First-fit-decreasing bin packing algorithm check. | 
| pack_group | Pack a group of sequences into bins using First-Fit Decreasing algorithm. | 
| pack_parallel | Pack sequences into bins using parallel processing. | 
allocate_sequentially
utils.samplers.multipack.allocate_sequentially(
    sequence_lengths,
    rank,
    bin_capacity,
    num_ranks,
)
Sequential allocator that preserves example order.
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| sequence_lengths | np.ndarray | The lengths of all examples. | required | 
| rank | int | The current rank (for distributed training). | required | 
| bin_capacity | int | The capacity of each bin (maximum sequence length). | required | 
| num_ranks | int | Number of ranks (processes / GPUs). | required | 
Returns
| Name | Type | Description | 
|---|---|---|
| rank_batches | list[list[int]] | List of batches for the current rank. | 
| total_tokens_used | int | Number of actual example tokens. | 
| total_token_slots | int | Maximum theoretical number of example tokens (number of bins * bin capacity). | 
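A minimal pure-Python sketch of order-preserving allocation follows. The function `allocate_in_order` is hypothetical; in particular, assigning every `num_ranks`-th bin to a rank and counting tokens only for that rank's bins are assumptions made for illustration, not confirmed behavior of `allocate_sequentially`.

```python
def allocate_in_order(sequence_lengths, rank, bin_capacity, num_ranks):
    """Hypothetical sketch: fill bins in the original order, then split by rank."""
    bins, current, remaining = [], [], bin_capacity
    for idx, length in enumerate(sequence_lengths):
        # Close the current bin when the next example does not fit.
        if length > remaining and current:
            bins.append(current)
            current, remaining = [], bin_capacity
        current.append(idx)
        remaining -= length
    if current:
        bins.append(current)

    # Assumption: bins are dealt to ranks in a strided, round-robin fashion.
    rank_batches = bins[rank::num_ranks]
    total_tokens_used = sum(sequence_lengths[i] for b in rank_batches for i in b)
    total_token_slots = len(rank_batches) * bin_capacity
    return rank_batches, total_tokens_used, total_token_slots


print(allocate_in_order([4, 3, 5, 2, 6], rank=0, bin_capacity=8, num_ranks=2))
# ([[0, 1], [4]], 13, 16)
```

Indices inside each bin stay in ascending order, which is the property that distinguishes this approach from first-fit-decreasing packing.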
ffd_check
utils.samplers.multipack.ffd_check(sequence_lengths, bin_capacity, num_bins)
First-fit-decreasing bin packing algorithm check.
Checks if sequences with the given lengths could fit in the specified number of bins.
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required | 
| bin_capacity | int | Maximum capacity of each bin. | required | 
| num_bins | int | Number of bins available. | required | 
Returns
| Name | Type | Description | 
|---|---|---|
| | bool | True if all sequences can be packed, False otherwise. | 
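The feasibility check can be sketched in plain Python as follows. The function `ffd_fits` is a hypothetical stand-in written for illustration; it applies the textbook first-fit-decreasing heuristic and is not the module's implementation.

```python
def ffd_fits(sequence_lengths, bin_capacity, num_bins):
    """Hypothetical sketch: can FFD place every sequence into num_bins bins?"""
    remaining = [bin_capacity] * num_bins
    # Decreasing order: place the longest sequences first.
    for length in sorted(sequence_lengths, reverse=True):
        # First fit: use the first bin that still has room.
        for b in range(num_bins):
            if remaining[b] >= length:
                remaining[b] -= length
                break
        else:
            return False  # no bin could take this sequence
    return True


print(ffd_fits([5, 4, 3, 3, 2, 1], bin_capacity=9, num_bins=2))  # True
print(ffd_fits([5, 4, 3, 3, 2, 1], bin_capacity=9, num_bins=1))  # False
```

First-fit-decreasing is a heuristic, so a `False` result from a check like this means the heuristic failed to fit the sequences, not that no valid packing exists.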
pack_group
utils.samplers.multipack.pack_group(
    sequence_lengths,
    group_offset,
    bin_capacity,
    max_bins,
    bin_size,
    safe_mode=True,
)
Pack a group of sequences into bins using the First-Fit Decreasing algorithm.
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required | 
| group_offset | int | Offset to apply to indices when returning results. | required | 
| bin_capacity | int | Maximum capacity of each bin. | required | 
| max_bins | int | Maximum number of bins to use. | required | 
| bin_size | int | Maximum number of sequences per bin. | required | 
| safe_mode | bool | If True, use a more conservative packing approach. | True | 
Returns
| Name | Type | Description | 
|---|---|---|
| | list[list[int]] | List of bins, where each bin contains indices of sequences assigned to it. | 
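The sketch below shows how first-fit-decreasing packing can return bins of offset indices while capping the number of sequences per bin. The function `ffd_pack` is hypothetical; it deliberately omits `max_bins` and `safe_mode`, whose exact behavior is not described here, and it assumes no sequence is longer than `bin_capacity`.

```python
def ffd_pack(sequence_lengths, group_offset, bin_capacity, bin_size):
    """Hypothetical sketch of FFD packing with a per-bin sequence limit."""
    # Visit indices from longest to shortest (stable for equal lengths).
    order = sorted(range(len(sequence_lengths)), key=lambda i: -sequence_lengths[i])
    bins, remaining = [], []
    for i in order:
        length = sequence_lengths[i]
        for b in range(len(bins)):
            # A bin is usable if it has token room and fewer than bin_size items.
            if remaining[b] >= length and len(bins[b]) < bin_size:
                bins[b].append(group_offset + i)
                remaining[b] -= length
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([group_offset + i])
            remaining.append(bin_capacity - length)
    return bins


print(ffd_pack([5, 4, 3, 3, 2, 1], group_offset=100, bin_capacity=9, bin_size=3))
# [[100, 101], [102, 103, 104], [105]]
```

In this example the last sequence would fit by token count into the second bin, but the `bin_size=3` limit forces a third bin. The offset turns group-local positions into indices that remain valid after several groups are combined.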
pack_parallel
utils.samplers.multipack.pack_parallel(
    sequence_lengths,
    bin_capacity,
    group_size,
    bin_size,
    num_processes=None,
    safe_mode=True,
    mp_start_method='fork',
)
Pack sequences into bins using parallel processing.
Parameters
| Name | Type | Description | Default | 
|---|---|---|---|
| sequence_lengths | np.ndarray | Array of sequence lengths. | required | 
| bin_capacity | int | Maximum capacity of each bin as total number of tokens. | required | 
| group_size | int | Number of sequences to process in each group. | required | 
| bin_size | int | Maximum number of sequences per bin. | required | 
| num_processes | int \| None | Number of parallel processes to use. | None | 
| safe_mode | bool | If True, use a more conservative packing approach. | True | 
| mp_start_method | str \| None | Multiprocessing start method ('fork', 'spawn', or 'forkserver'). 'spawn' is often safer with Numba/PyTorch. Set to None to use the system default. | 'fork' | 
Returns: List of bins, where each bin contains indices of sequences assigned to it.
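The group-wise structure of parallel packing can be sketched as below: sequences are split into groups of `group_size`, each group is packed independently, and the resulting bins are concatenated. All names here (`_pack_task`, `pack_in_groups`, `map_fn`) are hypothetical. The sketch runs the groups with the built-in `map`; a process pool's `map` could be passed instead, which is the assumed role of `num_processes` and `mp_start_method`. It also assumes no sequence exceeds `bin_capacity`.

```python
def _pack_task(task):
    """Hypothetical worker: first-fit-decreasing on one group of sequences."""
    lengths, offset, bin_capacity = task
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, remaining = [], []
    for i in order:
        for b in range(len(bins)):
            if remaining[b] >= lengths[i]:
                bins[b].append(offset + i)  # offset restores the global index
                remaining[b] -= lengths[i]
                break
        else:
            bins.append([offset + i])
            remaining.append(bin_capacity - lengths[i])
    return bins


def pack_in_groups(sequence_lengths, bin_capacity, group_size, map_fn=map):
    """Hypothetical sketch: pack each group independently, then concatenate."""
    tasks = [
        (sequence_lengths[start:start + group_size], start, bin_capacity)
        for start in range(0, len(sequence_lengths), group_size)
    ]
    all_bins = []
    for group_bins in map_fn(_pack_task, tasks):
        all_bins.extend(group_bins)
    return all_bins


print(pack_in_groups([5, 4, 3, 3, 2, 1], bin_capacity=9, group_size=3))
# [[0, 1], [2], [3, 4, 5]]
```

Because groups never share bins, this example uses three bins where packing all six sequences together would need only two. That is the trade-off a group-based approach makes in exchange for independent, parallelizable work units.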