utils.samplers.multipack
Multipack Batch Sampler - An efficient batch sampler for packing variable-length sequences into fixed-capacity batches to optimize memory usage and training throughput.
Classes
Name | Description |
---|---|
MultipackBatchSampler | Batch sampler class for efficient packing of variable-length sequences |
MultipackBatchSampler
utils.samplers.multipack.MultipackBatchSampler(
self,
sampler,
batch_size,
batch_max_len,
lengths,
packing_efficiency_estimate=1.0,
drop_last=False,
num_count_samples=16,
sequential=False,
group_size=100000,
bin_size=200,
num_processes=None,
safe_mode=True,
**kwargs,
)
Batch sampler class for efficient packing of variable-length sequences
This sampler packs sequences into fixed-capacity bins (batches) to maximize GPU memory utilization and training throughput by reducing padding.
It supports both parallel packing (using the First-Fit Decreasing, or FFD, algorithm) and sequential packing (which preserves the original sequence order).
Methods
Name | Description |
---|---|
efficiency | Calculate the packing efficiency (ratio of tokens used to total token slots) |
gather_efficiency | Gather and synchronize packing efficiency estimates across all distributed ranks |
gather_len_batches | Gather and synchronize batch counts across all distributed ranks |
generate_batches | Generate packed batches for training |
set_epoch | Set the epoch number, used for reproducible shuffling across epochs |
efficiency
utils.samplers.multipack.MultipackBatchSampler.efficiency()
Calculate the packing efficiency (ratio of tokens used to total token slots). Higher is better: 1.0 would mean perfect packing with no wasted space.
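The ratio described above can be illustrated with a minimal sketch. The helper name `packing_efficiency` and its arguments are hypothetical and are not part of this module; the sketch only assumes that each bin is a list of sequence indices and that every bin has the same token capacity.

```python
def packing_efficiency(bins, lengths, bin_capacity):
    """Return tokens used divided by total token slots (hypothetical helper)."""
    # Tokens actually occupied by the sequences placed in the bins.
    tokens_used = sum(lengths[i] for b in bins for i in b)
    # Every bin reserves bin_capacity slots, whether or not they are filled.
    token_slots = len(bins) * bin_capacity
    return tokens_used / token_slots if token_slots else 0.0


lengths = [700, 300, 512, 400]
bins = [[0, 1], [2, 3]]  # two bins holding sequence indices
eff = packing_efficiency(bins, lengths, bin_capacity=1024)
print(f"{eff:.3f}")  # prints 0.934
```

Here 1912 of 2048 slots are filled, so roughly 6.6% of the capacity would be padding.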
gather_efficiency
utils.samplers.multipack.MultipackBatchSampler.gather_efficiency()
Gather and synchronize packing efficiency estimates across all distributed ranks. Returns a conservative efficiency estimate based on the measurements.
gather_len_batches
utils.samplers.multipack.MultipackBatchSampler.gather_len_batches(num)
Gather and synchronize batch counts across all distributed ranks. Returns the minimum number of batches available on any rank.
generate_batches
utils.samplers.multipack.MultipackBatchSampler.generate_batches(set_stats=False)
Generate packed batches for training
Parameters
Name | Type | Description | Default |
---|---|---|---|
set_stats | | Whether to update efficiency statistics | False |
Returns
Name | Type | Description |
---|---|---|
| | | List of batches, where each batch contains multiple bins, and each bin contains multiple sequence indices |
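The nested return shape (batches containing bins, bins containing sequence indices) can be pictured with a small sketch. The helper `group_bins_into_batches` is hypothetical, and the way it applies `drop_last` is an assumption made for illustration, not a statement about how this sampler groups its bins.

```python
def group_bins_into_batches(bins, batch_size, drop_last=False):
    """Group consecutive bins into batches of batch_size (hypothetical helper)."""
    batches = [bins[i:i + batch_size] for i in range(0, len(bins), batch_size)]
    # Assumption: drop_last discards a trailing batch with fewer than batch_size bins.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


bins = [[0, 3], [1], [2, 4, 5]]  # each bin lists the sequence indices packed into it
batches = group_bins_into_batches(bins, batch_size=2)
# batches == [[[0, 3], [1]], [[2, 4, 5]]]
```

Indexing then reads as `batches[batch][bin][position]`, which yields a single sequence index.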
set_epoch
utils.samplers.multipack.MultipackBatchSampler.set_epoch(epoch)
Set the epoch number, used for reproducible shuffling across epochs
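A common way to get reproducible per-epoch shuffling is to derive the random seed from the epoch number. The sketch below shows that general pattern only; `epoch_permutation` and `base_seed` are hypothetical names, and nothing here is taken from this sampler's implementation.

```python
import random


def epoch_permutation(num_samples, epoch, base_seed=0):
    """Return a shuffled index order that depends only on the seed and epoch."""
    # Seeding from the epoch means every process computes the same order.
    rng = random.Random(base_seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


first = epoch_permutation(10, epoch=3)
again = epoch_permutation(10, epoch=3)  # same epoch, so the same order
```

Calling a sampler's `set_epoch` before each epoch is what lets a scheme like this vary the order between epochs while keeping runs repeatable.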
Functions
Name | Description |
---|---|
allocate_sequentially | Sequential allocator that preserves example order |
ffd_check | First-fit-decreasing bin packing algorithm check |
pack_group | Pack a group of sequences into bins using First-Fit Decreasing algorithm |
pack_parallel | Pack sequences into bins using parallel processing |
allocate_sequentially
utils.samplers.multipack.allocate_sequentially(
sequence_lengths,
rank,
bin_capacity,
num_ranks,
)
Sequential allocator that preserves example order
Parameters
Name | Type | Description | Default |
---|---|---|---|
sequence_lengths | np.ndarray | The lengths of all examples | required |
rank | int | The current rank (for distributed training) | required |
bin_capacity | int | The capacity of each bin (maximum sequence length) | required |
num_ranks | int | Number of ranks (processes/GPUs) | required |
Returns
Name | Type | Description |
---|---|---|
rank_batches | | List of batches for the current rank |
total_tokens_used | | Number of actual example tokens |
total_token_slots | | Maximum theoretical number of example tokens (number of bins * bin capacity) |
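An order-preserving allocator of this kind can be sketched as follows. The function `allocate_in_order` is a hypothetical, pure-Python illustration: it assumes that a new bin is opened whenever the next sequence does not fit, and that bins are handed to ranks round-robin. Both are assumptions and may differ from the module's behavior.

```python
def allocate_in_order(sequence_lengths, rank, bin_capacity, num_ranks):
    """Fill bins in the original order, then keep every num_ranks-th bin."""
    all_bins, current, free, tokens_used = [], [], bin_capacity, 0
    for idx, size in enumerate(sequence_lengths):
        if size > free and current:
            # The next sequence does not fit: close this bin and open a new one.
            all_bins.append(current)
            current, free = [], bin_capacity
        current.append(idx)
        free -= size
        tokens_used += size
    if current:
        all_bins.append(current)
    # Assumption: bins are striped across ranks (rank, rank + num_ranks, ...).
    rank_bins = all_bins[rank::num_ranks]
    return rank_bins, tokens_used, len(all_bins) * bin_capacity


lengths = [400, 500, 300, 900, 100]
rank0, used, slots = allocate_in_order(lengths, rank=0, bin_capacity=1024, num_ranks=2)
rank1, _, _ = allocate_in_order(lengths, rank=1, bin_capacity=1024, num_ranks=2)
# rank0 == [[0, 1], [3, 4]], rank1 == [[2]], used == 2200, slots == 3072
```

Because sequences are never reordered, indices inside each bin stay ascending, at the cost of lower packing efficiency than a sorted strategy would typically reach.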
ffd_check
utils.samplers.multipack.ffd_check(sequence_lengths, bin_capacity, num_bins)
First-fit-decreasing bin packing algorithm check
Checks if sequences with the given lengths could fit in the specified number of bins
Parameters
Name | Type | Description | Default |
---|---|---|---|
sequence_lengths | np.ndarray | Array of sequence lengths | required |
bin_capacity | int | Maximum capacity of each bin | required |
num_bins | int | Number of bins available | required |
Returns
Name | Type | Description |
---|---|---|
| | | True if all sequences can be packed, False otherwise |
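A first-fit-decreasing feasibility check can be sketched in a few lines. The name `ffd_fits` is hypothetical and the code is a plain-Python illustration of the general technique, not the module's implementation.

```python
def ffd_fits(sequence_lengths, bin_capacity, num_bins):
    """Return True if first-fit decreasing places every sequence in num_bins bins."""
    remaining = [bin_capacity] * num_bins
    # Place the longest sequences first.
    for size in sorted(sequence_lengths, reverse=True):
        for i, free in enumerate(remaining):
            if size <= free:
                remaining[i] -= size  # first bin with enough room wins
                break
        else:
            return False  # no bin could take this sequence
    return True


fits_two = ffd_fits([600, 500, 400, 300], bin_capacity=1024, num_bins=2)  # True
fits_one = ffd_fits([600, 500, 400, 300], bin_capacity=1024, num_bins=1)  # False
```

First-fit decreasing is a heuristic, so a False result means this strategy failed, not that no valid packing exists.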
pack_group
utils.samplers.multipack.pack_group(
sequence_lengths,
group_offset,
bin_capacity,
max_bins,
bin_size,
safe_mode=True,
)
Pack a group of sequences into bins using First-Fit Decreasing algorithm
Parameters
Name | Type | Description | Default |
---|---|---|---|
sequence_lengths | np.ndarray | Array of sequence lengths | required |
group_offset | int | Offset to apply to indices when returning results | required |
bin_capacity | int | Maximum capacity of each bin | required |
max_bins | int | Maximum number of bins to use | required |
bin_size | int | Maximum number of sequences per bin | required |
safe_mode | bool | If True, use a more conservative packing approach | True |
Returns
Name | Type | Description |
---|---|---|
| | | List of bins, where each bin contains indices of sequences assigned to it |
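The interplay of `group_offset`, `bin_capacity`, `max_bins`, and `bin_size` can be sketched as below. `pack_group_ffd` is a hypothetical stand-in: it ignores `safe_mode`, and raising when `max_bins` is exceeded is an assumption made only to keep the example short.

```python
def pack_group_ffd(sequence_lengths, group_offset, bin_capacity, max_bins, bin_size):
    """First-fit decreasing packing that returns offset sequence indices per bin."""
    order = sorted(range(len(sequence_lengths)),
                   key=lambda i: sequence_lengths[i], reverse=True)
    bins, free = [], []
    for i in order:
        size = sequence_lengths[i]
        for b in range(len(bins)):
            # A bin must have room in tokens and fewer than bin_size sequences.
            if size <= free[b] and len(bins[b]) < bin_size:
                bins[b].append(i + group_offset)
                free[b] -= size
                break
        else:
            if len(bins) >= max_bins:
                raise ValueError("max_bins exceeded")  # assumption for this sketch
            bins.append([i + group_offset])
            free.append(bin_capacity - size)
    return bins


packed = pack_group_ffd([300, 700, 500, 200], group_offset=100,
                        bin_capacity=1024, max_bins=4, bin_size=2)
# packed == [[101, 100], [102, 103]]
```

The offset turns group-local positions into indices that are valid for the full dataset, which is what allows separately packed groups to be merged afterwards.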
pack_parallel
utils.samplers.multipack.pack_parallel(
sequence_lengths,
bin_capacity,
group_size,
bin_size,
num_processes=None,
safe_mode=True,
mp_start_method='spawn',
)
Pack sequences into bins using parallel processing
Parameters
Name | Type | Description | Default |
---|---|---|---|
sequence_lengths | np.ndarray | Array of sequence lengths | required |
bin_capacity | int | Maximum capacity of each bin as total number of tokens | required |
group_size | int | Number of sequences to process in each group | required |
bin_size | int | Maximum number of sequences per bin | required |
num_processes | int \| None | Number of parallel processes to use | None |
safe_mode | bool | If True, use a more conservative packing approach | True |
mp_start_method | str \| None | Multiprocessing start method ('fork', 'spawn', 'forkserver'). 'spawn' is often safer with Numba/PyTorch. Set to None to use the system default. | 'spawn' |
Returns
Name | Type | Description |
---|---|---|
| | | List of bins, where each bin contains indices of sequences assigned to it |
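The group-wise strategy can be sketched as: split the lengths into chunks of `group_size`, pack each chunk independently with its starting offset, then concatenate the results. In the sketch below, `pack_in_groups` and `pack_one_group` are hypothetical names, the per-group packer is a simple in-order stand-in rather than FFD, and a thread pool stands in for worker processes so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def pack_one_group(task):
    """Stand-in per-group packer: fill bins in order, returning global indices."""
    lengths, offset, bin_capacity = task
    bins, current, free = [], [], bin_capacity
    for i, size in enumerate(lengths):
        if size > free and current:
            bins.append(current)
            current, free = [], bin_capacity
        current.append(i + offset)
        free -= size
    if current:
        bins.append(current)
    return bins


def pack_in_groups(sequence_lengths, bin_capacity, group_size, num_workers=2):
    """Pack each chunk of group_size sequences independently and merge the bins."""
    tasks = [
        (sequence_lengths[start:start + group_size], start, bin_capacity)
        for start in range(0, len(sequence_lengths), group_size)
    ]
    # Assumption: threads replace the process pool purely for illustration.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_group = list(pool.map(pack_one_group, tasks))  # map keeps task order
    return [b for group in per_group for b in group]


all_bins = pack_in_groups([500, 400, 300, 900, 100, 200],
                          bin_capacity=1024, group_size=3)
# all_bins == [[0, 1], [2], [3, 4], [5]]
```

Since groups are packed in isolation, a sequence is never combined with one from another group; larger groups give the packer more freedom but reduce the amount of work that can run concurrently.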