utils.data.utils

Data handling helpers

Classes

| Name | Description |
| --- | --- |
| RetryStrategy | Enum for retry strategies. |

RetryStrategy

utils.data.utils.RetryStrategy()

Enum for retry strategies.

Functions

| Name | Description |
| --- | --- |
| deduplicate_and_log_datasets | Deduplicate datasets, with optional cross-dataset deduplication. |
| handle_long_seq_in_dataset | Remove sequences longer than configured maximum from dataset. |
| keep_min_len | Batched filter function that keeps only samples with sequence length >= min_sequence_len. |
| md5 | Generate MD5 hash of a string. |
| remove_double_bos_token | Remove double BOS tokens that may occur when retokenizing preprocessed data. |
| retry_on_request_exceptions | Decorator that retries function calls on specific request exceptions. |
| sha256 | Generate SHA256 hash of a string. |
| truncate_long_seq | Truncate samples whose sequence length is too long (> sequence_len). |

deduplicate_and_log_datasets

utils.data.utils.deduplicate_and_log_datasets(
    dataset,
    other_dataset=None,
    dataset_name='train',
    other_name='eval',
)

Deduplicate datasets, with optional cross-dataset deduplication.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | Dataset | Primary dataset to deduplicate. | required |
| other_dataset | Dataset \| None | Optional second dataset to deduplicate against the first. | None |
| dataset_name | str \| None | Name for the primary dataset (for logging). | 'train' |
| other_name | str \| None | Name for the second dataset (for logging). | 'eval' |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | tuple[Dataset, Dataset \| None] | Tuple of (deduplicated_dataset, deduplicated_other_dataset). |
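
The cross-dataset behaviour described above can be illustrated with a minimal sketch. It works on plain lists of dicts rather than `Dataset` objects, and `row_hash` and `dedupe_rows` are hypothetical names invented for this illustration; the library's actual hashing and logging are not shown on this page.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable fingerprint of a row; sort_keys makes key order irrelevant."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedupe_rows(rows, other_rows=None):
    """Drop repeated rows, then drop rows of other_rows already seen in rows.

    Hypothetical stand-in for illustration only, not the library function.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row_hash(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)

    if other_rows is None:
        return unique, None

    unique_other = []
    for row in other_rows:
        key = row_hash(row)
        if key not in seen:  # also filters duplicates inside other_rows
            seen.add(key)
            unique_other.append(row)
    return unique, unique_other


train = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
evals = [{"text": "b"}, {"text": "c"}]
train_unique, eval_unique = dedupe_rows(train, evals)
```

In this sketch the row `{"text": "b"}` is kept in the primary list and removed from the second one, which mirrors the idea of deduplicating the second dataset against the first.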

handle_long_seq_in_dataset

utils.data.utils.handle_long_seq_in_dataset(dataset, sequence_len, cfg)

Remove, truncate, or reject sequences longer than the configured maximum, depending on the excess_length_strategy value.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | Dataset | Dataset to filter. | required |
| sequence_len | int | Maximum length for sequences to keep. | required |
| cfg | DictDefault | Dictionary mapping axolotl config keys to values. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | Dataset | Filtered dataset with long sequences handled according to the excess_length_strategy value: 'drop' (default) excludes any sequence longer than sequence_len; 'truncate' truncates them down to sequence_len; 'raise' raises a ValueError if any sequence was found that was longer than sequence_len. |
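
The three strategies can be sketched as a simple dispatch. This is an illustration under stated assumptions, not the library code: `handle_long_rows` is a hypothetical name, rows are plain dicts, the length is read from an assumed `input_ids` column, and every column is assumed to be a token-aligned list.

```python
def handle_long_rows(rows, sequence_len, strategy="drop"):
    """Apply a drop / truncate / raise policy to over-long rows (sketch)."""
    if strategy == "drop":
        # Keep only rows that already fit.
        return [r for r in rows if len(r["input_ids"]) <= sequence_len]
    if strategy == "truncate":
        # Assumption: every column is a list aligned with input_ids.
        return [{k: v[:sequence_len] for k, v in r.items()} for r in rows]
    if strategy == "raise":
        if any(len(r["input_ids"]) > sequence_len for r in rows):
            raise ValueError(f"Found a sequence longer than {sequence_len}")
        return list(rows)
    raise ValueError(f"Unknown strategy: {strategy!r}")


rows = [{"input_ids": [1, 2, 3]}, {"input_ids": [1, 2, 3, 4, 5]}]
dropped = handle_long_rows(rows, sequence_len=4, strategy="drop")
truncated = handle_long_rows(rows, sequence_len=4, strategy="truncate")
```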

keep_min_len

utils.data.utils.keep_min_len(sample, min_sequence_len=2)

Batched filter function that keeps only samples with sequence length >= min_sequence_len. Returns a list of booleans indicating which samples to keep.
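
A batched filter receives a dict of columns and returns one boolean per row. The sketch below shows that shape; `keep_min_len_sketch` is a hypothetical name, and reading lengths from an `input_ids` column is an assumption.

```python
def keep_min_len_sketch(batch, min_sequence_len=2):
    """Return one keep/drop flag per row of a batched sample (sketch)."""
    # Assumption: batch["input_ids"] is a list of token-id lists.
    return [len(ids) >= min_sequence_len for ids in batch["input_ids"]]


batch = {"input_ids": [[1], [1, 2], [1, 2, 3]]}
flags = keep_min_len_sketch(batch)
```

A function of this shape is the kind of callable that Hugging Face `datasets.Dataset.filter` accepts when called with `batched=True`.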

md5

utils.data.utils.md5(to_hash, encoding='utf-8')

Generate MD5 hash of a string.

remove_double_bos_token

utils.data.utils.remove_double_bos_token(example, bos_token_id)

Remove double BOS tokens that may occur when retokenizing preprocessed data for tokenizers and chat templates that have a bos_token, e.g. DPO + Llama.
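
To make the idea concrete, here is a minimal sketch of stripping a repeated leading BOS token from a single tokenized example. `strip_double_bos` is a hypothetical name, and the column names (`input_ids`, `attention_mask`, `labels`) are assumptions; the library function may operate on different fields or on batches.

```python
def strip_double_bos(example, bos_token_id):
    """Drop the first token when the sequence starts with two BOS ids (sketch)."""
    ids = example["input_ids"]
    if len(ids) >= 2 and ids[0] == bos_token_id and ids[1] == bos_token_id:
        # Keep token-aligned columns the same length as input_ids.
        for key in ("input_ids", "attention_mask", "labels"):
            if key in example:
                example[key] = example[key][1:]
    return example


doubled = {"input_ids": [1, 1, 42, 7], "attention_mask": [1, 1, 1, 1]}
cleaned = strip_double_bos(doubled, bos_token_id=1)
```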

retry_on_request_exceptions

utils.data.utils.retry_on_request_exceptions(
    max_retries=3,
    delay=1,
    retry_strategy=RetryStrategy.LINEAR,
)

Decorator that retries function calls on specific request exceptions.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_retries | | Maximum number of retry attempts. | 3 |
| delay | | Base delay between retries in seconds. | 1 |
| retry_strategy | RetryStrategy | Strategy for calculating retry delays. | RetryStrategy.LINEAR |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | Callable | Decorated function with retry logic. |
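
A retry decorator with a configurable delay strategy can be sketched as follows. Everything here is an assumption made for illustration: `Backoff` is a hypothetical stand-in for `RetryStrategy` (only `LINEAR` is confirmed by this page), the delay formulas are one plausible choice, `max_retries` is treated as the total number of attempts, and built-in `ConnectionError` and `TimeoutError` stand in for the request exceptions the library actually catches.

```python
import functools
import time
from enum import Enum


class Backoff(Enum):
    """Hypothetical stand-in for RetryStrategy."""
    CONSTANT = 1
    LINEAR = 2
    EXPONENTIAL = 3


def retry_sketch(max_retries=3, delay=1, strategy=Backoff.LINEAR,
                 exceptions=(ConnectionError, TimeoutError)):
    """Retry the wrapped call on the given exceptions (sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the last error
                    if strategy is Backoff.EXPONENTIAL:
                        wait = delay * 2 ** attempt
                    elif strategy is Backoff.LINEAR:
                        wait = delay * (attempt + 1)
                    else:
                        wait = delay
                    time.sleep(wait)
        return wrapper
    return decorator


calls = []


@retry_sketch(max_retries=3, delay=0)
def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"


result = flaky()
```

Exceptions outside the configured tuple propagate immediately, so programming errors are not masked by the retry loop.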

sha256

utils.data.utils.sha256(to_hash, encoding='utf-8')

Generate SHA256 hash of a string.
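
Both hash helpers take a string and an encoding. A minimal sketch with the standard `hashlib` module is shown below; the `_sketch` names are hypothetical, and returning a hexadecimal digest is an assumption, since this page does not state the return format.

```python
import hashlib


def md5_sketch(to_hash, encoding="utf-8"):
    """Encode the string, then return its MD5 digest as hex (sketch)."""
    return hashlib.md5(to_hash.encode(encoding)).hexdigest()


def sha256_sketch(to_hash, encoding="utf-8"):
    """Encode the string, then return its SHA-256 digest as hex (sketch)."""
    return hashlib.sha256(to_hash.encode(encoding)).hexdigest()


md5_digest = md5_sketch("hello")
sha_digest = sha256_sketch("hello")
```

The hex digests are 32 and 64 characters long respectively. MD5 is suitable for fingerprinting and cache keys, not for security-sensitive uses.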

truncate_long_seq

utils.data.utils.truncate_long_seq(sample, sequence_len=2048)

Truncate samples whose sequence length is too long (> sequence_len). Modifies the sample in-place and returns the modified sample.
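
The in-place behaviour can be sketched on a single sample. `truncate_sketch` is a hypothetical name and the column names are assumptions; the library function may handle batches or other fields.

```python
def truncate_sketch(sample, sequence_len=2048):
    """Cut token-aligned columns to sequence_len, mutating the sample (sketch)."""
    if len(sample["input_ids"]) > sequence_len:
        for key in ("input_ids", "attention_mask", "labels"):
            if key in sample:
                sample[key] = sample[key][:sequence_len]
    return sample


sample = {"input_ids": [5, 6, 7, 8], "labels": [5, 6, 7, 8]}
returned = truncate_sketch(sample, sequence_len=3)
```

Because the dict is modified in place, `returned` and `sample` refer to the same object after the call.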