utils.data.utils
Data handling helpers
Classes
| Name | Description |
|---|---|
| RetryStrategy | Enum for retry strategies. |
RetryStrategy
utils.data.utils.RetryStrategy()
Enum for retry strategies.
Functions
| Name | Description |
|---|---|
| deduplicate_and_log_datasets | Deduplicate datasets, with optional cross-dataset deduplication. |
| handle_long_seq_in_dataset | Remove sequences longer than configured maximum from dataset. |
| keep_min_len | Batched filter function that keeps only samples with sequence length >= min_sequence_len. |
| md5 | Generate MD5 hash of a string. |
| remove_double_bos_token | Remove double bos tokens that may occur when retokenizing preprocessed data. |
| retry_on_request_exceptions | Decorator that retries function calls on specific request exceptions. |
| sha256 | Generate SHA256 hash of a string. |
| truncate_long_seq | Truncate samples whose sequence length is too long (> sequence_len). |
deduplicate_and_log_datasets
utils.data.utils.deduplicate_and_log_datasets(
dataset,
other_dataset=None,
dataset_name='train',
other_name='eval',
)
Deduplicate datasets, with optional cross-dataset deduplication.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | Primary dataset to deduplicate. | required |
| other_dataset | Dataset \| None | Optional second dataset to deduplicate against the first. | None |
| dataset_name | str \| None | Name for the primary dataset (for logging). | 'train' |
| other_name | str \| None | Name for the second dataset (for logging). | 'eval' |
Returns
| Name | Type | Description |
|---|---|---|
| | tuple[Dataset, Dataset \| None] | Tuple of (deduplicated_dataset, deduplicated_other_dataset). |
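The cross-dataset behaviour can be illustrated with a minimal sketch on plain lists of dicts. This is not the library's implementation: `row_fingerprint` and `deduplicate_rows` are hypothetical names, the SHA-256 fingerprint is an assumption, and the logging the real function performs is omitted.

```python
import hashlib


def row_fingerprint(row):
    # Hypothetical helper: hash a stable string rendering of the row.
    rendered = repr(sorted(row.items()))
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


def deduplicate_rows(rows, other_rows=None):
    """Drop repeated rows; rows of other_rows already seen in rows are dropped too."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = row_fingerprint(row)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)

    if other_rows is None:
        return unique, None

    unique_other = []
    for row in other_rows:
        fingerprint = row_fingerprint(row)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_other.append(row)
    return unique, unique_other


train = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
evals = [{"text": "b"}, {"text": "c"}]
dedup_train, dedup_eval = deduplicate_rows(train, evals)
print(dedup_train)  # [{'text': 'a'}, {'text': 'b'}]
print(dedup_eval)   # [{'text': 'c'}]
```

In this sketch the primary list keeps the first copy of each row, and the second list loses any row that already appeared in the first, which avoids train/eval leakage.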
handle_long_seq_in_dataset
utils.data.utils.handle_long_seq_in_dataset(dataset, sequence_len, cfg)
Remove sequences longer than configured maximum from dataset.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | Dataset to filter. | required |
| sequence_len | int | Maximum length for sequences to keep | required |
| cfg | DictDefault | Dictionary mapping axolotl config keys to values. | required |
Returns
| Name | Type | Description |
|---|---|---|
| | Dataset | Filtered dataset with long sequences handled according to the excess_length_strategy value: ‘drop’ (default) excludes any sequence longer than sequence_len; ‘truncate’ truncates them down to sequence_len; ‘raise’ raises a ValueError if any sequence is longer than sequence_len. |
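The three strategies can be sketched on a plain list of tokenized rows. This is only an illustration under stated assumptions: `handle_long_sequences` is a hypothetical name, each row is assumed to be a dict of per-token lists with an `input_ids` field, and the strategy is passed directly instead of being read from `cfg`.

```python
def handle_long_sequences(rows, sequence_len, strategy="drop"):
    """Apply a 'drop', 'truncate', or 'raise' strategy to over-long rows."""
    if strategy == "drop":
        # Keep only rows that already fit.
        return [row for row in rows if len(row["input_ids"]) <= sequence_len]
    if strategy == "truncate":
        # Cut every per-token field down to sequence_len.
        return [
            {key: value[:sequence_len] for key, value in row.items()}
            for row in rows
        ]
    if strategy == "raise":
        too_long = [row for row in rows if len(row["input_ids"]) > sequence_len]
        if too_long:
            raise ValueError(
                f"{len(too_long)} sequence(s) exceed sequence_len={sequence_len}"
            )
        return rows
    raise ValueError(f"unknown strategy: {strategy!r}")


rows = [{"input_ids": [1, 2, 3]}, {"input_ids": [1, 2, 3, 4, 5]}]
dropped = handle_long_sequences(rows, sequence_len=4, strategy="drop")
truncated = handle_long_sequences(rows, sequence_len=4, strategy="truncate")
print(dropped)    # [{'input_ids': [1, 2, 3]}]
print(truncated)  # [{'input_ids': [1, 2, 3]}, {'input_ids': [1, 2, 3, 4]}]
```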
keep_min_len
utils.data.utils.keep_min_len(sample, min_sequence_len=2)
Batched filter function that keeps only samples with sequence length >= min_sequence_len. Returns a list of booleans indicating which samples to keep.
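A batched filter of this kind receives a dict of columns and returns one boolean per sample. The following is a minimal sketch, assuming the sequence lives in an `input_ids` column (an assumption, not confirmed by this page); `keep_min_len_sketch` is a hypothetical name.

```python
def keep_min_len_sketch(batch, min_sequence_len=2):
    # batch["input_ids"] is a list of token-id lists, one per sample.
    return [len(ids) >= min_sequence_len for ids in batch["input_ids"]]


batch = {"input_ids": [[1], [1, 2], [1, 2, 3]]}
mask = keep_min_len_sketch(batch)
print(mask)  # [False, True, True]
```

A function with this shape is the kind of callable that a batched dataset filter expects: one boolean per input row, in the same order.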
md5
utils.data.utils.md5(to_hash, encoding='utf-8')
Generate MD5 hash of a string.
remove_double_bos_token
utils.data.utils.remove_double_bos_token(example, bos_token_id)
Remove double bos tokens that may occur when retokenizing preprocessed data for tokenizers and chat templates that have a bos_token, e.g. DPO + Llama.
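To make the situation concrete, here is a small sketch of the idea: if a tokenized example starts with two bos tokens, drop the first one. The field names and the assumption that every field is a per-token list are illustrative only; `strip_double_bos` is a hypothetical name and may differ from what the library does.

```python
def strip_double_bos(example, bos_token_id):
    """Drop the leading token when the first two ids are both bos_token_id."""
    ids = example["input_ids"]
    if len(ids) >= 2 and ids[0] == bos_token_id and ids[1] == bos_token_id:
        # Shift every per-token field by one so the fields stay aligned.
        return {key: value[1:] for key, value in example.items()}
    return example


example = {"input_ids": [1, 1, 42, 7], "attention_mask": [1, 1, 1, 1]}
cleaned = strip_double_bos(example, bos_token_id=1)
print(cleaned)  # {'input_ids': [1, 42, 7], 'attention_mask': [1, 1, 1]}
```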
retry_on_request_exceptions
utils.data.utils.retry_on_request_exceptions(
max_retries=3,
delay=1,
retry_strategy=RetryStrategy.LINEAR,
)
Decorator that retries function calls on specific request exceptions.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| max_retries | | Maximum number of retry attempts. | 3 |
| delay | | Base delay between retries in seconds. | 1 |
| retry_strategy | RetryStrategy | Strategy for calculating retry delays. | RetryStrategy.LINEAR |
Returns
| Name | Type | Description |
|---|---|---|
| | Callable | Decorated function with retry logic. |
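The general pattern behind such a decorator can be sketched as follows. Everything here is an assumption made for illustration: the enum members other than `LINEAR`, the delay formulas, the built-in `ConnectionError` and `TimeoutError` standing in for the request exceptions, and the injectable `sleep` parameter are not taken from this page.

```python
import functools
import time
from enum import Enum


class RetryStrategy(Enum):
    # Only LINEAR appears on this page; the other members are assumed.
    CONSTANT = 1
    LINEAR = 2
    EXPONENTIAL = 3


def compute_delay(delay, attempt, strategy):
    # attempt is zero-based: 0 for the wait after the first failure.
    if strategy is RetryStrategy.EXPONENTIAL:
        return delay * 2**attempt
    if strategy is RetryStrategy.LINEAR:
        return delay * (attempt + 1)
    return delay


def retry_sketch(
    max_retries=3,
    delay=1,
    retry_strategy=RetryStrategy.LINEAR,
    retry_on=(ConnectionError, TimeoutError),  # stand-in exception types
    sleep=time.sleep,  # injectable so the example runs without waiting
):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: surface the last error
                    sleep(compute_delay(delay, attempt, retry_strategy))

        return wrapper

    return decorator


waits = []
calls = {"n": 0}


@retry_sketch(max_retries=3, delay=1, sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky()
print(result, waits)  # ok [1, 2]
```

With the linear strategy in this sketch, the wait grows by the base delay after each failed attempt; the last failure is re-raised instead of being swallowed.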
sha256
utils.data.utils.sha256(to_hash, encoding='utf-8')
Generate SHA256 hash of a string.
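Both hash helpers (md5 and sha256) take a string and an encoding. A minimal sketch with the standard `hashlib` module, assuming the result is returned as a hexadecimal string (the return format is not stated on this page); the `_sketch` names are hypothetical:

```python
import hashlib


def md5_sketch(to_hash, encoding="utf-8"):
    # Encode the string first: hashlib operates on bytes.
    return hashlib.md5(to_hash.encode(encoding)).hexdigest()


def sha256_sketch(to_hash, encoding="utf-8"):
    return hashlib.sha256(to_hash.encode(encoding)).hexdigest()


md5_digest = md5_sketch("abc")
sha256_digest = sha256_sketch("abc")
print(len(md5_digest), len(sha256_digest))  # 32 64
```

The same input string hashed under a different encoding generally gives a different digest, which is why the encoding is part of the signature.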
truncate_long_seq
utils.data.utils.truncate_long_seq(sample, sequence_len=2048)
Truncate samples whose sequence length is too long (> sequence_len). Modifies the sample in-place and returns the modified sample.
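The in-place behaviour can be sketched for a single sample. The field names (`input_ids`, `attention_mask`, `labels`) are assumptions, `truncate_sample` is a hypothetical name, and the real function may also handle batched input.

```python
def truncate_sample(sample, sequence_len=2048):
    """Cut per-token fields to sequence_len, mutating and returning sample."""
    if len(sample["input_ids"]) > sequence_len:
        for key in ("input_ids", "attention_mask", "labels"):
            if key in sample:
                sample[key] = sample[key][:sequence_len]
    return sample


sample = {"input_ids": [5, 6, 7, 8], "labels": [5, 6, 7, 8]}
returned = truncate_sample(sample, sequence_len=3)
print(returned)            # {'input_ids': [5, 6, 7], 'labels': [5, 6, 7]}
print(returned is sample)  # True
```

Because the sample is modified in place, callers that still need the untruncated data should copy it first.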