common.datasets

Dataset loading utilities.

Classes

Name Description
TrainDatasetMeta Dataclass with fields for training and validation datasets and metadata.

TrainDatasetMeta

common.datasets.TrainDatasetMeta(
    self,
    train_dataset,
    eval_dataset=None,
    total_num_steps=None,
)

Dataclass with fields for training and validation datasets and metadata.
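
The constructor signature above suggests one required field and two optional ones. A minimal stand-in is sketched here to show that shape. The class name `TrainDatasetMetaSketch` and the `Any` type annotations are assumptions made for illustration, not the library's definitions; the real class presumably holds dataset objects rather than plain lists.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TrainDatasetMetaSketch:
    """Hypothetical stand-in mirroring the documented signature; not the library class."""

    train_dataset: Any  # required training split
    eval_dataset: Optional[Any] = None  # optional evaluation split
    total_num_steps: Optional[int] = None  # optional step count


# Only the training split is required; the other fields default to None.
meta = TrainDatasetMetaSketch(train_dataset=["example 1", "example 2"])
print(meta.eval_dataset, meta.total_num_steps)
```

Callers that receive such an object should check `eval_dataset` and `total_num_steps` for `None` before using them, since both default to `None` in the documented signature.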

Functions

Name Description
load_datasets Loads one or more training or evaluation datasets, calling axolotl.utils.data.prepare_dataset.
load_preference_datasets Loads one or more training or evaluation datasets for RL training using paired preference data.
sample_dataset Randomly sample num_samples samples from dataset.

load_datasets

common.datasets.load_datasets(cfg, cli_args=None, debug=False)

Loads one or more training or evaluation datasets by calling axolotl.utils.data.prepare_dataset. Optionally logs debug information.

Parameters

Name Type Description Default
cfg DictDefault Dictionary mapping axolotl config keys to values. required
cli_args PreprocessCliArgs | TrainerCliArgs | None Command-specific CLI arguments. None
debug bool Whether to print the tokenization of sample examples. False

Returns

Name Type Description
TrainDatasetMeta Dataclass with fields for training and evaluation datasets and the computed total_num_steps.

load_preference_datasets

common.datasets.load_preference_datasets(cfg, cli_args)

Loads one or more training or evaluation datasets for RL training using paired preference data by calling axolotl.utils.data.rl.load_prepare_preference_datasets. Optionally logs debug information.

Parameters

Name Type Description Default
cfg DictDefault Dictionary mapping axolotl config keys to values. required
cli_args Union[PreprocessCliArgs, TrainerCliArgs] Command-specific CLI arguments. required

Returns

Name Type Description
TrainDatasetMeta Dataclass with fields for training and evaluation datasets and the computed total_num_steps.

sample_dataset

common.datasets.sample_dataset(dataset, num_samples)

Randomly samples num_samples examples from dataset, with replacement.

Parameters

Name Type Description Default
dataset Dataset Dataset to sample from. required
num_samples int Number of samples to return. required

Returns

Name Type Description
Dataset Random sample (with replacement) of examples in dataset.
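
Because the return value is described as a sample with replacement, the same example can appear more than once, and num_samples may exceed the dataset size. The sketch below illustrates that behavior on a plain Python list. The helper `sample_with_replacement` and its `seed` parameter are hypothetical and written only for this illustration; it is not the library's implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_with_replacement(
    items: Sequence[T], num_samples: int, seed: int = 0
) -> List[T]:
    """Hypothetical helper: draw num_samples items, allowing repeats."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    # Each index is drawn independently, so an index can be picked again.
    indices = [rng.randrange(len(items)) for _ in range(num_samples)]
    return [items[i] for i in indices]


rows = ["a", "b", "c"]
# Asking for more samples than rows is valid when sampling with replacement.
picked = sample_with_replacement(rows, num_samples=5)
print(len(picked))
```

A practical consequence is that the returned sample is not guaranteed to contain unique examples.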