common.datasets

Dataset loading utilities.

Classes

Name Description
TrainDatasetMeta Dataclass with fields for training and validation datasets and metadata.

TrainDatasetMeta

common.datasets.TrainDatasetMeta(
    train_dataset,
    eval_dataset=None,
    total_num_steps=None,
)

Dataclass with fields for training and validation datasets and metadata.
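
The signature above has one required field and two fields that default to None. The following is a minimal stand-in sketch, not the library's actual definition: the class name TrainDatasetMetaSketch, the describe helper, and the field types are assumptions made for illustration (the signature gives only names and defaults), and plain lists stand in for real dataset objects.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TrainDatasetMetaSketch:
    """Hypothetical stand-in mirroring the documented signature."""

    train_dataset: Any                     # required field
    eval_dataset: Optional[Any] = None     # defaults to None (type is assumed)
    total_num_steps: Optional[int] = None  # defaults to None (type is assumed)


def describe(meta: TrainDatasetMetaSketch) -> str:
    """Summarize the populated fields, handling the None defaults."""
    if meta.eval_dataset is None:
        eval_part = "no eval set"
    else:
        eval_part = f"{len(meta.eval_dataset)} eval rows"
    if meta.total_num_steps is None:
        steps_part = "steps unknown"
    else:
        steps_part = f"{meta.total_num_steps} steps"
    return f"{len(meta.train_dataset)} train rows, {eval_part}, {steps_part}"


# Only the training dataset is required; the other fields stay None.
meta = TrainDatasetMetaSketch(train_dataset=[{"text": "a"}, {"text": "b"}])
summary = describe(meta)
```

Code that consumes the dataclass should check eval_dataset and total_num_steps for None before using them, since both are optional in the signature.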

Functions

Name Description
load_datasets Loads one or more training or evaluation datasets, calling axolotl.utils.data.prepare_datasets.
load_preference_datasets Loads one or more training or evaluation datasets for RL training using paired preference data.
sample_dataset Randomly sample num_samples samples with replacement from dataset.

load_datasets

common.datasets.load_datasets(cfg, cli_args=None, debug=False)

Loads one or more training or evaluation datasets by calling axolotl.utils.data.prepare_datasets. Optionally logs debug information.

Parameters

Name Type Description Default
cfg DictDefault Dictionary mapping axolotl config keys to values. required
cli_args PreprocessCliArgs | TrainerCliArgs | None Command-specific CLI arguments. None
debug bool Whether to print the tokenization of a sample. This option duplicates settings in cfg and cli_args, but is kept because our Colab notebooks use it. False

Returns

Name Type Description
TrainDatasetMeta Dataclass with fields for training and evaluation datasets and the computed total_num_steps.

load_preference_datasets

common.datasets.load_preference_datasets(cfg, cli_args)

Loads one or more training or evaluation datasets for RL training using paired preference data by calling axolotl.utils.data.rl.prepare_preference_datasets. Optionally logs debug information.

Parameters

Name Type Description Default
cfg DictDefault Dictionary mapping axolotl config keys to values. required
cli_args PreprocessCliArgs | TrainerCliArgs Command-specific CLI arguments. required

Returns

Name Type Description
TrainDatasetMeta Dataclass with fields for training and evaluation datasets and the computed total_num_steps.

sample_dataset

common.datasets.sample_dataset(dataset, num_samples)

Randomly sample num_samples samples with replacement from dataset.
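
Sampling with replacement means each draw is independent, so the same row can appear more than once and num_samples may exceed the dataset size. The following is a minimal sketch of that technique on a plain Python sequence, not the library's implementation: the function name sample_with_replacement and the seed parameter are hypothetical, added here only to make the example reproducible.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_with_replacement(
    dataset: Sequence[T], num_samples: int, seed: Optional[int] = None
) -> List[T]:
    """Draw num_samples rows from dataset, allowing repeats.

    `seed` is a hypothetical convenience parameter for reproducibility.
    """
    rng = random.Random(seed)
    # Each index is drawn independently, so duplicates are possible.
    indices = [rng.randrange(len(dataset)) for _ in range(num_samples)]
    return [dataset[i] for i in indices]


rows = ["a", "b", "c"]
# Requesting more samples than rows is valid when sampling with replacement.
picked = sample_with_replacement(rows, 5, seed=0)
```

With only three distinct rows and five draws, at least one row is guaranteed to repeat. An empty dataset raises ValueError in this sketch because there is no valid index to draw.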