datasets

datasets

Module containing dataset functionality.

We want this to be a wrapper for an existing dataset that we have loaded. Lets use the concept of middlewares to wrap each dataset. We’ll use the collators later on to pad the datasets.

Classes

Name Description
TokenizedPromptDataset Dataset that returns tokenized prompts from a stream of text files.

TokenizedPromptDataset

datasets.TokenizedPromptDataset(
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.

Parameters

Name Type Description Default
prompt_tokenizer PromptTokenizingStrategy The prompt tokenizing method for processing the data. required
dataset Dataset Dataset with text files. required
process_count int | None Number of processes to use for tokenizing. None
keep_in_memory bool | None Whether to keep the tokenized dataset in memory. False