datasets

Module containing Dataset functionality

Classes

| Name | Description |
| --- | --- |
| ConstantLengthDataset | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |

ConstantLengthDataset

datasets.ConstantLengthDataset(tokenizer, datasets, seq_length=2048)

Iterable dataset that returns constant-length chunks of tokens from a stream of text files.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | | The processor used for processing the data. | required |
| datasets | | Dataset with text files. | required |
| seq_length | | Length of token sequences to return. | 2048 |
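
To illustrate what "constant-length chunks from a stream" means, here is a minimal standard-library sketch. It is not the library's implementation, which may pad, mask, or handle leftover tokens differently. The names `constant_length_chunks`, `toy_tokenize`, and `eos_token_id` are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable, Iterator, List


def constant_length_chunks(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_length: int = 2048,
    eos_token_id: int = 0,  # assumed separator id, for illustration only
) -> Iterator[List[int]]:
    """Yield fixed-size token chunks from a stream of texts."""
    buffer: List[int] = []
    for text in texts:
        # Append the tokens of each document, followed by a separator token.
        buffer.extend(tokenize(text))
        buffer.append(eos_token_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Leftover tokens shorter than seq_length are dropped in this sketch.


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one id per whitespace-separated word.
    return [len(word) for word in text.split()]


chunks = list(
    constant_length_chunks(["a bb ccc", "dddd ee"], toy_tokenize, seq_length=4)
)
```

Every yielded chunk has exactly `seq_length` tokens, so chunks can be batched without padding. The trade-off is that document boundaries fall wherever the buffer happens to be cut.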

TokenizedPromptDataset

datasets.TokenizedPromptDataset(
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int \| None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool \| None | Whether to keep the tokenized dataset in memory. | False |
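
The general pattern of a dataset that applies a tokenizing strategy to each record can be sketched with the standard library alone. This is a toy under stated assumptions, not the library's code: `ToyPromptStrategy`, `ToyTokenizedPromptDataset`, the `tokenize_prompt` method name, and the `instruction`/`output` record fields are all hypothetical. The sketch tokenizes sequentially, which is what you would expect when no process count is given.

```python
from typing import Dict, List, Sequence


class ToyPromptStrategy:
    """Hypothetical stand-in for a prompt tokenizing strategy."""

    def tokenize_prompt(self, record: Dict[str, str]) -> Dict[str, List[int]]:
        # Join instruction and response, then map each word to a toy id.
        text = f"{record['instruction']} {record['output']}"
        input_ids = [len(word) for word in text.split()]
        return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


class ToyTokenizedPromptDataset:
    """Map-style dataset that tokenizes every record up front."""

    def __init__(self, prompt_tokenizer, dataset: Sequence[Dict[str, str]]):
        # Sequential tokenization; a process count option would
        # parallelize this step across worker processes.
        self._rows = [prompt_tokenizer.tokenize_prompt(row) for row in dataset]

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        return self._rows[index]


records = [{"instruction": "Say hi", "output": "hi there"}]
tokenized = ToyTokenizedPromptDataset(ToyPromptStrategy(), records)
```

Tokenizing once at construction time keeps `__getitem__` cheap, at the cost of holding all tokenized rows in memory.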