datasets
Module containing dataset functionality.
We want this to be a wrapper for an existing dataset that we have loaded. Let's use the concept of middleware to wrap each dataset. We'll use the collators later on to pad the datasets.
Classes
| Name | Description |
|---|---|
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)
Dataset that returns tokenized prompts from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int \| None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool \| None | Whether to keep the tokenized dataset in memory. | False |
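To illustrate the wrapping idea, here is a minimal, self-contained sketch. It is not the library's actual implementation: `FakePromptTokenizingStrategy` and `WrappedDataset` are hypothetical stand-ins for `PromptTokenizingStrategy` and `TokenizedPromptDataset`, written with only the standard library so the middleware pattern is visible in isolation.

```python
class FakePromptTokenizingStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy:
    here it simply splits text on whitespace."""

    def tokenize_prompt(self, example):
        return {"input_ids": example["text"].split()}


class WrappedDataset:
    """Hypothetical stand-in for TokenizedPromptDataset: wraps an
    underlying dataset of raw text rows and applies the tokenizing
    strategy to each row on access, like a middleware layer."""

    def __init__(self, prompt_tokenizer, dataset):
        self.prompt_tokenizer = prompt_tokenizer
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Tokenization happens at access time; padding is deferred
        # to a collator at batching time.
        return self.prompt_tokenizer.tokenize_prompt(self.dataset[idx])


raw = [{"text": "hello world"}, {"text": "tokenize me"}]
ds = WrappedDataset(FakePromptTokenizingStrategy(), raw)
print(ds[0])  # {'input_ids': ['hello', 'world']}
```

Note that this sketch tokenizes lazily per item; the real class eagerly maps the tokenizing strategy over the dataset (optionally with `process_count` workers), which is why the padding step is left to the collators mentioned above.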