datasets
Module containing dataset functionality.
We want this to be a wrapper for an existing dataset that we have already loaded. Let's use the concept of middleware to wrap each dataset. We'll use the collators later on to pad the datasets.
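The middleware idea can be sketched in plain Python: a wrapper holds an underlying dataset and applies a transform to each item on access, leaving the wrapped data untouched. This is an illustrative stdlib-only sketch, not the module's implementation; the names `TransformedDataset` and `toy_tokenize` are hypothetical.

```python
class TransformedDataset:
    """Wrap an existing dataset and apply a transform to each item on access.

    Illustrative middleware-style sketch; the real classes in this module
    wrap Hugging Face datasets rather than plain Python lists.
    """

    def __init__(self, dataset, transform):
        self.dataset = dataset      # the dataset being wrapped
        self.transform = transform  # e.g. a tokenizing function

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Apply the transform lazily, only when an item is requested.
        return self.transform(self.dataset[idx])


def toy_tokenize(text):
    # Stand-in "tokenizer": split on whitespace.
    return text.split()


texts = ["hello world", "wrap each dataset"]
wrapped = TransformedDataset(texts, toy_tokenize)
print(wrapped[1])  # ['wrap', 'each', 'dataset']
```

Because padding is deferred to the collators, the wrapper never needs to know the final batch shape; it only maps one raw example to one tokenized example.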
Classes
| Name | Description |
|---|---|
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
prompt_tokenizer,
dataset,
process_count=None,
keep_in_memory=False,
**kwargs,
)

Dataset that returns tokenized prompts from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int \| None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool \| None | Whether to keep the tokenized dataset in memory. | False |
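To show how the documented constructor signature is used without depending on the library being installed, here is a hedged sketch with hypothetical stand-ins: `FakeTokenizingStrategy` and `FakeTokenizedPromptDataset` are mocks that only mirror the parameter list above, not the real classes.

```python
class FakeTokenizingStrategy:
    """Hypothetical stand-in for a PromptTokenizingStrategy (whitespace split)."""

    def tokenize_prompt(self, prompt):
        return {"input_ids": prompt["text"].split()}


class FakeTokenizedPromptDataset:
    """Minimal mock mirroring the documented constructor signature."""

    def __init__(self, prompt_tokenizer, dataset, process_count=None,
                 keep_in_memory=False, **kwargs):
        # Eagerly tokenize every row with the supplied strategy; the real
        # class would honor process_count and keep_in_memory when mapping.
        self.examples = [prompt_tokenizer.tokenize_prompt(row) for row in dataset]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


rows = [{"text": "pad the datasets"}, {"text": "hello"}]
ds = FakeTokenizedPromptDataset(FakeTokenizingStrategy(), rows)
print(ds[0])  # {'input_ids': ['pad', 'the', 'datasets']}
```

The call shape is the point here: a tokenizing strategy plus a dataset of text rows in, a dataset of tokenized examples out, with padding still left to the collators.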