datasets
Module containing Dataset functionality
Classes
| Name | Description |
|---|---|
| ConstantLengthDataset | Iterable dataset that returns constant-length chunks of tokens from a stream of text files. |
| TokenizedPromptDataset | Dataset that returns tokenized prompts from a stream of text files. |
ConstantLengthDataset
datasets.ConstantLengthDataset(tokenizer, datasets, seq_length=2048)
Iterable dataset that returns constant-length chunks of tokens from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | | The processor used for processing the data. | required |
| datasets | | Dataset with text files. | required |
| seq_length | int | Length of token sequences to return. | 2048 |
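The chunking behavior described above can be sketched in plain Python. This is a simplified, self-contained illustration of the idea (buffer token ids from a text stream, emit fixed-size chunks), not the class's actual implementation; the real class wraps a Hugging Face tokenizer and dataset objects, and `toy_tokenizer` here is a made-up stand-in.

```python
from collections.abc import Iterable, Iterator


def constant_length_chunks(
    texts: Iterable[str],
    tokenizer,
    seq_length: int = 2048,
) -> Iterator[list[int]]:
    """Buffer token ids from a text stream and yield fixed-size chunks.

    A minimal sketch of what iterating a ConstantLengthDataset does:
    tokens accumulate in a buffer, full chunks of ``seq_length`` are
    yielded, and a trailing remainder shorter than ``seq_length`` is
    dropped.
    """
    buffer: list[int] = []
    for text in texts:
        buffer.extend(tokenizer(text))
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]


def toy_tokenizer(text: str) -> list[int]:
    # Hypothetical tokenizer for illustration: one id per word.
    return [len(word) for word in text.split()]
```

For example, two texts totaling seven tokens with `seq_length=4` produce one chunk of four ids; the remaining three are discarded.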
TokenizedPromptDataset
datasets.TokenizedPromptDataset(
    prompt_tokenizer,
    dataset,
    process_count=None,
    keep_in_memory=False,
    **kwargs,
)
Dataset that returns tokenized prompts from a stream of text files.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| prompt_tokenizer | PromptTokenizingStrategy | The prompt tokenizing method for processing the data. | required |
| dataset | Dataset | Dataset with text files. | required |
| process_count | int \| None | Number of processes to use for tokenizing. | None |
| keep_in_memory | bool \| None | Whether to keep the tokenized dataset in memory. | False |