# Dataset Loading

## Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.
## Loading Datasets

We use the `datasets` library, with a mix of `load_dataset` and `load_from_disk`, to load datasets. You may recognize that the configs under the `datasets` section of the config file are named similarly to the arguments of `load_dataset`.
```yaml
datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
```
Do not feel overwhelmed by the number of options here; a lot of them are optional. In fact, the most commonly used option is `path`, sometimes combined with `data_files`. These options match the API of `datasets.load_dataset`, so if you're familiar with that, you will feel right at home.
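For illustration, the config keys above map roughly onto `load_dataset` arguments like this (all values below are placeholders):

```python
from datasets import load_dataset

# Rough mapping of the config keys above onto `load_dataset` arguments;
# every value here is a placeholder.
dataset = load_dataset(
    path="org/dataset-name",    # path
    name="default",             # name
    data_files=None,            # data_files
    split="train",              # split
    revision="main",            # revision
    trust_remote_code=False,    # trust_remote_code
)
```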
For HuggingFace's guide on loading different dataset types, see here.

For full details on the config options, see config.qmd.
You can set multiple datasets in the config file by adding more than one entry under `datasets`.
```yaml
datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset
```
## Local dataset

### Files
Usually, to load a JSON file, you would do something like this:
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="data.json")
```
This translates to the following config:
```yaml
datasets:
  - path: json
    data_files: /path/to/your/file.jsonl
```
However, to make things easier, we have added a few shortcuts for loading local dataset files. You can just point `path` to the file or directory, along with `ds_type`, to load the dataset. The example below shows this for a JSON file:
```yaml
datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json
```
This works for CSV, JSON, Parquet, and Arrow files.
If `path` points to a file and `ds_type` is not specified, we will automatically infer the dataset type from the file extension, so you could omit `ds_type` if you'd like.
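As a rough, hypothetical sketch (the helper below is illustrative only, not the actual implementation), the inference is just a file-extension lookup:

```python
from pathlib import Path

# Hypothetical sketch of extension-based inference; the real logic may differ.
EXTENSION_TO_DS_TYPE = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow",
}

def infer_ds_type(path: str) -> str:
    return EXTENSION_TO_DS_TYPE[Path(path).suffix.lower()]

print(infer_ds_type("/path/to/your/file.jsonl"))  # -> json
```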
### Directory

If you're loading a directory, you can point `path` to the directory. Then, you have two options:
#### Loading entire directory

You do not need any additional configs. We will attempt to load in the following order (a rough sketch of this fallback is shown after the example below):

- datasets saved with `datasets.save_to_disk`
- an entire directory of files (such as parquet/arrow files)
```yaml
datasets:
  - path: /path/to/your/directory
```
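Conceptually, this behaves like the sketch below: try `load_from_disk` first, then fall back to loading the directory's files. This is a simplified illustration, not the exact implementation.

```python
from datasets import load_dataset, load_from_disk

def load_directory(directory: str):
    """Simplified illustration of the load order described above."""
    try:
        # 1. datasets saved with `datasets.save_to_disk`
        return load_from_disk(directory)
    except FileNotFoundError:
        # 2. an entire directory of files (e.g. parquet/arrow shards);
        #    the exact error handling in the real loader may differ
        return load_dataset(directory)
```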
#### Loading specific files in directory

Provide `data_files` with a list of files to load.
```yaml
datasets:
  # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

  # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

  # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet
```
## HuggingFace Hub

The method used to load the dataset depends on how it was created: whether a folder of files was uploaded directly, or a HuggingFace Dataset was pushed.
If you're using a private dataset, you will need to enable the `hf_use_auth_token` flag at the root level of the config file.
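You will also need a Hub token available on the machine, e.g. via `huggingface-cli login` or programmatically with `huggingface_hub` (the token value below is a placeholder):

```python
# Store a HuggingFace token locally so private datasets can be pulled.
# The token string is a placeholder; create one in your HuggingFace account settings.
from huggingface_hub import login

login(token="hf_xxx")
```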
### Folder uploaded

This means that the dataset is a single file or multiple files uploaded directly to the Hub.
```yaml
datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl
```
### HuggingFace Dataset

This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via `Dataset.push_to_hub`.
```yaml
datasets:
  - path: org/dataset-name
```
Depending on the dataset, some other configs may be required, such as `name`, `split`, `revision`, and `trust_remote_code`.
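For example, a dataset that needs a config name and a pinned revision would correspond roughly to a call like this (every value is a placeholder):

```python
from datasets import load_dataset

# Hypothetical Hub dataset that needs extra options; all values are placeholders.
dataset = load_dataset(
    "org/dataset-name",
    name="some-config",
    split="train",
    revision="abc123",
    trust_remote_code=True,
)
```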
## Remote Filesystems

Via the `storage_options` config under `load_dataset`, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocol.
```yaml
datasets:
  # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

  # Directory
  - path: s3://bucket-name/path/to/your/directory
```
For a directory, we load via `load_from_disk`.
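Under the hood this relies on the `storage_options` argument that `datasets` forwards to the `fsspec` filesystem. A rough sketch (not the exact call we make) for a single S3 file looks like this, with bucket and key as placeholders and credentials coming from the environment as described in the provider sections below:

```python
from datasets import load_dataset

# Rough equivalent for loading a single remote file; bucket/key are placeholders.
# Credentials are resolved from the environment (see the provider sections below).
dataset = load_dataset(
    "json",
    data_files="s3://bucket-name/path/to/your/file.jsonl",
    storage_options={"anon": False},  # forwarded to the s3fs filesystem
)
```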
### S3

Prepend the path with `s3://`.
The credentials are pulled in the following order:

- the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` environment variables
- the `~/.aws/credentials` file
- for nodes on EC2, the IAM metadata provider
We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.
Other environment variables that can be set can be found in the boto3 docs.
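If you go the environment-variable route, exporting the credentials before launching looks something like this (values are placeholders):

```python
import os

# Placeholder values; alternatively rely on ~/.aws/credentials
# or, on EC2, the IAM metadata provider.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_SESSION_TOKEN"] = "..."  # only needed for temporary credentials
```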
### GCS

Prepend the path with `gs://` or `gcs://`.
The credentials are loaded in the following order:
- gcloud credentials
- for nodes on GCP, the google metadata service
- anonymous access
### Azure

#### Gen 1

Prepend the path with `adl://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_TENANT_ID`
- `AZURE_STORAGE_CLIENT_ID`
- `AZURE_STORAGE_CLIENT_SECRET`
#### Gen 2

Prepend the path with `abfs://` or `az://`.

Ensure you have the following environment variables set:

- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`

Other environment variables that can be set can be found in the adlfs docs.
### OCI

Prepend the path with `oci://`.

It will attempt to read credentials in the following order:

- the `OCIFS_IAM_TYPE`, `OCIFS_CONFIG_LOCATION`, and `OCIFS_CONFIG_PROFILE` environment variables
- when on an OCI resource, the resource principal

Other environment variables:

- `OCI_REGION_METADATA`

Please see the ocifs docs.
### HTTPS

The path should start with `https://`.
```yaml
datasets:
  - path: https://path/to/your/dataset/file.jsonl
```
The file must be publicly accessible.
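This corresponds roughly to passing the URL as `data_files` (the URL here is the placeholder from the example above):

```python
from datasets import load_dataset

# Rough equivalent of the config above; the URL is a placeholder.
dataset = load_dataset("json", data_files="https://path/to/your/dataset/file.jsonl")
```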
## Next steps

Now that you know how to load datasets, you can learn how to map your specific dataset format to your target output format in the dataset formats docs.