Dataset Loading

Understanding how to load datasets from different sources

Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

Loading Datasets

We use the datasets library to load datasets, calling a mix of load_dataset and load_from_disk depending on the source.

You may recognize that the options under the datasets section of the config file share names with the arguments of load_dataset.

datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
Tip

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common option to use is path, sometimes combined with data_files.

This matches the API of datasets.load_dataset, so if you’re familiar with that, you will feel right at home.

For HuggingFace’s guide to loading different dataset types, see here.

For full details on the config, see config.qmd.

Note

You can set multiple datasets in the config file by adding more than one entry under datasets.

datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset

Local dataset

Files

Usually, to load a JSON file, you would do something like this:

from datasets import load_dataset

dataset = load_dataset("json", data_files="/path/to/your/file.jsonl")

Which translates to the following config:

datasets:
  - path: json
    data_files: /path/to/your/file.jsonl

However, to make things easier, we have added a few shortcuts for loading local dataset files.

You can just point path to the file or directory, along with ds_type, to load the dataset. The example below shows this for a JSON file:

datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json

This works for CSV, JSON, Parquet, and Arrow files.

Tip

If path points to a file and ds_type is not specified, we will automatically infer the dataset type from the file extension, so you could omit ds_type if you’d like.
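
For example, this is enough to load a JSONL file; the json type is inferred from the .jsonl extension:

datasets:
  - path: /path/to/your/file.jsonl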

Directory

If you’re loading from a directory, point path at the directory.

Then, you have two options:

Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

  • datasets saved with datasets.save_to_disk
  • an entire directory of files (such as Parquet/Arrow files)

datasets:
  - path: /path/to/your/directory
Loading specific files in directory

Provide data_files with a list of files to load.

datasets:
    # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

    # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

    # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet

HuggingFace Hub

The method you use to load the dataset depends on how the dataset was created: whether files were uploaded to the Hub directly, or a HuggingFace Dataset was pushed.

Note

If you’re using a private dataset, you will need to enable the hf_use_auth_token flag at the root level of the config file.
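
A minimal sketch of what that looks like (the dataset name org/private-dataset is a placeholder):

hf_use_auth_token: true

datasets:
  - path: org/private-dataset  # placeholder private dataset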

Folder uploaded

This means that the dataset is one or more files that were uploaded directly to the Hub.

datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl

HuggingFace Dataset

This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via push_to_hub.

datasets:
  - path: org/dataset-name
Note

Some other options may be required depending on the dataset, such as name, split, revision, or trust_remote_code.
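
For example, a sketch with some of these options filled in (the values shown are placeholders):

datasets:
  - path: org/dataset-name  # placeholder dataset
    name: default
    split: train
    revision: main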

Remote Filesystems

Via the storage_options passed to load_dataset, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

Warning

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocol.

datasets:
    # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

    # Directory
  - path: s3://bucket-name/path/to/your/directory

For directories, we load via load_from_disk.

S3

Prepend the path with s3://.

The credentials are pulled in the following order:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables
  • from the ~/.aws/credentials file
  • for nodes on EC2, the IAM metadata provider
Note

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

Other environment variables that can be set can be found in the boto3 docs.

GCS

Prepend the path with gs:// or gcs://.
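
For example (the bucket and path are placeholders):

datasets:
  - path: gs://bucket-name/path/to/your/file.jsonl  # placeholder bucket/path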

The credentials are loaded in the following order:

  • gcloud credentials
  • for nodes on GCP, the Google metadata service
  • anonymous access

Azure

Gen 1

Prepend the path with adl://.
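
For example (the store name and path are placeholders):

datasets:
  - path: adl://store-name/path/to/your/file.jsonl  # placeholder store/path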

Ensure you have the following environment variables set:

  • AZURE_STORAGE_TENANT_ID
  • AZURE_STORAGE_CLIENT_ID
  • AZURE_STORAGE_CLIENT_SECRET
Gen 2

Prepend the path with abfs:// or az://.
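
For example (the container and path are placeholders):

datasets:
  - path: abfs://container-name/path/to/your/file.jsonl  # placeholder container/path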

Ensure you have the following environment variables set:

  • AZURE_STORAGE_ACCOUNT_NAME
  • AZURE_STORAGE_ACCOUNT_KEY

Other environment variables that can be set can be found in the adlfs docs.

OCI

Prepend the path with oci://.
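
For example (the bucket, namespace, and path are placeholders):

datasets:
  - path: oci://bucket-name@namespace/path/to/your/file.jsonl  # placeholder bucket/namespace/path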

It will attempt to read credentials in the following order:

  • OCIFS_IAM_TYPE, OCIFS_CONFIG_LOCATION, and OCIFS_CONFIG_PROFILE environment variables
  • when running on an OCI resource, the resource principal

Other environment variables:

  • OCI_REGION_METADATA

Please see the ocifs docs.

HTTPS

The path should start with https://.

datasets:
  - path: https://path/to/your/dataset/file.jsonl

This must be publicly accessible.

Next steps

Now that you know how to load datasets, see the dataset formats docs to learn how to process your specific dataset format into your target output format.