Dataset Loading

Understanding how to load datasets from different sources

Overview

Datasets can be loaded in a number of different ways depending on how they are saved (the file extension) and where they are stored.

Loading Datasets

We use the datasets library to load datasets, calling a mix of load_dataset and load_from_disk depending on the source.

You may recognize that the options under the datasets section of the config file share names with the arguments of load_dataset.

datasets:
  - path:
    name:
    data_files:
    split:
    revision:
    trust_remote_code:
Tip

Do not feel overwhelmed by the number of options here. Most of them are optional. In fact, the most common option to use is path, sometimes combined with data_files.

This matches the API of datasets.load_dataset, so if you’re familiar with that, you will feel right at home.

For HuggingFace’s guide to loading different dataset types, see here.

For full details on the config, see config.qmd.

Note

You can set multiple datasets in the config file by adding more than one entry under datasets.

datasets:
  - path: /path/to/your/dataset
  - path: /path/to/your/other/dataset

Local dataset

Files

Usually, to load a JSON file, you would do something like this:

from datasets import load_dataset

dataset = load_dataset("json", data_files="/path/to/your/file.jsonl")

Which translates to the following config:

datasets:
  - path: json
    data_files: /path/to/your/file.jsonl

However, to make things easier, we have added a few shortcuts for loading local dataset files.

You can just point path to the file or directory, along with ds_type, to load the dataset. The example below shows this for a JSON file:

datasets:
  - path: /path/to/your/file.jsonl
    ds_type: json

This works for CSV, JSON, Parquet, and Arrow files.

Tip

If path points to a file and ds_type is not specified, we will automatically infer the dataset type from the file extension, so you could omit ds_type if you’d like.
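
For example, this is enough to load a JSONL file; the json type is inferred from the .jsonl extension:

datasets:
  - path: /path/to/your/file.jsonl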

Directory

If you’re loading from a directory, point path at the directory.

Then, you have two options:

Loading entire directory

You do not need any additional configs.

We will attempt to load in the following order:

  • datasets saved with datasets.save_to_disk
  • an entire directory of files (such as Parquet/Arrow files)

datasets:
  - path: /path/to/your/directory
Loading specific files in directory

Provide data_files with a list of files to load.

datasets:
    # single file
  - path: /path/to/your/directory
    ds_type: csv
    data_files: file1.csv

    # multiple files
  - path: /path/to/your/directory
    ds_type: json
    data_files:
      - file1.jsonl
      - file2.jsonl

    # multiple files for parquet
  - path: /path/to/your/directory
    ds_type: parquet
    data_files:
      - file1.parquet
      - file2.parquet

HuggingFace Hub

The method you use to load the dataset depends on how the dataset was created: whether files were uploaded to the Hub directly, or a HuggingFace Dataset was pushed.

Note

If you’re using a private dataset, you will need to enable the hf_use_auth_token flag at the root level of the config file.
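
A minimal sketch of what that looks like (the dataset name org/private-dataset is a placeholder):

hf_use_auth_token: true

datasets:
  - path: org/private-dataset  # placeholder private dataset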

Folder uploaded

This means that the dataset is one or more files that were uploaded directly to the Hub.

datasets:
  - path: org/dataset-name
    data_files:
      - file1.jsonl
      - file2.jsonl

HuggingFace Dataset

This means that the dataset was created as a HuggingFace Dataset and pushed to the Hub via push_to_hub.

datasets:
  - path: org/dataset-name
Note

Some other options may be required depending on the dataset, such as name, split, revision, or trust_remote_code.
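
For example, a sketch with some of these options filled in (the values shown are placeholders):

datasets:
  - path: org/dataset-name  # placeholder dataset
    name: default
    split: train
    revision: main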

Remote Filesystems

Via the storage_options passed to load_dataset, you can load datasets from remote filesystems like S3, GCS, Azure, and OCI.

Warning

This is currently experimental. Please let us know if you run into any issues!

The only difference between the providers is that you need to prepend the path with the respective protocol.

datasets:
    # Single file
  - path: s3://bucket-name/path/to/your/file.jsonl

    # Directory
  - path: s3://bucket-name/path/to/your/directory

For directories, we load via load_from_disk.

S3

Prepend the path with s3://.

The credentials are pulled in the following order:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables
  • from the ~/.aws/credentials file
  • for nodes on EC2, the IAM metadata provider
Note

We assume you have credentials set up and are not using anonymous access. If you want to use anonymous access, let us know! We may have to open a config option for this.

Other environment variables that can be set can be found in the boto3 docs.

GCS

Prepend the path with gs:// or gcs://.
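
For example (the bucket and path are placeholders):

datasets:
  - path: gs://bucket-name/path/to/your/file.jsonl  # placeholder bucket/path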

The credentials are loaded in the following order:

  • gcloud credentials
  • for nodes on GCP, the Google metadata service
  • anonymous access

Azure

Gen 1

Prepend the path with adl://.
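
For example (the store name and path are placeholders):

datasets:
  - path: adl://store-name/path/to/your/file.jsonl  # placeholder store/path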

Ensure you have the following environment variables set:

  • AZURE_STORAGE_TENANT_ID
  • AZURE_STORAGE_CLIENT_ID
  • AZURE_STORAGE_CLIENT_SECRET
Gen 2

Prepend the path with abfs:// or az://.
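
For example (the container and path are placeholders):

datasets:
  - path: abfs://container-name/path/to/your/file.jsonl  # placeholder container/path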

Ensure you have the following environment variables set:

  • AZURE_STORAGE_ACCOUNT_NAME
  • AZURE_STORAGE_ACCOUNT_KEY

Other environment variables that can be set can be found in the adlfs docs.

OCI

Prepend the path with oci://.
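
For example (the bucket, namespace, and path are placeholders):

datasets:
  - path: oci://bucket-name@namespace/path/to/your/file.jsonl  # placeholder bucket/namespace/path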

It will attempt to read credentials in the following order:

  • OCIFS_IAM_TYPE, OCIFS_CONFIG_LOCATION, and OCIFS_CONFIG_PROFILE environment variables
  • when running on an OCI resource, the resource principal

Other environment variables:

  • OCI_REGION_METADATA

Please see the ocifs docs.

HTTPS

The path should start with https://.

datasets:
  - path: https://path/to/your/dataset/file.jsonl

This must be publicly accessible.

Next steps

Now that you know how to load datasets, see the dataset formats docs to learn how to process your specific dataset format into your target output format.