Streaming Datasets

How to use streaming mode for large-scale datasets and memory-efficient training

Streaming enables memory-efficient training with large datasets by loading data incrementally rather than loading the entire dataset into memory at once.

Use streaming when:

  • Your dataset is too large to fit in memory
  • You want to start training right away without a separate preprocessing step

Streaming works with both remote and locally stored datasets!

Note

Streaming currently only supports a single dataset. Multi-dataset support will be added soon.

Configuration

Basic Streaming

Enable streaming mode by setting the streaming flag:

streaming: true

Pretraining with Streaming

For pretraining tasks, streaming is automatically enabled when using pretraining_dataset:

pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
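
Because a streamed dataset has no predetermined length, the trainer cannot derive the number of steps per epoch, so an explicit step budget is typically set alongside it. A minimal sketch (the value is illustrative):

max_steps: 5000  # set explicitly because the streamed dataset length is unknown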

SFT with Streaming

For supervised fine-tuning with streaming:

streaming: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
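
Streaming also works with locally stored data. A sketch using an illustrative local JSONL file (the path and ds_type below are placeholders for your own data):

streaming: true
datasets:
  - path: ./data/alpaca_local.jsonl  # placeholder local file
    ds_type: json
    type: alpaca
    split: train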

Configuration Options

streaming_multipack_buffer_size

Controls the buffer size for multipack streaming (default: 10,000). This determines how many samples are buffered before packing. Larger buffers can improve packing efficiency but use more memory.
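
For example, you might shrink the buffer on memory-constrained machines or grow it for tighter packing; the values below are illustrative:

streaming_multipack_buffer_size: 5000     # smaller buffer: less memory, potentially looser packing
# streaming_multipack_buffer_size: 20000  # larger buffer: more memory, better packing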

shuffle_merged_datasets

When enabled, the streaming dataset is shuffled using a shuffle buffer, which requires additional memory to hold the buffered samples.
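
If the extra memory is a concern, the option can be toggled explicitly; a short sketch:

shuffle_merged_datasets: true     # shuffle streamed samples via a shuffle buffer (uses extra memory)
# shuffle_merged_datasets: false  # disable shuffling to save memory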

Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are packed into a single sequence to maximize GPU utilization:

sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true  # prevent cross-attention between packed samples

For more information, see our documentation on multipacking.

Important Considerations

Memory Usage

While streaming reduces memory usage compared to loading entire datasets, you still need to consider:

  • The multipack buffer: streaming_multipack_buffer_size controls how many samples are held in memory, so lowering it reduces memory usage
  • Sample packing requires buffering multiple samples before they can be packed
  • Shuffling requires additional memory for the shuffle buffer

Performance

  • Streaming may have slightly higher latency than preprocessed datasets because samples are processed on the fly
  • Network speed matters when streaming from remote sources; disk read speed matters for locally stored datasets
  • For smaller or frequently reused datasets, consider preprocessing ahead of time with axolotl preprocess (see the sketch below)
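
A sketch of that preprocessing workflow, assuming the standard axolotl preprocess CLI entry point and an illustrative output path:

# Preprocess once, then reuse the tokenized data on subsequent runs:
#   axolotl preprocess config.yaml
dataset_prepared_path: ./last_run_prepared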

Evaluation Datasets

Evaluation datasets are not streamed to ensure consistent evaluation metrics. They’re loaded normally even when training uses streaming.
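
For example, training can stream while evaluation reads a regular dataset; a sketch assuming the test_datasets config key and a placeholder evaluation dataset:

streaming: true                    # training data is streamed

test_datasets:                     # evaluation data is loaded normally, not streamed
  - path: your-org/eval-dataset    # placeholder path
    type: alpaca
    split: test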

Examples

See the examples/streaming/ directory for complete configuration examples:

  • pretrain.yaml: Pretraining with streaming dataset
  • sft.yaml: Supervised fine-tuning with streaming