# Streaming Datasets
Streaming enables memory-efficient training with large datasets by loading data incrementally rather than reading the entire dataset into memory at once.
Use streaming when:
- Your dataset is too large to fit in memory (e.g. when you’re doing pretraining with massive text corpora)
- You want to start training immediately without preprocessing the entire dataset
Streaming works with both remote and locally stored datasets!
Note: Streaming currently only supports a single dataset. Multi-dataset support will be added soon.
## Configuration
### Basic Streaming

Enable streaming mode by setting the `streaming` flag:

```yaml
streaming: true
```
### Pretraining with Streaming

For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`:

```yaml
pretraining_dataset:
  - path: HuggingFaceFW/fineweb-edu
    type: pretrain
    text_column: text
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```
### SFT with Streaming

For supervised fine-tuning with streaming:

```yaml
streaming: true

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Optionally, enable sample packing
streaming_multipack_buffer_size: 10000
sample_packing: true
```
## Configuration Options
### `streaming_multipack_buffer_size`

Controls the buffer size for multipack streaming (default: 10,000). This determines how many samples are buffered before packing. Larger buffers can improve packing efficiency but use more memory.
### `shuffle_merged_datasets`

When enabled, shuffles the streaming dataset using the buffer. This requires additional memory for the shuffle buffer.
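As a sketch, a streaming SFT config with shuffling enabled might look like the following (the dataset path here is just the Alpaca example used elsewhere in this page):

```yaml
streaming: true
shuffle_merged_datasets: true  # shuffle via a streaming buffer; uses extra memory

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train
```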
## Sample Packing with Streaming

Sample packing is supported for streaming datasets. When enabled, multiple samples are packed into a single sequence to maximize GPU utilization:

```yaml
sample_packing: true
streaming_multipack_buffer_size: 10000

# For SFT: attention is automatically isolated between packed samples
# For pretraining: control with pretrain_multipack_attn
pretrain_multipack_attn: true # prevent cross-attention between packed samples
```
For more information, see our documentation on multipacking.
## Important Considerations

### Memory Usage

While streaming reduces memory usage compared to loading entire datasets, you still need to consider:

- You can control memory usage by adjusting `streaming_multipack_buffer_size`
- Sample packing requires buffering multiple samples
- Shuffling requires additional memory for the shuffle buffer
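For memory-constrained runs, the points above suggest turning these knobs down. A minimal sketch (the buffer value is illustrative, not a recommendation):

```yaml
streaming: true
sample_packing: true
streaming_multipack_buffer_size: 1000  # smaller buffer: less memory, possibly less efficient packing
shuffle_merged_datasets: false         # skip the shuffle buffer entirely
```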
### Performance

- Streaming may have slightly higher latency compared to preprocessed datasets, as samples are processed on-the-fly
- Network speed and disk read speed are important when streaming from remote sources or a local dataset, respectively
- Consider using `axolotl preprocess` for smaller or more frequently used datasets
## Evaluation Datasets
Evaluation datasets are not streamed to ensure consistent evaluation metrics. They’re loaded normally even when training uses streaming.
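A sketch of pairing a streamed training set with a conventionally loaded evaluation set, assuming the standard `test_datasets` key (the paths and splits below are placeholders):

```yaml
streaming: true

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: train

# Evaluation data is loaded normally, not streamed
test_datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
    split: test
```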
## Examples

See the `examples/streaming/` directory for complete configuration examples:

- `pretrain.yaml`: Pretraining with streaming datasets
- `sft.yaml`: Supervised fine-tuning with streaming