Gradient Checkpointing and Activation Offloading

Gradient checkpointing and activation offloading reduce the GPU memory footprint of training deep learning models. Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass, trading extra computation for memory; activation offloading moves saved activations off the GPU (to CPU memory or disk) and brings them back when the backward pass needs them.

Enabling Gradient Checkpointing

gradient_checkpointing: true
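
Conceptually, gradient checkpointing wraps blocks of the model so that their intermediate activations are recomputed during the backward pass instead of being kept in memory. Below is a minimal PyTorch sketch of the idea, assuming a PyTorch-based trainer; the Block module, dimensions, and loop are invented for illustration, not taken from this library:

import torch
from torch.utils.checkpoint import checkpoint

# Illustrative block; the module and sizes are placeholders.
class Block(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.net(x)

blocks = torch.nn.ModuleList([Block(128) for _ in range(4)])
x = torch.randn(8, 128, requires_grad=True)

# Each checkpointed block stores only its input; intermediate
# activations are recomputed on the fly during backward, trading
# extra compute for a smaller memory footprint.
for block in blocks:
    x = checkpoint(block, x, use_reentrant=False)

x.sum().backward()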

Enabling Activation Offloading

gradient_checkpointing: true  # required for activation offloading
activation_offloading: true

Activation offloading variants:

The default, activation_offloading: true, offloads activations to CPU memory and uses CUDA streams to overlap the offloading transfers with computation (a PyTorch-level analogue is sketched after this list).

activation_offloading: legacy naively offloads activations to CPU without additional optimizations.

For resource-constrained environments with limited CPU memory, activation_offloading: disk offloads activations to disk instead of CPU RAM, so that much larger context lengths can be trained with minimal memory (see the disk-offloading sketch below).
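
For intuition about CPU offloading, here is a minimal PyTorch-level analogue using torch.autograd.graph.save_on_cpu. This is only a sketch, not the implementation behind activation_offloading: true, which additionally manages dedicated CUDA streams to overlap transfers with compute; the model and batch below are placeholders:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 128),
)
batch = torch.randn(8, 128)

# Saved activations are copied to pinned CPU memory during forward and
# copied back to the original device on demand during backward; pinned
# buffers allow the copies to run asynchronously.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(batch).sum()
loss.backward()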
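
Similarly, disk offloading can be approximated with torch.autograd.graph.saved_tensors_hooks, following the pack/unpack pattern shown in the PyTorch autograd documentation. This is only an approximation of activation_offloading: disk; the spill directory and helper names are invented, and the sketch assumes a single backward pass, since files are deleted as they are read back:

import os
import tempfile
import uuid

import torch

model = torch.nn.Linear(128, 128)  # stand-in model
batch = torch.randn(8, 128)

spill_dir = tempfile.mkdtemp(prefix="activations_")  # invented spill location

def pack_to_disk(tensor):
    # Write the saved activation to disk and keep only its file path.
    path = os.path.join(spill_dir, f"{uuid.uuid4().hex}.pt")
    torch.save(tensor, path)
    return path

def unpack_from_disk(path):
    # Reload the activation when backward needs it, then clean up.
    tensor = torch.load(path)
    os.remove(path)
    return tensor

with torch.autograd.graph.saved_tensors_hooks(pack_to_disk, unpack_from_disk):
    loss = model(batch).sum()
loss.backward()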