monkeypatch.gradient_checkpointing.offload_disk
DISCO - DIsk-based Storage and Checkpointing with Optimized prefetching
Classes
Name | Description |
---|---|
Disco | Disco: DIsk-based Storage and Checkpointing with Optimized prefetching |
DiskOffloadManager | Manages offloaded tensors and handles prefetching in a separate thread. |
Disco
monkeypatch.gradient_checkpointing.offload_disk.Disco()
Disco: DIsk-based Storage and Checkpointing with Optimized prefetching. Advanced disk-based gradient checkpointer with prefetching.
Methods
Name | Description |
---|---|
backward | Backward pass that loads activations from disk with prefetching |
forward | Forward pass that offloads activations to disk asynchronously |
get_instance | Get or create the offload manager |
backward
monkeypatch.gradient_checkpointing.offload_disk.Disco.backward(
    ctx,
    *grad_outputs,
)
Backward pass that loads activations from disk with prefetching
forward
monkeypatch.gradient_checkpointing.offload_disk.Disco.forward(
    ctx,
    forward_function,
    hidden_states,
    *args,
    prefetch_size=1,
    prefetch_to_gpu=True,
    save_workers=4,
)
Forward pass that offloads activations to disk asynchronously
get_instance
monkeypatch.gradient_checkpointing.offload_disk.Disco.get_instance(
    prefetch_size=1,
    prefetch_to_gpu=True,
    save_workers=4,
)
Get or create the offload manager
DiskOffloadManager
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager(
    prefetch_size=3,
    prefetch_to_gpu=True,
    save_workers=4,
)
Manages offloaded tensors and handles prefetching in a separate thread. Includes synchronization to prevent race conditions.
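The save, wait, load, and cleanup cycle described here can be illustrated with a small standard-library sketch. Everything in it is an assumption made for illustration: `TinyOffloadManager` is a hypothetical stand-in (not the library's `DiskOffloadManager`), plain Python lists stand in for tensors, and `pickle` stands in for whatever serialization the real class uses.

```python
import os
import pickle
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


class TinyOffloadManager:
    """Hypothetical stand-in for illustration; not the library's class."""

    def __init__(self, save_workers=4):
        self.tmp_dir = tempfile.mkdtemp(prefix="offload_sketch_")
        self.pool = ThreadPoolExecutor(max_workers=save_workers)
        self.pending = {}  # file path -> Future for the in-flight write
        self.counter = 0

    def save(self, obj):
        # Return the path right away; the write runs on a worker thread.
        path = os.path.join(self.tmp_dir, f"item_{self.counter}.pkl")
        self.counter += 1
        self.pending[path] = self.pool.submit(self._write, path, obj)
        return path

    @staticmethod
    def _write(path, obj):
        with open(path, "wb") as fh:
            pickle.dump(obj, fh)

    def wait_for_save(self, path, timeout=None):
        # Block until the background write finishes or the timeout expires.
        future = self.pending.get(path)
        if future is not None:
            future.result(timeout=timeout)
            del self.pending[path]

    def load(self, path):
        self.wait_for_save(path)  # never read a half-written file
        with open(path, "rb") as fh:
            return pickle.load(fh)

    def cleanup(self):
        # Finish outstanding writes, then remove every temp file.
        self.pool.shutdown(wait=True)
        shutil.rmtree(self.tmp_dir, ignore_errors=True)


manager = TinyOffloadManager(save_workers=2)
path = manager.save([1.0, 2.0, 3.0])
restored = manager.load(path)
manager.cleanup()
```

Returning the path before the write completes is what lets the caller keep computing while the data is flushed; the price is that every reader must first wait on the pending write, which is the role a method like `wait_for_save` plays.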
Methods
Name | Description |
---|---|
cleanup | Clean up all temp files and stop prefetch thread with proper synchronization |
cleanup_tensor | Clean up a specific tensor file after it’s been used |
load_tensor | Load tensor from disk or prefetch cache with proper synchronization |
save_tensor | Save a tensor to disk asynchronously, using thread-safe operations, and return its file path |
trigger_prefetch | Trigger prefetching of the next N tensors with proper synchronization |
wait_for_save | Wait for a tensor to be saved to disk |
cleanup
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.cleanup()
Clean up all temp files and stop prefetch thread with proper synchronization
cleanup_tensor
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.cleanup_tensor(
    file_path,
)
Clean up a specific tensor file after it’s been used
load_tensor
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.load_tensor(
    file_path,
    target_device='cuda',
)
Load tensor from disk or prefetch cache with proper synchronization
save_tensor
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.save_tensor(
    tensor,
)
Save a tensor to disk asynchronously, using thread-safe operations, and return its file path
trigger_prefetch
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.trigger_prefetch(
    n=None,
)
Trigger prefetching of the next N tensors with proper synchronization
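As a rough illustration of prefetching the next N items on a background thread, here is a minimal sketch. It assumes the backward pass consumes activations in the reverse of the order they were saved, and it uses a hypothetical `PrefetchCache` class with a dictionary lookup in place of real disk reads; none of these names come from the library.

```python
import threading
from collections import deque


class PrefetchCache:
    """Hypothetical prefetching cache for illustration; not the library's code."""

    def __init__(self, loader, prefetch_size=1):
        self.loader = loader          # callable: key -> value (e.g. a disk read)
        self.prefetch_size = prefetch_size
        self.order = deque()          # keys in the order they will be needed
        self.cache = {}
        self.lock = threading.Lock()

    def register(self, key):
        # Assumption: the most recently saved key is needed first.
        with self.lock:
            self.order.appendleft(key)

    def trigger_prefetch(self, n=None):
        n = self.prefetch_size if n is None else n
        with self.lock:
            wanted = [k for k in self.order if k not in self.cache][:n]
        worker = threading.Thread(target=self._fill, args=(wanted,))
        worker.start()
        return worker

    def _fill(self, keys):
        for key in keys:
            value = self.loader(key)  # slow I/O happens outside the lock
            with self.lock:
                if key in self.order:  # skip keys consumed in the meantime
                    self.cache[key] = value

    def load(self, key):
        with self.lock:
            if key in self.order:
                self.order.remove(key)
            hit = key in self.cache
            value = self.cache.pop(key, None)
        # Fall back to a direct load on a cache miss.
        return value if hit else self.loader(key)


store = {"layer0": "a0", "layer1": "a1", "layer2": "a2"}
loads = []


def slow_loader(key):
    loads.append(key)  # record every real load
    return store[key]


cache = PrefetchCache(slow_loader, prefetch_size=2)
for name in ("layer0", "layer1", "layer2"):
    cache.register(name)
cache.trigger_prefetch().join()
prefetched = sorted(cache.cache)
first = cache.load("layer2")
```

The lock guards only the bookkeeping, so the slow read never blocks the consumer. A production implementation would also need to handle a `load` that arrives while the same key is still being prefetched, which this sketch sidesteps by joining the worker first.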
wait_for_save
monkeypatch.gradient_checkpointing.offload_disk.DiskOffloadManager.wait_for_save(
    file_path,
    timeout=None,
)
Wait for a tensor to be saved to disk