integrations.expert_parallel.plugin

Expert-Parallel (DeepEP) plugin for axolotl.

Classes

Name Description
ExpertParallelPlugin Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels.

ExpertParallelPlugin

integrations.expert_parallel.plugin.ExpertParallelPlugin()

Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels.

Methods

Name Description
fully_shard_experts Pre-wrap each Experts module with FSDP on the dp_shard axis.
post_model_load Propagate DDP-ignored params to the outermost model wrapper.

fully_shard_experts

integrations.expert_parallel.plugin.ExpertParallelPlugin.fully_shard_experts(
    model,
    dp_shard_mesh,
    fsdp2_kwargs,
)

Pre-wrap each Experts module with FSDP on the dp_shard axis.

Called from the patched fsdp2_prepare_model before the outer auto-wrap runs, so that each Experts module is already an FSDPModule and the auto-wrap walker skips it. The inner wrap inherits the outer wrap's policy (mixed precision, offload, reshard) so that the inner and outer collective dtypes line up; only the mesh is overridden.
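The pre-wrap step described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the helper names `build_expert_kwargs` and `prewrap_experts`, the injected `shard_fn` argument, and the rule of detecting expert modules by a class name ending in `Experts` are all assumptions made for the example. In a real run, `shard_fn` would presumably be PyTorch's `fully_shard`.

```python
from typing import Any, Callable, Optional


def build_expert_kwargs(fsdp2_kwargs: dict, dp_shard_mesh: Any) -> dict:
    """Copy the outer wrap's kwargs and override only the mesh."""
    inner = dict(fsdp2_kwargs)  # shallow copy, the caller's dict is left untouched
    inner["mesh"] = dp_shard_mesh
    return inner


def prewrap_experts(
    model: Any,
    dp_shard_mesh: Any,
    fsdp2_kwargs: dict,
    shard_fn: Callable[..., Any],
    is_expert: Optional[Callable[[Any], bool]] = None,
) -> int:
    """Apply shard_fn to every expert module and return how many were wrapped.

    Hypothetical helper: `model` only needs a `.modules()` iterator, as
    torch.nn.Module provides.
    """
    if is_expert is None:
        # Assumption: expert containers are recognisable by their class name.
        is_expert = lambda m: type(m).__name__.endswith("Experts")

    inner_kwargs = build_expert_kwargs(fsdp2_kwargs, dp_shard_mesh)

    # Collect targets first, since shard_fn may modify modules in place.
    targets = [m for m in model.modules() if is_expert(m)]
    for module in targets:
        shard_fn(module, **inner_kwargs)
    return len(targets)
```

Because the inner kwargs are a copy of the outer ones with a single key replaced, any policy passed to the outer wrap is applied unchanged to the experts, which is the property the docstring relies on for matching collective dtypes.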

post_model_load

integrations.expert_parallel.plugin.ExpertParallelPlugin.post_model_load(
    cfg,
    model,
)

Propagate DDP-ignored params to the outermost model wrapper.

post_model_build sets _ddp_params_and_buffers_to_ignore on the inner model. PEFT then wraps that model in a PeftModel, and DDP wraps the PeftModel. DDP reads the attribute only from the top-level module it wraps, which is now the PeftModel rather than the inner model, so this hook mirrors the list up to the outermost wrapper.
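A minimal sketch of the mirroring step, assuming nothing beyond the attribute name given above. The function `mirror_ddp_ignore_list` and its `prefix` argument are hypothetical. The prefix is included because DDP matches ignored names relative to the module it wraps, so names recorded on the inner model may need the wrapper's path prepended; whether the plugin does this is not stated here.

```python
from typing import Any, List

IGNORE_ATTR = "_ddp_params_and_buffers_to_ignore"


def mirror_ddp_ignore_list(inner_model: Any, outer_model: Any, prefix: str = "") -> List[str]:
    """Copy the DDP ignore list from inner_model onto outer_model.

    Hypothetical helper. `prefix` is prepended to each name so that the
    names stay valid relative to the outer wrapper (assumption).
    Returns the list that ends up on outer_model.
    """
    names = list(getattr(inner_model, IGNORE_ATTR, None) or [])
    if outer_model is inner_model or not names:
        # Nothing to mirror: no wrapper, or no ignored params recorded.
        return list(getattr(outer_model, IGNORE_ATTR, None) or [])

    prefixed = [f"{prefix}{name}" for name in names]

    # Keep anything already on the wrapper and append new names without duplicates.
    merged = list(getattr(outer_model, IGNORE_ATTR, None) or [])
    for name in prefixed:
        if name not in merged:
            merged.append(name)

    setattr(outer_model, IGNORE_ATTR, merged)
    return merged
```

With the attribute present on the outermost module, DDP can pick it up when it is constructed around that module.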