integrations.expert_parallel.plugin

Expert-Parallel (DeepEP) plugin for axolotl.

Classes

Name	Description
ExpertParallelPlugin	Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels.

ExpertParallelPlugin

integrations.expert_parallel.plugin.ExpertParallelPlugin()

Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels.

Methods

Name	Description
fully_shard_experts	Pre-wrap each Experts module with FSDP on the `dp_shard` axis.
post_model_load	Propagate DDP-ignored params to the outermost model wrapper.

fully_shard_experts

integrations.expert_parallel.plugin.ExpertParallelPlugin.fully_shard_experts(
    model,
    dp_shard_mesh,
    fsdp2_kwargs,
)

Pre-wrap each Experts module with FSDP on the dp_shard axis.

Called from the patched fsdp2_prepare_model before the outer auto-wrap, so that the experts are already FSDPModules and the auto-wrap walker skips them. The inner wrap inherits the outer wrap’s policy (mixed precision, offload, reshard) so that inner and outer collective dtypes line up; only the mesh is overridden.
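The "inherit everything, override only the mesh" rule can be illustrated without any distributed setup. The following is a minimal sketch, not the plugin's implementation: inner_wrap_kwargs, prewrap_experts, the class-name matching rule, the toy classes and the kwarg keys (mesh, mp_policy, reshard_after_forward) are all assumptions made for illustration, and strings stand in for real meshes and policies.

```python
from typing import Any, Callable, Dict, List


def inner_wrap_kwargs(fsdp2_kwargs: Dict[str, Any], dp_shard_mesh: Any) -> Dict[str, Any]:
    """Copy the outer wrap's kwargs and override only the mesh."""
    inner = dict(fsdp2_kwargs)     # keeps the mp / offload / reshard settings
    inner["mesh"] = dp_shard_mesh  # the one setting that differs from the outer wrap
    return inner


def prewrap_experts(
    named_modules: Dict[str, Any],
    dp_shard_mesh: Any,
    fsdp2_kwargs: Dict[str, Any],
    wrap_fn: Callable[..., None],
) -> List[str]:
    """Wrap every module whose class name ends in 'Experts'; return their names."""
    kwargs = inner_wrap_kwargs(fsdp2_kwargs, dp_shard_mesh)
    wrapped = []
    for name, module in named_modules.items():
        # Hypothetical matching rule; the real plugin may identify experts differently.
        if type(module).__name__.endswith("Experts"):
            wrap_fn(module, **kwargs)
            wrapped.append(name)
    return wrapped


# Stand-in modules and a recording wrap function, for illustration only.
class ToyExperts: ...
class ToyAttention: ...


calls = []
modules = {"layers.0.attn": ToyAttention(), "layers.0.mlp.experts": ToyExperts()}
outer_kwargs = {"mesh": "outer_mesh", "mp_policy": "bf16", "reshard_after_forward": True}

wrapped = prewrap_experts(
    modules, "dp_shard_mesh", outer_kwargs, lambda module, **kw: calls.append(kw)
)
```

In this sketch only the experts module is passed to wrap_fn, and the recorded kwargs differ from outer_kwargs in the mesh entry alone. In a real run, wrap_fn would be the FSDP2 wrapping call and the outer auto-wrap would follow afterwards.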

post_model_load

integrations.expert_parallel.plugin.ExpertParallelPlugin.post_model_load(
    cfg,
    model,
)

Propagate DDP-ignored params to the outermost model wrapper.

post_model_build sets _ddp_params_and_buffers_to_ignore on the inner model. When PEFT is enabled, the inner model is wrapped in PeftModel and DDP then wraps the PeftModel. DDP looks for the attribute on the top-level module it wraps, which is now the PeftModel rather than the inner model, so this method mirrors the list up to the outermost wrapper.
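A minimal sketch of the mirroring step, using plain Python objects in place of the real model and PeftModel. The helper name mirror_ddp_ignore_list, the toy classes and the parameter names are hypothetical; the sketch also copies the names unchanged, which is an assumption, since the description above does not say whether the real method re-qualifies them for the wrapper.

```python
DDP_IGNORE_ATTR = "_ddp_params_and_buffers_to_ignore"


def mirror_ddp_ignore_list(inner, outer) -> None:
    """Copy the DDP ignore list from the inner model onto the outer wrapper."""
    ignored = getattr(inner, DDP_IGNORE_ATTR, None)
    if not ignored or outer is inner:
        return  # nothing to mirror, or no wrapper in between
    existing = list(getattr(outer, DDP_IGNORE_ATTR, []))
    # Append only names the wrapper does not already list, so repeated calls are harmless.
    merged = existing + [name for name in ignored if name not in existing]
    setattr(outer, DDP_IGNORE_ATTR, merged)


class ToyInner:
    """Stand-in for the inner model."""


class ToyWrapper:
    """Stand-in for a wrapper such as PeftModel."""

    def __init__(self, base):
        self.base = base


inner = ToyInner()
setattr(inner, DDP_IGNORE_ATTR, ["experts.w1", "experts.w2"])  # hypothetical names
outer = ToyWrapper(inner)

mirror_ddp_ignore_list(inner, outer)
```

After the call, the wrapper exposes the same attribute as the inner model, which is where DDP would look when it wraps the outermost module.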

Functions

Name	Description
expert_shard_axis	The non-`ep` mesh axis the routed experts FSDP-shard on under EP composition, or `None`.

expert_shard_axis

integrations.expert_parallel.plugin.expert_shard_axis(mesh_dim_names)

The non-ep mesh axis the routed experts FSDP-shard on under EP composition, or None.

Prefers dp_shard (EP×dp_shard: the experts shard on the data axis) and falls back to cp (EP×cp). Under EP×cp, the cp ranks of an ep-group hold the same experts, because cp shards the sequence rather than the experts; FSDP-sharding the experts on cp keeps each rank from holding the full ep-group slice. Returns None for pure EP (no secondary axis) or when there is no ep axis to compose with; those paths do not pre-wrap the experts here.
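The selection order described above can be written as a short pure-Python function. This is a sketch of the documented behavior under the assumption that mesh_dim_names is a sequence of axis-name strings (or None); expert_shard_axis_sketch is a hypothetical name and the real function may differ in detail.

```python
from typing import Optional, Sequence


def expert_shard_axis_sketch(mesh_dim_names: Optional[Sequence[str]]) -> Optional[str]:
    """Return the secondary axis the experts shard on under EP composition, or None."""
    names = tuple(mesh_dim_names or ())
    if "ep" not in names:
        return None  # no ep axis to compose with
    for axis in ("dp_shard", "cp"):  # preference order: dp_shard first, then cp
        if axis in names:
            return axis
    return None  # pure EP: no secondary axis


ep_with_dp_shard = expert_shard_axis_sketch(("dp_shard", "ep"))
ep_with_cp = expert_shard_axis_sketch(("cp", "ep"))
pure_ep = expert_shard_axis_sketch(("ep",))
no_ep = expert_shard_axis_sketch(("dp_shard", "cp"))
```

Here ep_with_dp_shard is "dp_shard", ep_with_cp is "cp", and both pure_ep and no_ep are None, matching the three cases in the description.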