integrations.expert_parallel.plugin
Expert-Parallel (DeepEP) plugin for axolotl.
Classes
| Name | Description |
|---|---|
| ExpertParallelPlugin | Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels. |
ExpertParallelPlugin
integrations.expert_parallel.plugin.ExpertParallelPlugin()
Plugin that swaps MoE dispatch/combine for DeepEP-fused kernels.
Methods
| Name | Description |
|---|---|
| fully_shard_experts | Pre-wrap each Experts module with FSDP on the dp_shard axis. |
| post_model_load | Propagate DDP-ignored params to the outermost model wrapper. |
fully_shard_experts
integrations.expert_parallel.plugin.ExpertParallelPlugin.fully_shard_experts(
model,
dp_shard_mesh,
fsdp2_kwargs,
)
Pre-wrap each Experts module with FSDP on the dp_shard axis.
Called from the patched fsdp2_prepare_model before the outer auto-wrap
runs, so that experts are already FSDPModules and the auto-wrap walker
skips them. The inner wrap inherits the outer wrap's policy (mixed
precision, offload, reshard) so that inner and outer collective dtypes
line up; only the mesh is overridden.
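The "inherit everything, override only the mesh" behaviour described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the helper names `expert_wrap_kwargs` and `shard_experts_first`, the class-name check used to find Experts modules, and the injected `shard_fn` are all assumptions made for the example. In real use `shard_fn` would presumably be PyTorch's FSDP2 `fully_shard`; it is passed in here so the sketch runs without a distributed setup.

```python
from typing import Any, Callable, Iterable, List, Tuple


def expert_wrap_kwargs(fsdp2_kwargs: dict, dp_shard_mesh: Any) -> dict:
    """Copy the outer wrap's kwargs and override only the mesh (hypothetical helper)."""
    inner = dict(fsdp2_kwargs)      # shallow copy so the outer kwargs stay untouched
    inner["mesh"] = dp_shard_mesh   # the single field that differs from the outer wrap
    return inner


def _looks_like_experts(module: Any) -> bool:
    # Assumption: Experts modules are identified by class name in this sketch.
    return type(module).__name__.endswith("Experts")


def shard_experts_first(
    named_modules: Iterable[Tuple[str, Any]],
    dp_shard_mesh: Any,
    fsdp2_kwargs: dict,
    shard_fn: Callable[..., Any],
    is_experts: Callable[[Any], bool] = _looks_like_experts,
) -> List[str]:
    """Wrap every Experts module before the outer auto-wrap sees it."""
    kwargs = expert_wrap_kwargs(fsdp2_kwargs, dp_shard_mesh)
    wrapped = []
    for name, module in named_modules:
        if is_experts(module):
            shard_fn(module, **kwargs)  # e.g. fully_shard(module, mesh=..., mp_policy=...)
            wrapped.append(name)
    return wrapped
```

Copying the kwargs rather than mutating them keeps the outer wrap's configuration intact for the later auto-wrap pass.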
post_model_load
integrations.expert_parallel.plugin.ExpertParallelPlugin.post_model_load(
cfg,
model,
)
Propagate DDP-ignored params to the outermost model wrapper.
post_model_build sets _ddp_params_and_buffers_to_ignore on the inner
model. After PEFT wraps that model in a PeftModel, DDP wraps the
PeftModel and looks for the attribute on the top-level module, which is
now the PeftModel rather than the inner model. This hook mirrors the
list onto the outermost wrapper so that DDP still sees it.
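The mirroring step can be sketched with plain Python objects. This is an illustration under stated assumptions rather than the plugin's implementation: `mirror_ddp_ignore_list` is a hypothetical helper, and the optional `prefix` argument is included only because DDP matches names relative to the module it wraps; the source does not say whether the real hook rewrites names.

```python
DDP_IGNORE_ATTR = "_ddp_params_and_buffers_to_ignore"


def mirror_ddp_ignore_list(inner, outer, prefix=""):
    """Copy the DDP ignore list from `inner` onto `outer` (hypothetical helper).

    `prefix` is an assumption for illustration: it is prepended to each name
    in case the wrapper changes the fully qualified parameter names.
    """
    names = getattr(inner, DDP_IGNORE_ATTR, None)
    if not names:
        return []  # nothing was marked as ignored on the inner model
    mirrored = [prefix + name for name in names]
    # Preserve anything the outer wrapper already lists, without duplicates.
    existing = list(getattr(outer, DDP_IGNORE_ATTR, []))
    merged = existing + [n for n in mirrored if n not in existing]
    setattr(outer, DDP_IGNORE_ATTR, merged)
    return merged


class InnerModel:
    """Stand-in for the model that post_model_build annotated."""


class Wrapper:
    """Stand-in for an outer wrapper such as PeftModel."""

    def __init__(self, inner):
        self.inner = inner


inner = InnerModel()
setattr(inner, DDP_IGNORE_ATTR, ["layers.0.mlp.experts.w1"])
outer = Wrapper(inner)
mirror_ddp_ignore_list(inner, outer)
```

After the call, `outer` carries the same attribute as `inner`, which is where a wrapper such as DDP would look when it is constructed around `outer`.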