integrations.expert_parallel.shard

Generic expert-weight sharding for @use_experts_implementation modules.

After this runs (in post_model_build, before FSDP wraps), each rank’s Experts modules hold only their local slice of the experts dim. The registered deep_ep_* forward function then handles dispatch -> local compute -> combine.

Functions

Name	Description
ep_adapter_load_local_shard	Slice an EP-composition expert-LoRA adapter from rank 0’s global (all-experts) tensor down to this rank’s local FSDP shard.
gather_expert_lora_full	Inverse of the EP LoRA slice: all-gather a local-experts LoRA tensor across the EP group.
save_ep_lora_adapter	Write a complete LoRA adapter when experts are EP-sharded.
save_fsdp2_lora_adapter	Write a complete LoRA adapter under FSDP2 WITHOUT expert parallelism.
shard_expert_lora	Slice PEFT `target_parameters` expert LoRA to each rank’s local experts.
shard_expert_weights	Slice expert weights along dim 0 per the EP rank.

ep_adapter_load_local_shard

integrations.expert_parallel.shard.ep_adapter_load_local_shard(
    global_adapter,
    ep_dim,
    e_global,
    ep_coord,
    ep_size,
    placements,
    dp_size,
    dp_rank,
)

Slice an EP-composition expert-LoRA adapter from rank 0’s global (all-experts) tensor down to this rank’s local FSDP shard. This is the inverse of shard_expert_lora plus the FSDP dp/cp sharding, and it is used by the cpu_ram_efficient load path. The slice is taken in two steps: first this EP group’s experts, then this rank’s dp/cp shard along each Shard placement.

ep_dim is the adapter’s expert axis: 0 for lora_A’s expert-major [E*r, in] rows, 1 for lora_B’s [out, r*E] columns. lora_B’s experts are the last axis of the [out, r, E] view and are not contiguous in the flat r*E dim, so a plain chunk on dim 1 would pick a rank component rather than this EP group’s experts. The function therefore uses a reshape-then-slice that mirrors _slice_expert_lora_param.
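The layout difference can be illustrated with plain index arithmetic. The sketch below is not the library’s implementation: the helper names are hypothetical, r stands for the LoRA rank, and each EP coordinate is assumed to own a contiguous block of e_global // ep_size experts. It lists which rows of lora_A and which columns of lora_B belong to one EP coordinate, and contrasts that with what a plain chunk on dim 1 would select.

```python
def lora_a_local_rows(e_global, r, ep_coord, ep_size):
    """Rows owned by one EP coordinate in lora_A's expert-major [E*r, in] layout."""
    e_local = e_global // ep_size          # assumed: equal contiguous blocks
    start = ep_coord * e_local * r
    return list(range(start, start + e_local * r))


def lora_b_local_cols(e_global, r, ep_coord, ep_size):
    """Columns owned by one EP coordinate in lora_B's flat [out, r*E] layout.

    Viewing the flat axis as [r, E], column c maps to (c // E, c % E), so one
    group's experts recur once per rank component instead of forming one block.
    """
    e_local = e_global // ep_size
    offset = ep_coord * e_local
    return [k * e_global + e
            for k in range(r)
            for e in range(offset, offset + e_local)]


def naive_chunk_cols(e_global, r, ep_coord, ep_size):
    """What a plain chunk along dim 1 would select (for contrast only)."""
    width = (r * e_global) // ep_size
    return list(range(ep_coord * width, (ep_coord + 1) * width))


# E=4 experts, LoRA rank r=2, two EP groups; EP coordinate 1 owns experts 2 and 3.
rows_a = lora_a_local_rows(4, 2, 1, 2)
cols_b = lora_b_local_cols(4, 2, 1, 2)
naive_b = naive_chunk_cols(4, 2, 1, 2)
experts_hit_by_naive = sorted({c % 4 for c in naive_b})
```

With these values the lora_A rows form one block (4 to 7), the lora_B columns are strided (2, 3, 6, 7), and the naive chunk covers all four experts for a single rank component.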

gather_expert_lora_full

integrations.expert_parallel.shard.gather_expert_lora_full(
    local,
    kind,
    e_global,
    ep_group,
)

Inverse of the EP LoRA slice: all-gather a local-experts LoRA tensor across the EP group and reassemble the full e_global-expert tensor in the PEFT layout.

kind="A" (expert-major [E*r, in]): concat gathered slices along rows.
kind="B" (rank-major [out, r*E]): place each rank’s experts into the E axis of [out, r, E] and flatten.
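The two reassembly rules can be sketched on plain lists, with string labels standing in for rows or columns. This is only an illustration under assumptions: gather_a and gather_b are hypothetical helpers, the per-rank slices are assumed to arrive in EP-rank order (as an all-gather would return them), and for kind="B" a single output row is shown.

```python
def gather_a(slices):
    """kind="A": concatenate the per-rank row blocks in EP-rank order."""
    full = []
    for rows in slices:
        full.extend(rows)
    return full


def gather_b(slices, r):
    """kind="B": place each rank's experts into the E axis of [r, E], then flatten.

    Each entry of `slices` is one rank's flat [r * E_local] column list.
    """
    e_local = len(slices[0]) // r
    full = []
    for k in range(r):                     # rank component (outer axis)
        for cols in slices:                # EP ranks in order keeps experts in global order
            full.extend(cols[k * e_local:(k + 1) * e_local])
    return full


# Labels read "k<rank component>e<expert>"; E=4, r=2, two EP ranks.
rank0_b = ["k0e0", "k0e1", "k1e0", "k1e1"]
rank1_b = ["k0e2", "k0e3", "k1e2", "k1e3"]
full_b = gather_b([rank0_b, rank1_b], r=2)

rank0_a = ["e0k0", "e0k1", "e1k0", "e1k1"]
rank1_a = ["e2k0", "e2k1", "e3k0", "e3k1"]
full_a = gather_a([rank0_a, rank1_a])
```

For kind="A" a concatenation is enough because each expert’s rows are already adjacent. For kind="B" the slices have to be interleaved per rank component to restore the rank-major order.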

save_ep_lora_adapter

integrations.expert_parallel.shard.save_ep_lora_adapter(
    model,
    output_dir,
    ep_group,
)

Write a complete LoRA adapter when experts are EP-sharded.

The attention/router LoRA is replicated across EP, but target_parameters expert LoRA is EP-sharded (each rank holds [offset:offset+E_local]). A plain save would persist only the local rank’s experts. This gathers each adapter param to a full tensor (FSDP all-gather via full_tensor + EP all-gather for expert LoRA), renames to PEFT adapter keys, and writes adapter_model.safetensors on rank 0. Returns True if it handled the save.

save_fsdp2_lora_adapter

integrations.expert_parallel.shard.save_fsdp2_lora_adapter(model, output_dir)

Write a complete LoRA adapter under FSDP2 WITHOUT expert parallelism.

The DCP SHARDED_STATE_DICT save fails (“Failed to validate global plan”) on the frozen NVFP4 base params (torchao tensor-subclass DTensors the planner can’t validate). For a LoRA run we only need the (tiny) adapter, so gather each lora_ param to a full tensor via FSDP all-gather (DTensor.full_tensor) and write adapter_model.safetensors on rank 0. target_parameters expert LoRA lives on PEFT ParamWrappers (lora_A/lora_B submodules); gather those too and key by module name (there is no EP axis to gather here, unlike save_ep_lora_adapter).

Returns True if it handled the save (model has LoRA params), else False.

shard_expert_lora

integrations.expert_parallel.shard.shard_expert_lora(model, ep_size)

Slice PEFT target_parameters expert LoRA to each rank’s local experts.

PEFT sizes the LoRA for a 3D experts.{gate_up,down}_proj from the parameter’s own dim-0 (the global expert count) at adapter-application time, before EP’s weight slice takes effect on the parameter PEFT wrapped. Left alone, the fused EP kernel (num_experts = E_local) and the FSDP2 parametrize merge both see a full-expert LoRA against a local-expert weight -> shape mismatch. This realigns the LoRA with the EP-sharded weights (same [offset:offset+E_local] slice) and registers the 1/ep_size expert grad-scale on the new params. Idempotent.

Run AFTER PEFT applies the adapter and BEFORE FSDP wraps. Returns the count of LoRA params sliced.

shard_expert_weights

integrations.expert_parallel.shard.shard_expert_weights(model, ep_group)

Slice expert weights along dim 0 per the EP rank.

Parameters

Name	Type	Description	Default
model		A built (but not yet FSDP-wrapped) HuggingFace model.	required
ep_group		`torch.distributed.ProcessGroup` for EP, or `None` for single-rank (no-op).	required

Returns

Name	Type	Description
	int	Number of Experts modules sharded (0 if EP disabled or none found).

Raises

Name	Type	Description
	ValueError	if any Experts module’s `num_experts` is not divisible by the EP world size.
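As a rough sketch of the slice bounds and the divisibility check described above: local_expert_slice is a hypothetical helper, not part of this module, and it assumes each rank owns a contiguous block starting at ep_rank * E_local. The same [offset:offset+E_local] range would then apply to dim 0 of every expert weight.

```python
def local_expert_slice(num_experts, ep_rank, ep_size):
    """Return the dim-0 slice of experts owned by `ep_rank` (assumed contiguous)."""
    if ep_size <= 1:
        # Single rank: nothing to shard, keep every expert.
        return slice(0, num_experts)
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by EP world size {ep_size}"
        )
    e_local = num_experts // ep_size
    offset = ep_rank * e_local
    return slice(offset, offset + e_local)


# A list stands in for a weight whose dim 0 is the expert axis.
experts = [f"expert_{i}" for i in range(8)]
local = experts[local_expert_slice(8, ep_rank=1, ep_size=4)]
```

Here rank 1 of 4 keeps expert_2 and expert_3. A call such as local_expert_slice(6, 0, 4) raises ValueError, mirroring the Raises entry.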

DDP composition: the sharded params hold DIFFERENT content per rank, so we add their fully-qualified names to model._ddp_params_and_buffers_to_ignore to prevent the startup broadcast from copying rank 0’s slice everywhere. FSDP composition is handled in ExpertParallelPlugin.fully_shard_experts.
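The DDP bookkeeping amounts to appending names to a list attribute on the model before it is wrapped. The sketch below shows that step in isolation, without torch: mark_ddp_ignored and the stand-in model class are hypothetical, and the parameter names are made up for illustration. With a real model the names would come from model.named_parameters().

```python
def mark_ddp_ignored(model, sharded_param_names):
    """Append names to model._ddp_params_and_buffers_to_ignore without duplicates."""
    ignored = list(getattr(model, "_ddp_params_and_buffers_to_ignore", []))
    for name in sharded_param_names:
        if name not in ignored:
            ignored.append(name)
    model._ddp_params_and_buffers_to_ignore = ignored
    return ignored


class _StandInModel:
    """Placeholder for an nn.Module so the sketch runs without torch."""


model = _StandInModel()
# Hypothetical fully-qualified names of EP-sharded expert parameters.
names = [
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
]
mark_ddp_ignored(model, names)
mark_ddp_ignored(model, names)  # a second call adds nothing
```

Keeping the update free of duplicates means the step can be repeated safely if sharding is invoked more than once.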