integrations.liger.models.qwen3_5_moe

Liger fused linear cross entropy (FLCE) for Qwen3.5 MoE. Based on transformers v5.3.0.

Functions

Name	Description
apply_liger_kernel_to_qwen3_5_moe	Apply Liger kernels to replace original implementation in HuggingFace Qwen3.5 MoE models.
lce_forward

apply_liger_kernel_to_qwen3_5_moe

integrations.liger.models.qwen3_5_moe.apply_liger_kernel_to_qwen3_5_moe(
    cross_entropy=False,
    fused_linear_cross_entropy=False,
    rms_norm=False,
    rms_norm_gated=False,
    glu_activation=False,
    layer_norm=False,
    **kwargs,
)

Apply Liger kernels to replace original implementation in HuggingFace Qwen3.5 MoE models.

Note: `Qwen3_5MoeRMSNorm` uses a zero-initialized weight with an offset of 1.0 (like Gemma), so we use `LigerRMSNorm` with `offset=1.0` and `init_fn="zeros"`.
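The note above can be illustrated with a minimal pure-Python sketch. It assumes the usual RMSNorm formulation in which the normalized input is scaled by `offset + weight`; the helper `rms_norm` and the `eps` value are hypothetical and are not the Liger kernel itself. The point it shows is that a zero-initialized weight with offset 1.0 starts out identical to a ones-initialized weight with no offset.

```python
import math


def rms_norm(x, weight, eps=1e-6, offset=0.0):
    """RMS-normalize x, then scale each element by (offset + weight).

    Hypothetical reference helper for illustration only; eps is an assumed value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (offset + w) for v, w in zip(x, weight)]


x = [1.0, -2.0, 3.0, 0.5]

# Zero-initialized weight combined with offset 1.0 (the convention in the note).
zero_init = rms_norm(x, [0.0] * len(x), offset=1.0)

# Conventional ones-initialized weight without an offset.
ones_init = rms_norm(x, [1.0] * len(x), offset=0.0)

# Zero-initialized weight without the offset: every output is scaled by 0.
no_offset = rms_norm(x, [0.0] * len(x), offset=0.0)

print(zero_init == ones_init)
print(no_offset)
```

Dropping the offset while keeping the zero-initialized weight would scale every activation by zero, which is why the offset and the initializer have to be set together.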

Parameters

Name	Type	Description	Default
cross_entropy	bool	Whether to apply Liger’s cross entropy loss. Default is False.	`False`
fused_linear_cross_entropy	bool	Whether to apply Liger’s fused linear cross entropy loss. Default is False. `cross_entropy` and `fused_linear_cross_entropy` cannot both be True. If `fused_linear_cross_entropy` is True, the logits are not materialized, which is more memory efficient.	`False`
rms_norm	bool	Whether to apply Liger’s RMSNorm. Default is False.	`False`
rms_norm_gated	bool	Whether to apply fused RMSNorm+SiLU gate kernel for Qwen3_5MoeRMSNormGated (used in linear attention layers). Default is False.	`False`
glu_activation	bool	Whether to apply Liger’s SwiGLU MLP. Default is False.	`False`
layer_norm	bool	Whether to apply Liger’s LayerNorm. Default is False.	`False`
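Two points from the parameter table can be sketched in plain Python: the rule that `cross_entropy` and `fused_linear_cross_entropy` cannot both be True, and a rough idea of how much memory a fully materialized logits tensor would need. The helpers `check_liger_flags` and `logits_bytes` are hypothetical and not part of this module, and the sizes used are illustrative assumptions rather than values from any Qwen3.5 MoE config.

```python
def check_liger_flags(cross_entropy=False, fused_linear_cross_entropy=False, **other):
    """Hypothetical pre-check mirroring the documented constraint on the two loss flags."""
    if cross_entropy and fused_linear_cross_entropy:
        raise ValueError(
            "cross_entropy and fused_linear_cross_entropy cannot both be True"
        )
    return {
        "cross_entropy": cross_entropy,
        "fused_linear_cross_entropy": fused_linear_cross_entropy,
        **other,
    }


def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_element=4):
    """Size in bytes of a dense (batch_size, seq_len, vocab_size) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_element


flags = check_liger_flags(
    fused_linear_cross_entropy=True,
    rms_norm=True,
    rms_norm_gated=True,
    glu_activation=True,
)

# With the module importable, the validated flags could be passed along:
# apply_liger_kernel_to_qwen3_5_moe(**flags)

# Assumed sizes for illustration: 1 sequence, 8192 tokens, 150000-entry vocabulary, float32.
full_logits_gib = logits_bytes(1, 8192, 150000, 4) / 2**30
print(flags)
print(round(full_logits_gib, 2), "GiB of logits avoided in this hypothetical setting")
```

The estimate counts only the logits tensor itself; gradients and any upcast copies that a non-fused loss would also hold are not included.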

lce_forward

integrations.liger.models.qwen3_5_moe.lce_forward(
    self,
    input_ids=None,
    attention_mask=None,
    position_ids=None,
    past_key_values=None,
    inputs_embeds=None,
    labels=None,
    use_cache=None,
    output_router_logits=None,
    cache_position=None,
    logits_to_keep=0,
    **kwargs,
)

Parameters

Name	Type	Description	Default
labels	`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional	Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size - 1]` or `-100` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is computed only for tokens with labels in `[0, ..., config.vocab_size - 1]`.	`None`
logits_to_keep	`int` or `torch.Tensor`, optional	If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only the last token's logits are needed for generation, and calculating them only for that token saves memory, which is significant for long sequences or large vocabulary sizes. If a `torch.Tensor`, it must be 1D and contain the indices to keep along the sequence length dimension. This is useful when using the packed tensor format (single dimension for batch and sequence length).	`0`
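The two parameters above can be sketched with plain Python lists standing in for tensors. The helpers `keep_positions` and `masked_mean_nll` are hypothetical illustrations, not this function's implementation. The sketch assumes that an integer `logits_to_keep` is turned into a slice over the sequence dimension, and that the loss is averaged over the tokens whose label is not `-100`; the label shift used for causal language modeling is left out.

```python
import math

IGNORE_INDEX = -100


def keep_positions(sequence, logits_to_keep=0):
    """Select the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return sequence[slice(-logits_to_keep, None)]
    # Otherwise treat logits_to_keep as a 1D collection of indices.
    return [sequence[i] for i in logits_to_keep]


def masked_mean_nll(logits, labels):
    """Mean negative log-likelihood over tokens whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked token: contributes nothing to the loss
        log_z = math.log(sum(math.exp(v) for v in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


positions = ["t0", "t1", "t2", "t3"]
all_positions = keep_positions(positions, 0)
last_position = keep_positions(positions, 1)
picked = keep_positions(positions, [0, 2])

# Two tokens over a two-entry vocabulary; the first token is masked out.
loss = masked_mean_nll([[0.0, 0.0], [2.0, 0.0]], [IGNORE_INDEX, 0])
print(all_positions, last_position, picked, loss)
```

Returning `0.0` when every label is masked is a choice made for this sketch; a framework loss may behave differently in that case.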

Returns: