integrations.liger.models.llama4
Liger FLCE for llama4
Functions
| Name | Description |
|---|---|
| apply_liger_kernel_to_llama4 | Apply Liger kernels to replace the original implementation in HuggingFace Llama 4 models |
| lce_forward | |
apply_liger_kernel_to_llama4
integrations.liger.models.llama4.apply_liger_kernel_to_llama4(
cross_entropy=False,
fused_linear_cross_entropy=False,
rms_norm=False,
glu_activation=False,
layer_norm=False,
**kwargs,
)
Apply Liger kernels to replace the original implementation in HuggingFace Llama 4 models.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| cross_entropy | bool | Whether to apply Liger’s cross entropy loss. Default is False. | False |
| fused_linear_cross_entropy | bool | Whether to apply Liger’s fused linear cross entropy loss. Default is False. cross_entropy and fused_linear_cross_entropy cannot both be True. If fused_linear_cross_entropy is True, the logits are not materialized, which is more memory efficient. | False |
| rms_norm | bool | Whether to apply Liger’s RMSNorm. Default is False. | False |
| glu_activation | bool | Whether to apply Liger’s SwiGLU MLP. Default is False. | False |
| layer_norm | bool | Whether to apply Liger’s LayerNorm. Default is False. | False |
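As a rough illustration of how the flags above might be assembled, the sketch below builds a keyword dictionary with exactly one of the two loss kernels enabled. The helper `build_liger_flags` is hypothetical and not part of the module. The top-level package that contains `integrations.liger.models.llama4` is not shown on this page, so the actual call is left as a comment.

```python
def build_liger_flags(fused=True):
    """Hypothetical helper: choose one loss kernel and a few other kernels."""
    flags = {
        "cross_entropy": not fused,
        "fused_linear_cross_entropy": fused,
        "rms_norm": True,
        "glu_activation": True,
        "layer_norm": False,
    }
    # Assumption for this sketch: exactly one loss kernel is selected.
    assert flags["cross_entropy"] != flags["fused_linear_cross_entropy"]
    return flags


flags = build_liger_flags(fused=True)

# With the module importable in your environment, the call would look like:
# apply_liger_kernel_to_llama4(**flags)
```

Selecting the fused variant trades materialized logits for lower memory use, as the parameter table describes.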
lce_forward
integrations.liger.models.llama4.lce_forward(
self,
input_ids=None,
attention_mask=None,
position_ids=None,
past_key_values=None,
inputs_embeds=None,
labels=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
cache_position=None,
logits_to_keep=0,
**loss_kwargs,
)
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| labels | torch.LongTensor of shape (batch_size, sequence_length), optional | Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]. | None |
| logits_to_keep | int or torch.Tensor, optional | If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length). | 0 |
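The semantics of `labels` and `logits_to_keep` can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names `keep_positions` and `masked_cross_entropy` are hypothetical, plain lists stand in for tensors, and the label shift used for next-token prediction is omitted. Returning 0.0 when every label is masked is a choice made for this sketch only.

```python
import math

IGNORE_INDEX = -100  # label value that is excluded from the loss


def keep_positions(hidden_states, logits_to_keep=0):
    """Select the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden_states[slice(-logits_to_keep, None)]
    # Otherwise treat it as a 1D collection of indices to keep.
    return [hidden_states[i] for i in logits_to_keep]


def masked_cross_entropy(logits, labels):
    """Mean cross entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


positions = ["t0", "t1", "t2", "t3"]
all_kept = keep_positions(positions, 0)       # every position
last_only = keep_positions(positions, 1)      # only the final position
picked = keep_positions(positions, [0, 2])    # explicit indices

# The second position is masked, so only the first one is averaged.
loss = masked_cross_entropy([[0.0, 0.0], [5.0, 0.0]], [0, IGNORE_INDEX])
```

During generation, keeping only the last position avoids computing logits for the whole sequence, which is where the memory saving described in the table comes from.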
Returns: