monkeypatch.models.qwen3_5.modeling
Monkeypatch for Qwen3_5 and Qwen3_5Moe models to pass position_ids to linear attention.
Functions
| Name | Description |
|---|---|
| get_cu_seqlens | Compute cumulative sequence lengths from position_ids for FLA varlen kernels. |
| patch_qwen3_5_vlm_flash_attention | Patch _is_packed_sequence to handle Qwen3.5’s 3-D MRoPE position_ids. |
get_cu_seqlens
monkeypatch.models.qwen3_5.modeling.get_cu_seqlens(position_ids)
Compute cumulative sequence lengths from position_ids for FLA varlen kernels.
Adapted from transformers.modeling_flash_attention_utils.prepare_fa_kwargs_from_position_ids. https://github.com/huggingface/transformers/blob/0f1b128d3359a26bd18be99c26d7f04fb3cba914/src/transformers/modeling_flash_attention_utils.py#L316
Qwen3.5 uses MRoPE: position_ids arrive as [axes, B, T]. All axes carry the same temporal positions, so axis 0 is used to recover the [B, T] layout. See: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_5/modeling_qwen3_5.py
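The idea above can be illustrated with a minimal pure-Python sketch. It is not the module's implementation, which operates on torch tensors. The function name `cu_seqlens_sketch` is hypothetical, nested lists stand in for tensors, and the sketch assumes every packed sequence restarts its positions at 0.

```python
def cu_seqlens_sketch(position_ids):
    """Illustrative only: nested lists shaped [axes, B, T] or [B, T]."""
    # 3-D (MRoPE) input: keep axis 0 to recover the [B, T] layout.
    if position_ids and isinstance(position_ids[0][0], list):
        position_ids = position_ids[0]
    # Flatten [B, T] into one token stream, row after row.
    flat = [p for row in position_ids for p in row]
    # Assumption: a new sequence starts wherever the position resets to 0.
    starts = [i for i, p in enumerate(flat) if p == 0]
    # Cumulative lengths are the start offsets plus the total token count.
    return starts + [len(flat)]


# Three sequences of lengths 3, 2 and 4 packed into one row.
packed = [[0, 1, 2, 0, 1, 0, 1, 2, 3]]
print(cu_seqlens_sketch(packed))        # [0, 3, 5, 9]
# The same layout repeated on three MRoPE axes gives the same result.
print(cu_seqlens_sketch([packed[0:1]] * 3))  # [0, 3, 5, 9]
```

Consecutive differences of the result give the individual sequence lengths, which is the form varlen kernels typically expect.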
patch_qwen3_5_vlm_flash_attention
monkeypatch.models.qwen3_5.modeling.patch_qwen3_5_vlm_flash_attention()
Patch _is_packed_sequence to handle Qwen3.5’s 3-D MRoPE position_ids.
transformers passes position_ids to decoder layers as a 3-D tensor of shape [axes, B, T], but _is_packed_sequence only handles 2-D tensors. It mis-classifies the 3-D shape as a packed-sequence indicator, which causes CUDA errors in the varlen path.
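One way such a patch could be structured is a wrapper that reduces 3-D input to 2-D before delegating to the original check. The sketch below is an assumption about the general shape of the fix, not the code this function installs. `wrap_for_mrope`, `nesting_depth`, and `toy_is_packed` are hypothetical names, and nested lists stand in for tensors.

```python
import functools


def nesting_depth(x):
    """Number of list levels, a stand-in for a tensor's ndim."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        if not x:
            break
        x = x[0]
    return depth


def wrap_for_mrope(check_2d):
    """Return a version of check_2d that also accepts [axes, B, T] input."""
    @functools.wraps(check_2d)
    def wrapper(position_ids, *args, **kwargs):
        if nesting_depth(position_ids) == 3:
            # All axes carry the same temporal positions, so keep axis 0.
            position_ids = position_ids[0]
        return check_2d(position_ids, *args, **kwargs)
    return wrapper


def toy_is_packed(position_ids):
    """Toy 2-D check: one row whose positions restart at 0 after index 0."""
    return len(position_ids) == 1 and 0 in position_ids[0][1:]


patched = wrap_for_mrope(toy_is_packed)
```

A real monkeypatch would then rebind the library attribute to the wrapped function, so that existing call sites pick up the 3-D handling without further changes.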