monkeypatch.models.qwen3_5.modeling

Monkeypatch for Qwen3_5 and Qwen3_5Moe models to pass position_ids to linear attention.

Functions

Name Description
get_cu_seqlens Compute cumulative sequence lengths from position_ids for FLA varlen kernels.
patch_qwen3_5_vlm_flash_attention Patch _is_packed_sequence to handle Qwen3.5’s 3-D MRoPE position_ids.

get_cu_seqlens

monkeypatch.models.qwen3_5.modeling.get_cu_seqlens(position_ids)

Compute cumulative sequence lengths from position_ids for FLA varlen kernels.

Adapted from transformers.modeling_flash_attention_utils.prepare_fa_kwargs_from_position_ids: https://github.com/huggingface/transformers/blob/0f1b128d3359a26bd18be99c26d7f04fb3cba914/src/transformers/modeling_flash_attention_utils.py#L316

Qwen3.5 uses MRoPE: position_ids arrive as [axes, B, T]. All axes carry the same temporal positions, so axis 0 is used to recover the [B, T] layout. See: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_5/modeling_qwen3_5.py
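The idea can be illustrated with a minimal, framework-free sketch. It is not the module's code: the real function works on tensors, while this version uses nested Python lists, and the name cu_seqlens_from_position_ids is hypothetical. It assumes every packed sub-sequence restarts its positions at 0, so each 0 in the flattened positions marks a sequence boundary.

```python
def cu_seqlens_from_position_ids(position_ids):
    """Return cumulative sequence lengths for packed position ids.

    position_ids is a nested list shaped [B, T] or [axes, B, T]
    (hypothetical list-based stand-in for a tensor).
    """
    # MRoPE-style 3-D input: keep axis 0 to recover the [B, T] layout.
    is_3d = (
        bool(position_ids)
        and bool(position_ids[0])
        and isinstance(position_ids[0][0], list)
    )
    if is_3d:
        position_ids = position_ids[0]

    # Flatten [B, T] into a single token stream.
    flat = [p for row in position_ids for p in row]

    # Assumption: each sub-sequence restarts at position 0,
    # so every 0 is the start offset of a new sequence.
    starts = [i for i, p in enumerate(flat) if p == 0]

    # The final boundary is the total token count.
    return starts + [len(flat)]


# Three packed sequences of lengths 3, 2, and 4 in one row.
packed = [[0, 1, 2, 0, 1, 0, 1, 2, 3]]
print(cu_seqlens_from_position_ids(packed))
# The same positions replicated over three MRoPE axes give the same result.
print(cu_seqlens_from_position_ids([packed, packed, packed]))
```

Differences between consecutive entries of the result are the individual sequence lengths, which is the form varlen kernels typically expect.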

patch_qwen3_5_vlm_flash_attention

monkeypatch.models.qwen3_5.modeling.patch_qwen3_5_vlm_flash_attention()

Patch _is_packed_sequence to handle Qwen3.5’s 3-D MRoPE position_ids.

transformers passes position_ids to decoder layers as a 3-D tensor of shape [axes, B, T], but _is_packed_sequence assumes a 2-D [B, T] tensor. Given the 3-D shape, it misclassifies the input as a packed sequence, which causes CUDA errors in the varlen path.
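One way such a patch could be structured is sketched below. Everything here is a stand-in: fa_utils is a hypothetical namespace playing the role of the library module, _toy_is_packed_sequence is a simplified list-based check rather than the transformers implementation, and the (position_ids, batch_size) signature is an assumption. The module's actual patch may differ.

```python
from types import SimpleNamespace


def _toy_is_packed_sequence(position_ids, batch_size):
    """Toy 2-D check: one row is 'packed' if it is not a single contiguous ramp."""
    if position_ids is None:
        return False
    row = position_ids[0]
    ramp = [row[0] + i for i in range(len(row))]
    return batch_size == 1 and row != ramp


# Hypothetical stand-in for the module that owns the function being patched.
fa_utils = SimpleNamespace(_is_packed_sequence=_toy_is_packed_sequence)


def patch_is_packed_sequence(module):
    """Wrap module._is_packed_sequence so [axes, B, T] input is reduced to [B, T]."""
    original = module._is_packed_sequence
    if getattr(original, "_handles_3d", False):
        return  # Already patched; avoid wrapping twice.

    def patched(position_ids, batch_size):
        is_3d = (
            position_ids is not None
            and bool(position_ids)
            and isinstance(position_ids[0], list)
            and bool(position_ids[0])
            and isinstance(position_ids[0][0], list)
        )
        if is_3d:
            # Keep axis 0, which carries the [B, T] layout.
            position_ids = position_ids[0]
        return original(position_ids, batch_size)

    patched._handles_3d = True
    module._is_packed_sequence = patched


patch_is_packed_sequence(fa_utils)

unpacked = [[0, 1, 2, 3]]      # [B=1, T=4], one ordinary sequence
packed = [[0, 1, 2, 0, 1]]     # [B=1, T=5], two packed sequences
print(fa_utils._is_packed_sequence([unpacked] * 3, 1))
print(fa_utils._is_packed_sequence([packed] * 3, 1))
```

The marker attribute makes the patch idempotent, so calling it more than once does not stack wrappers.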