# MultiModal / Vision Language Models (BETA)
## Supported Models

- Mllama
- Llama4
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Magistral-Small-2509
- Voxtral
- Gemma-3
- Gemma-3n
- Qwen2-VL
- Qwen2.5-VL
- Qwen3-VL
- SmolVLM2
- LFM2-VL
## Usage
Multimodal support is limited and does not yet have full feature parity with text-only training.

Here are the hyperparameters you will need to finetune a multimodal model:
```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true

remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the model-specific sections below for the value to use

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]

# (optional) if doing lora, only finetune the language model and
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Please see the examples folder for full configs.
Some of our `chat_template`s have been extended to support broader dataset types. This should not break any existing configs.

As of now, we do not truncate or drop samples based on `sequence_len`, as each architecture has a different way of processing non-text tokens. We are looking for help on this.
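As a quick sanity check before training, you can load a slice of the example dataset above and confirm it has the expected columns. This is a minimal sketch that assumes the Hugging Face `datasets` library is installed; the expected column names are based on the dataset card and may differ for your own data.

```python
# Minimal sketch: inspect the example dataset referenced in the config above.
# Assumes the Hugging Face `datasets` library is installed.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train[:1%]")
print(ds.column_names)       # expected to include "messages" and "images"
print(ds[0]["messages"][0])  # first turn of the first conversation
```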
### Mllama

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision
```

### Llama4

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
chat_template: llama4
```

### Pixtral

```yaml
base_model: mistralai/Pixtral-12B-2409
chat_template: pixtral
```

### Llava-1.5

```yaml
base_model: llava-hf/llava-1.5-7b-hf
chat_template: llava
```

### Mistral-Small-3.1

Please make sure to install the vision lib via `pip install 'mistral-common[opencv]==1.8.5'`.

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
```

### Magistral-Small-2509

Please make sure to install the vision lib via `pip install 'mistral-common[opencv]==1.8.5'`.

```yaml
base_model: mistralai/Magistral-Small-2509
```

### Voxtral

Please make sure to install the audio libs via `pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'`.

```yaml
base_model: mistralai/Voxtral-Mini-3B-2507
```

### Gemma-3

The Gemma3-1B model is a text-only model, so please train it as a regular text model.

For the multi-modal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it
chat_template: gemma3
```

### Gemma-3n

The model's initial loss and grad norm will be very high. We suspect this is due to the Conv in the vision layers.

Please make sure to install timm via `pip3 install timm==1.0.17`.

```yaml
base_model: google/gemma-3n-E2B-it
chat_template: gemma3n
```

### Qwen2-VL

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl
```

### Qwen2.5-VL

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```

### Qwen3-VL

```yaml
base_model: Qwen/Qwen3-VL-4B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```

### SmolVLM2

Please make sure to install num2words via `pip3 install num2words==0.5.14`.

```yaml
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
```

### LFM2-VL

Please uninstall causal-conv1d via `pip3 uninstall -y causal-conv1d`.

```yaml
base_model: LiquidAI/LFM2-VL-450M
```

## Dataset Format
For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Message format.
- A message is a list of `role` and `content`.
    - `role` can be `system`, `user`, `assistant`, etc.
    - `content` is a list of `type` and (`text`, `image`, `path`, `url`, `base64`, or `audio`).
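If you are preparing your own data, one option is to write conversations in this format to a local JSONL file. The sketch below is illustrative only; the file name, image path, and text are placeholders, and the full JSON layout is shown in the Example section further down.

```python
# Illustrative sketch: build one conversation in the extended chat_template
# format and append it to a local JSONL file. All values are placeholders.
import json

sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": "/path/to/image.jpg"},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A placeholder answer."}],
        },
    ]
}

with open("my_multimodal_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")
```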
### Image
For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, it will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` but no `image` key, the `image` key will be set on it.
- If `content` is a string, it will be converted to a list with `type` as `text`.
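To make these rules concrete, here is an illustrative sketch of the conversion they describe. This is not Axolotl's actual preprocessing code, just a simplified rendering under the assumption of a `messages` column plus an optional `image`/`images` column.

```python
# Illustrative only: a simplified rendering of the backwards-compatibility
# rules above, not Axolotl's actual preprocessing code.
def normalize_row(row: dict) -> dict:
    # String content becomes a list with a single text item.
    for message in row["messages"]:
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]

    images = row.get("images") or row.get("image") or []
    if images:
        first_content = row["messages"][0]["content"]
        placeholders = [
            item for item in first_content
            if item.get("type") == "image" and "image" not in item
        ]
        if placeholders:
            # Existing {"type": "image"} items without an image key get one set.
            for item, img in zip(placeholders, images):
                item["image"] = img
        else:
            # Otherwise the images are appended to the first content list.
            first_content.extend({"type": "image", "image": img} for img in images)
    return row
```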
For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`
### Audio
For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:

- `"path": "/path/to/audio.mp3"`
- `"url": "https://example.com/audio.mp3"`
- `"audio": np.ndarray`
You may need to install librosa via `pip3 install librosa==0.11.0`.
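Similarly, the `"audio": np.ndarray` variant expects an in-memory waveform. A minimal sketch using librosa; the path and sampling rate are placeholders, and the sampling rate should match what your model's processor expects.

```python
# Minimal sketch: load a waveform with librosa and pass it in a content item.
# The path and sampling rate are placeholders.
import librosa

waveform, sample_rate = librosa.load("/path/to/audio.mp3", sr=16000)
content_item = {"type": "audio", "audio": waveform}
```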
### Video
This is not well tested at the moment. We welcome contributors!
For video loading, you can use the following keys within `content` alongside `"type": "video"`:

- `"path": "/path/to/video.mp4"`
- `"url": "https://example.com/video.mp4"`
- `"video": np.ndarray | list[PIL.Image.Image] | torch.Tensor` (or a list of the aforementioned)
### Example
Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```

## FAQ
**`PIL.UnidentifiedImageError: cannot identify image file ...`**

PIL could not retrieve the file at the given `url` using `requests`. Please check the URL for typos. Alternatively, the request may have been blocked by the server.
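One way to narrow this down is to fetch the URL yourself and try to open the returned bytes with PIL. A minimal sketch, with a placeholder URL:

```python
# Minimal sketch: check whether the URL returns valid image bytes.
# The URL is a placeholder.
import io

import requests
from PIL import Image

url = "https://example.com/image.jpg"
resp = requests.get(url, timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))
Image.open(io.BytesIO(resp.content)).verify()  # raises if the bytes are not a valid image
```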