MultiModal / Vision Language Models (BETA)

Supported Models

  • Mllama (Llama-3.2 Vision)
  • Llama4
  • Pixtral
  • Llava-1.5
  • Mistral-Small-3.1
  • Gemma-3 (4B/12B/27B)
  • Qwen2-VL
  • Qwen2.5-VL

See the per-model sections below for the base_model and chat_template to use with each.

Usage

Multimodal support is limited and does not yet have full feature parity with text-only training.

Here are the hyperparameters you'll need to set to finetune a multimodal model.

processor_type: AutoProcessor

skip_prepare_dataset: true
remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the next section

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing lora, only finetune the language model and
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
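
Putting these options together, a minimal LoRA fine-tune might look like the sketch below. The base model and chat template are the Mllama values from the sections that follow; the sequence length, LoRA rank, learning rate, and batch settings are illustrative assumptions rather than tuned values.

base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision

processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

adapter: lora
lora_r: 16  # illustrative; tune for your task
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

sequence_len: 4096  # illustrative
micro_batch_size: 1  # illustrative; increase if memory allows
gradient_accumulation_steps: 4
learning_rate: 2e-5
num_epochs: 1
output_dir: ./outputs/llama-3.2-vision-lora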

Please see the examples folder for full configs.

Warning

Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.

Mllama

base_model: meta-llama/Llama-3.2-11B-Vision-Instruct

chat_template: llama3_2_vision

Llama4

base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct

chat_template: llama4

Pixtral

base_model: mistralai/Pixtral-12B-2409

chat_template: pixtral

Llava-1.5

base_model: llava-hf/llava-1.5-7b-hf

chat_template: llava

Mistral-Small-3.1

base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503

chat_template: mistral_v7_tekken

Gemma-3

Tip

The Gemma3-1B model is text-only, so please train it as a regular text model.

For multi-modal 4B/12B/27B models, use the following config:

base_model: google/gemma-3-4b-it

chat_template: gemma3

Qwen2-VL

base_model: Qwen/Qwen2-VL-7B-Instruct

chat_template: qwen2_vl

Qwen2.5-VL

base_model: Qwen/Qwen2.5-VL-7B-Instruct

chat_template: qwen2_vl  # same as qwen2-vl

Dataset Format

For multi-modal datasets, we adopt an extended chat_template format similar to OpenAI’s Message format.

  • Each message is an object with a role and content.
  • role can be system, user, assistant, etc.
  • content is a list of objects, each with a type and one of text, image, path, url, or base64.
Note

For backwards compatibility:

  • If the dataset has an images or image column of list[Image], each image will be appended to the first content list as {"type": "image", "image": ...}. However, if that content already has a {"type": "image"} entry without an image key, the image will be assigned to that entry's image key instead (see the example after this list).
  • If content is a string, it will be converted to a list containing a single {"type": "text", ...} entry.
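
As a sketch of these rules, a legacy-style row with string content and a separate images column (PIL.Image here is pseudo-notation for an actual image object):

{
  "messages": [
    {"role": "user", "content": "Describe this image in detail."},
    {"role": "assistant", "content": "The image is a bee."}
  ],
  "images": [PIL.Image]
}

is handled roughly as if it had been written as:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in detail."},
        {"type": "image", "image": PIL.Image}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "The image is a bee."}
      ]
    }
  ]
}
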
Tip

For image loading, you can use the following keys within content alongside "type": "image":

  • "path": "/path/to/image.jpg"
  • "url": "https://example.com/image.jpg"
  • "base64": "..."
  • "image": PIL.Image

Here is an example of a multi-modal dataset:

[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
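
To train on a local file in this format, you could point the datasets entry at it instead of a Hub dataset. A minimal sketch, assuming local JSON loading behaves the same as for text-only datasets (the file name is a placeholder):

datasets:
  - path: ./multimodal_train.json
    ds_type: json
    type: chat_template
    field_messages: messages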