Multimodal / Vision Language Models (BETA)
Supported Models

The models covered in the sections below are supported:

- Mllama
- Llama4
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Gemma-3
- Qwen2-VL
- Qwen2.5-VL
Usage
Multimodal support is in beta: it is limited and does not yet have full feature parity with text-only training.
Below are the hyperparameters you will need in order to fine-tune a multimodal model.
```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true

remove_unused_columns: false  # leave columns in place; they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the per-model sections below

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

# (optional) if doing LoRA, only fine-tune the language model and
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) resize images to a fixed size
image_size: 512
image_resize_algorithm: bilinear
```
Please see the examples folder for full configs.
Warning
Some of our chat_templates have been extended to support broader dataset types. This should not break any existing configs.
Mllama
```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision
```
Llama4
```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
chat_template: llama4
```
Pixtral
```yaml
base_model: mistralai/Pixtral-12B-2409
chat_template: pixtral
```
Llava-1.5
```yaml
base_model: llava-hf/llava-1.5-7b-hf
chat_template: llava
```
Mistral-Small-3.1
```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
chat_template: mistral_v7_tekken
```
Gemma-3
Tip
The Gemma-3 1B model is text-only, so train it as a regular text model.
For the multimodal 4B/12B/27B models, use the following config:
```yaml
base_model: google/gemma-3-4b-it
chat_template: gemma3
```
Qwen2-VL
```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl
```
Qwen2.5-VL
```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl  # same as Qwen2-VL
```
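Putting the shared hyperparameters together with one of the model entries above, a minimal LoRA config might look like the sketch below. This uses Qwen2-VL as an example; the LoRA rank, batch size, learning rate, and other trainer settings are illustrative assumptions rather than recommended values.

```yaml
# Minimal sketch of a multimodal LoRA config (trainer values are illustrative assumptions)
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl

processor_type: AutoProcessor
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
    field_messages: messages

adapter: lora
lora_r: 16          # assumed value
lora_alpha: 32      # assumed value
lora_dropout: 0.05  # assumed value
# regex from the shared example above; adjust to the module names of your base model
lora_target_modules: 'language_model.model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

micro_batch_size: 1              # assumed value
gradient_accumulation_steps: 4   # assumed value
num_epochs: 1                    # assumed value
learning_rate: 2e-5              # assumed value
optimizer: adamw_torch           # assumed value
output_dir: ./outputs/qwen2-vl-lora
```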
Dataset Format
For multimodal datasets, we adopt an extended `chat_template` format similar to OpenAI's Messages format.
- A message consists of a `role` and `content`.
  - `role` can be `system`, `user`, `assistant`, etc.
  - `content` is a list of entries, each with a `type` and one of `text`, `image`, `path`, `url`, or `base64`.
Note
For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, the images will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if that content already has a `{"type": "image"}` entry without an `image` key, the `image` key will be set on that entry instead.
- If `content` is a string, it will be converted to a list with `type` set to `text`.
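As a rough illustration of these rules (the field values here are made up, and the exact position of the appended image entry may differ), a legacy row like this:

```json
{
  "messages": [
    {"role": "user", "content": "Describe this image."}
  ],
  "images": ["<PIL.Image object>"]
}
```

would be normalized to something like:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "image": "<PIL.Image object>"}
      ]
    }
  ]
}
```

Note that this also shows the second rule in action: the string `content` becomes a single `{"type": "text", ...}` entry.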
Tip
For image loading, you can use the following keys within a `content` entry alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`
Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```
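To train on a local file in this format, the `datasets` entry in your config would look roughly like the sketch below. The filename is a placeholder; `type: chat_template` and `field_messages: messages` mirror the example config in the Usage section.

```yaml
datasets:
  - path: ./data/multimodal_train.jsonl  # placeholder: a local JSON/JSONL file of rows like the example above
    ds_type: json                        # may be needed for local files, depending on your setup
    type: chat_template
    field_messages: messages
```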