# MultiModal / Vision Language Models (BETA)
## Supported Models

- Mllama
- Llama4
- Pixtral
- Llava-1.5
- Mistral-Small-3.1
- Magistral-Small-2509
- Voxtral
- Gemma-3
- Gemma-3n
- Qwen2-VL
- Qwen2.5-VL
- Qwen3-VL
- SmolVLM2
- LFM2-VL
## Usage
Multimodal support is limited and does not yet have full feature parity with text-only training.

Here are the hyperparameters you will need to finetune a multimodal model:
```yaml
processor_type: AutoProcessor
skip_prepare_dataset: true

remove_unused_columns: false  # leave columns in place as they are needed to handle image embeddings during training
sample_packing: false  # not yet supported with multimodal

chat_template:  # see the model-specific sections below for the value to use

# example dataset
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]

# (optional) if doing lora, only finetune the language model and
# leave the vision model and vision tower frozen
# load_in_8bit: true
adapter: lora
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

# (optional) if you want to resize images to a set size
image_size: 512
image_resize_algorithm: bilinear
```

Please see the examples folder for full configs.
Some of our `chat_template`s have been extended to support broader dataset types. This should not break any existing configs.

As of now, we do not truncate or drop samples based on `sequence_len`, as each architecture has a different way of processing non-text tokens. We are looking for help on this.
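As a quick sanity check before training, you can load a slice of the example dataset above and confirm it has the expected columns. This is a minimal sketch that assumes the Hugging Face `datasets` library is installed; the expected column names are based on the dataset card and may differ for your own data.

```python
# Minimal sketch: inspect the example dataset referenced in the config above.
# Assumes the Hugging Face `datasets` library is installed.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train[:1%]")
print(ds.column_names)       # expected to include "messages" and "images"
print(ds[0]["messages"][0])  # first turn of the first conversation
```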
### Mllama

```yaml
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
chat_template: llama3_2_vision
```

### Llama4

```yaml
base_model: meta-llama/Llama-4-Scout-17B-16E-Instruct
chat_template: llama4
```

### Pixtral

```yaml
base_model: mistralai/Pixtral-12B-2409
chat_template: pixtral
```

### Llava-1.5

```yaml
base_model: llava-hf/llava-1.5-7b-hf
chat_template: llava
```

### Mistral-Small-3.1

Please make sure to install the vision lib via `pip install 'mistral-common[opencv]==1.8.5'`.

```yaml
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503
```

### Magistral-Small-2509

Please make sure to install the vision lib via `pip install 'mistral-common[opencv]==1.8.5'`.

```yaml
base_model: mistralai/Magistral-Small-2509
```

### Voxtral

Please make sure to install the audio libs via `pip3 install librosa==0.11.0 'mistral_common[audio]==1.8.3'`.

```yaml
base_model: mistralai/Voxtral-Mini-3B-2507
```

### Gemma-3

The Gemma3-1B model is a text-only model, so please train it as a regular text model.

For the multi-modal 4B/12B/27B models, use the following config:

```yaml
base_model: google/gemma-3-4b-it
chat_template: gemma3
```

### Gemma-3n

The model's initial loss and grad norm will be very high. We suspect this is due to the Conv in the vision layers.

Please make sure to install timm via `pip3 install timm==1.0.17`.

```yaml
base_model: google/gemma-3n-E2B-it
chat_template: gemma3n
```

### Qwen2-VL

```yaml
base_model: Qwen/Qwen2-VL-7B-Instruct
chat_template: qwen2_vl
```

### Qwen2.5-VL

```yaml
base_model: Qwen/Qwen2.5-VL-7B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```

### Qwen3-VL

```yaml
base_model: Qwen/Qwen3-VL-4B-Instruct
chat_template: qwen2_vl  # same as qwen2-vl
```

### SmolVLM2

Please make sure to install num2words via `pip3 install num2words==0.5.14`.

```yaml
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
```

### LFM2-VL

Please uninstall causal-conv1d via `pip3 uninstall -y causal-conv1d`.

```yaml
base_model: LiquidAI/LFM2-VL-450M
```

## Dataset Format
For multi-modal datasets, we adopt an extended `chat_template` format similar to OpenAI's Message format.
- A message is a list of `role` and `content`.
    - `role` can be `system`, `user`, `assistant`, etc.
    - `content` is a list of `type` and (`text`, `image`, `path`, `url`, `base64`, or `audio`).
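If you are preparing your own data, one option is to write conversations in this format to a local JSONL file. The sketch below is illustrative only; the file name, image path, and text are placeholders, and the full JSON layout is shown in the Example section further down.

```python
# Illustrative sketch: build one conversation in the extended chat_template
# format and append it to a local JSONL file. All values are placeholders.
import json

sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": "/path/to/image.jpg"},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A placeholder answer."}],
        },
    ]
}

with open("my_multimodal_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")
```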
### Image
For backwards compatibility:

- If the dataset has an `images` or `image` column of `list[Image]`, it will be appended to the first `content` list as `{"type": "image", "image": ...}`. However, if the content already has a `{"type": "image"}` but no `image` key, the `image` key will be set on it.
- If `content` is a string, it will be converted to a list with `type` as `text`.
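To make these rules concrete, here is an illustrative sketch of the conversion they describe. This is not Axolotl's actual preprocessing code, just a simplified rendering under the assumption of a `messages` column plus an optional `image`/`images` column.

```python
# Illustrative only: a simplified rendering of the backwards-compatibility
# rules above, not Axolotl's actual preprocessing code.
def normalize_row(row: dict) -> dict:
    # String content becomes a list with a single text item.
    for message in row["messages"]:
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]

    images = row.get("images") or row.get("image") or []
    if images:
        first_content = row["messages"][0]["content"]
        placeholders = [
            item for item in first_content
            if item.get("type") == "image" and "image" not in item
        ]
        if placeholders:
            # Existing {"type": "image"} items without an image key get one set.
            for item, img in zip(placeholders, images):
                item["image"] = img
        else:
            # Otherwise the images are appended to the first content list.
            first_content.extend({"type": "image", "image": img} for img in images)
    return row
```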
For image loading, you can use the following keys within `content` alongside `"type": "image"`:

- `"path": "/path/to/image.jpg"`
- `"url": "https://example.com/image.jpg"`
- `"base64": "..."`
- `"image": PIL.Image`
### Audio
For audio loading, you can use the following keys within `content` alongside `"type": "audio"`:

- `"path": "/path/to/audio.mp3"`
- `"url": "https://example.com/audio.mp3"`
- `"audio": np.ndarray`
You may need to install librosa via `pip3 install librosa==0.11.0`.
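Similarly, the `"audio": np.ndarray` variant expects an in-memory waveform. A minimal sketch using librosa; the path and sampling rate are placeholders, and the sampling rate should match what your model's processor expects.

```python
# Minimal sketch: load a waveform with librosa and pass it in a content item.
# The path and sampling rate are placeholders.
import librosa

waveform, sample_rate = librosa.load("/path/to/audio.mp3", sr=16000)
content_item = {"type": "audio", "audio": waveform}
```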
### Video
This is not well tested at the moment. We welcome contributors!
For video loading, you can use the following keys within `content` alongside `"type": "video"`:

- `"path": "/path/to/video.mp4"`
- `"url": "https://example.com/video.mp4"`
- `"video": np.ndarray | list[PIL.Image.Image] | torch.Tensor` (or a list of the aforementioned)
### Example
Here is an example of a multi-modal dataset:
```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful assistant."}
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
          {"type": "text", "text": "Describe this image in detail."}
        ]
      },
      {
        "role": "assistant",
        "content": [
          {"type": "text", "text": "The image is a bee."}
        ]
      }
    ]
  }
]
```

## FAQ
**`PIL.UnidentifiedImageError: cannot identify image file ...`**

PIL could not retrieve the file at the given `url` using `requests`. Please check the URL for typos. Alternatively, the request may have been blocked by the server.
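One way to narrow this down is to fetch the URL yourself and try to open the returned bytes with PIL. A minimal sketch, with a placeholder URL:

```python
# Minimal sketch: check whether the URL returns valid image bytes.
# The URL is a placeholder.
import io

import requests
from PIL import Image

url = "https://example.com/image.jpg"
resp = requests.get(url, timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))
Image.open(io.BytesIO(resp.content)).verify()  # raises if the bytes are not a valid image
```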