Mistral Small 3.1/3.2
This guide covers fine-tuning Mistral Small 3.1 and Mistral Small 3.2 with vision capabilities using Axolotl.
Prerequisites
Before starting, ensure you have:
- Installed Axolotl (see Installation docs)
Getting Started
Install the required vision library:

```bash
pip install 'mistral-common[opencv]==1.8.5'
```

Download the example dataset image:

```bash
wget https://huggingface.co/datasets/Nanobit/text-vision-2k-test/resolve/main/African_elephant.jpg
```

Run the fine-tuning:

```bash
axolotl train examples/mistral/mistral-small/mistral-small-3.1-24B-lora.yml
```
This config uses about 29.4 GiB of VRAM.
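For orientation, the VRAM-relevant knobs in a LoRA config of this kind are sketched below. This is illustrative only: the authoritative values are in the example config above, and the model ID and numbers here are assumptions.

```yaml
# Illustrative sketch -- see the shipped example config for the actual values.
base_model: mistralai/Mistral-Small-3.1-24B-Instruct-2503  # assumed model ID

adapter: lora                  # LoRA trains far fewer parameters than full FT
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05

sequence_len: 2048             # shorter sequences cut activation memory
micro_batch_size: 1            # prefer raising gradient_accumulation_steps
gradient_accumulation_steps: 4
gradient_checkpointing: true   # trades recompute for a large VRAM saving
```

Lowering `micro_batch_size`, `sequence_len`, or `lora_r` reduces memory further, at the cost of throughput or adapter capacity.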
Dataset Format
The vision model requires the multi-modal dataset format documented here.
One exception: passing `"image": PIL.Image` is not supported, as `MistralTokenizer` currently only accepts `path`, `url`, and `base64`.
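If your images are already in memory as `PIL.Image` objects, one workaround is to serialize them to base64 before writing the dataset. Below is a minimal sketch, assuming the image entry accepts a `base64` key corresponding to the options listed above; the helper name `pil_to_base64` is our own:

```python
import base64
import io

from PIL import Image


def pil_to_base64(image: Image.Image, fmt: str = "JPEG") -> str:
    """Serialize a PIL image to a base64 string, since PIL.Image
    objects cannot be passed to MistralTokenizer directly."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


# Assumed usage: replace the "path" entry with a "base64" one.
image_chunk = {
    "type": "image",
    "base64": pil_to_base64(Image.open("African_elephant.jpg")),
}
```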
Example:
```json
{
  "messages": [
    {"role": "system", "content": [{"type": "text", "text": "{SYSTEM_PROMPT}"}]},
    {"role": "user", "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image", "path": "path/to/image.jpg"}
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]}
  ]
}
```

Limitations
- Sample packing is not currently supported for multi-modal training.