Gemma 3n

Gemma 3n is a family of multimodal models from Google, available on Hugging Face. This guide shows how to fine-tune it with Axolotl.

Getting started

  1. Install Axolotl following the installation guide.

    Here is an example of how to install via pip:

# Ensure you have PyTorch installed (PyTorch 2.6.0 minimum)
pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation 'axolotl[flash-attn]>=0.12.0'
  2. In addition to Axolotl’s requirements, Gemma 3n requires:
pip3 install timm==1.0.17

# for loading audio data
pip3 install librosa==0.11.0
  3. Download the sample dataset files:
# for text + vision + audio only
wget https://huggingface.co/datasets/Nanobit/text-vision-audio-2k-test/resolve/main/African_elephant.jpg
wget https://huggingface.co/datasets/Nanobit/text-vision-audio-2k-test/resolve/main/En-us-African_elephant.oga
  4. Run the finetuning example (a sketch of what these configs contain follows the commands):
# text only
axolotl train examples/gemma3n/gemma-3n-e2b-qlora.yml

# text + vision
axolotl train examples/gemma3n/gemma-3n-e2b-vision-qlora.yml

# text + vision + audio
axolotl train examples/gemma3n/gemma-3n-e2b-vision-audio-qlora.yml
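
For orientation, these example configs pair a chat-template dataset with QLoRA settings. The sketch below only illustrates the kind of keys involved; the values are placeholders, not copied from the real files, so treat the configs under examples/gemma3n/ as the source of truth:

# Rough sketch of a Gemma 3n QLoRA config (illustrative values only;
# see the maintained configs under examples/gemma3n/)
base_model: google/gemma-3n-E2B-it  # assumed Hugging Face model id

load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_linear: true

datasets:
  - path: your/dataset   # placeholder; see the dataset docs below
    type: chat_template

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0002
output_dir: ./outputs/gemma-3n-qlora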

Let us know how it goes. Happy finetuning! 🚀

WARNING: The loss and grad norm will be much higher than normal. We suspect this is inherent to the model at the moment. If anyone would like to submit a fix, we are happy to take a look.

Tips

  • You can run a full finetuning by removing adapter: qlora and load_in_4bit: true from the config (see the sketch after this list).
  • Read more on how to load your own dataset in the docs.
  • The text dataset format follows the OpenAI Messages format as seen here.
  • The multimodal dataset format follows the OpenAI multi-content Messages format as seen here (a sketch follows this list).
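
For the full-finetune tip above, the change is simply deleting two keys; no replacement values are needed. A sketch, using the same key names as the config sketch earlier:

# QLoRA, as shipped in the example configs:
adapter: qlora
load_in_4bit: true

# Full finetuning: delete both keys entirely. With neither set,
# Axolotl trains all model weights instead of a 4-bit LoRA adapter.

Note that full finetuning requires substantially more GPU memory than QLoRA.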
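
To make the multimodal format concrete, here is a hypothetical single record pairing the files downloaded in step 3. It is shown as YAML for readability, and field names such as path are illustrative; the linked examples define the exact schema, not this sketch:

# One hypothetical multimodal training record (sketch only):
messages:
  - role: user
    content:
      - type: image
        path: African_elephant.jpg        # image downloaded in step 3
      - type: audio
        path: En-us-African_elephant.oga  # audio downloaded in step 3
      - type: text
        text: Describe what you see and hear.
  - role: assistant
    content:
      - type: text
        text: An African elephant trumpeting in dry savanna grassland.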

Optimization Guides