Qwen3-Next

Qwen3-Next is a series of next-generation foundation models optimized for extreme context length and large-scale parameter efficiency. The series introduces architectural innovations including Hybrid Attention (Gated DeltaNet + Gated Attention), High-Sparsity MoE with a 1:50 activation ratio (for example, Qwen3-Next-80B-A3B has roughly 80B total parameters but activates only about 3B per token), and Multi-Token Prediction for improved performance and faster inference.

This guide shows how to fine-tune Qwen3-Next with Axolotl on multi-turn conversations with proper loss masking.

Getting started

  1. Install Axolotl following the installation guide. You need to install from main, as Qwen3-Next support is only available in nightly builds, or use our latest Docker images.

    Here is an example of how to install from main with pip:

# Ensure you have PyTorch installed (PyTorch 2.6.0 minimum)
git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl

pip3 install packaging==23.2 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation -e '.[flash-attn]'

# Install CCE https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy
python scripts/cutcrossentropy_install.py | sh
  2. Install the transformers commit with Qwen3-Next support:
pip3 uninstall -y transformers && pip3 install "git+https://github.com/huggingface/transformers.git@b9282355bea846b54ed850a066901496b19da654"
  3. Install flash-linear-attention (FLA) for improved performance:
pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.3.2
  4. Run the finetuning example:
axolotl train examples/qwen3-next/qwen3-next-80b-a3b-qlora.yaml

This config uses about 45.62 GiB VRAM.
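
For reference, here is a minimal sketch of the kind of settings such a QLoRA config combines, including the Cut Cross Entropy plugin installed in step 1. The base model name, LoRA hyperparameters, and batch sizes below are illustrative assumptions; the authoritative values are in examples/qwen3-next/qwen3-next-80b-a3b-qlora.yaml.

# Sketch only -- see examples/qwen3-next/qwen3-next-80b-a3b-qlora.yaml for the real values
base_model: Qwen/Qwen3-Next-80B-A3B-Instruct   # assumed checkpoint; swap in the one you want to tune

# Cut Cross Entropy integration (installed in step 1)
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# QLoRA: 4-bit quantized base weights with LoRA adapters trained on top
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 2e-4
bf16: true
output_dir: ./outputs/qwen3-next-qlora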

Let us know how it goes. Happy finetuning! 🚀

TIPS

  • For inference, you can experiment with temperature: 0.7, top_p: 0.8, top_k: 20, and min_p: 0.
  • You can run a full finetune by removing adapter: qlora and load_in_4bit: true from the config. See the Multi-GPU section below.
  • Read more on how to load your own dataset in the docs.
  • The dataset format follows the OpenAI Messages format as seen here; a minimal dataset config sketch is shown after this list.
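
As a concrete example of the last two tips, a dataset stored as OpenAI-style messages can be loaded with the chat_template dataset type, which also takes care of masking the loss to assistant turns. This is only a sketch: the file path and template choice are placeholders, and the full set of options is covered in the dataset docs referenced above.

# Sketch of a chat_template dataset entry (placeholder path)
chat_template: tokenizer_default      # reuse the chat template bundled with the tokenizer
datasets:
  - path: ./data/conversations.jsonl  # each line: {"messages": [{"role": "user", "content": ...}, ...]}
    type: chat_template
    field_messages: messages          # key holding the OpenAI-style messages list
    roles_to_train: ["assistant"]     # compute loss only on assistant turns (masking)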

Optimization Guides