Kimi Linear

Kimi Linear is a Mixture-of-Experts (MoE) model (48B total parameters, 3B active) from MoonshotAI that uses a hybrid linear attention architecture to reach a 1M-token context length. It uses Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that reduces KV cache size by up to 75% and boosts decoding throughput by up to 6x at long context lengths.

This guide shows how to fine-tune it in Axolotl on multi-turn conversations with proper loss masking.

Note: Axolotl uses experimental training code for Kimi Linear, since the upstream modeling code is inference-only.

Getting started

  1. Install Axolotl following the installation guide.

  2. Install Cut Cross Entropy (CCE) by following the docs.

  3. Run the finetuning example:

    axolotl train examples/kimi-linear/kimi-48b-lora.yaml

This config uses about 98.7 GiB of VRAM.
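
For orientation, the core of a config along these lines looks roughly like the sketch below. This is a minimal sketch, not the contents of examples/kimi-linear/kimi-48b-lora.yaml: the base_model ID and the LoRA hyperparameters are assumptions, and the plugin lines reflect Axolotl's Cut Cross Entropy integration from step 2.

    base_model: moonshotai/Kimi-Linear-48B-A3B-Instruct  # assumed checkpoint ID; swap in the one you want to tune
    trust_remote_code: true   # required for Kimi Linear (see Tips below)

    load_in_8bit: true        # quantize the base weights for the LoRA run
    adapter: lora
    lora_r: 16                # illustrative values; the shipped example may differ
    lora_alpha: 32
    lora_target_linear: true

    plugins:
      - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
    cut_cross_entropy: true   # enables the CCE loss installed in step 2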

Let us know how it goes. Happy finetuning!

Tips

  • Kimi Linear requires trust_remote_code: true.
  • You can run a full finetune by removing adapter: lora and load_in_8bit: true from the config.
  • Read more on how to load your own dataset in the docs
  • The dataset format follows the OpenAI Messages format as seen here; a minimal sketch follows this list.
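
To make the dataset bullets concrete, here is a minimal sketch of loading a local JSONL file in the OpenAI Messages format with Axolotl's chat_template dataset type. The file path is hypothetical, and each line is assumed to be a JSON object with a messages list of role/content turns.

    datasets:
      - path: ./data/conversations.jsonl  # hypothetical file, one {"messages": [...]} object per line
        type: chat_template
        field_messages: messages          # key holding the list of {"role": ..., "content": ...} turns
        roles_to_train: ["assistant"]     # train only on assistant replies; other turns are loss-masked

Axolotl applies the model's chat template to each conversation and masks the turns not listed in roles_to_train.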

Optimization Guides

See 👉 docs.

Limitations

Kimi Linear training in Axolotl is not yet compatible with the MoE kernels from transformers v5.