# Quantization with torchao
Quantization is a technique for lowering the memory footprint of your model, potentially at the cost of some accuracy. We support quantizing your model with the [torchao](https://github.com/pytorch/ao) library, for both post-training quantization (PTQ) and quantization-aware training (QAT).
We do not currently support other quantization formats such as GGUF, GPTQ, or EXL2.
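For context, post-training quantization with torchao boils down to swapping a model's linear layers for quantized equivalents. The sketch below shows roughly what that looks like when calling torchao directly; Axolotl handles this for you via the config described in the next section. The model path is a placeholder, and helper names can vary between torchao releases.

```python
# Minimal sketch of post-training quantization with torchao directly.
# Axolotl wraps this behind its `quantization` config; shown here only for context.
# The model path is a placeholder, and helper names may differ across torchao versions.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "path/to/base_model",  # placeholder
    torch_dtype=torch.bfloat16,
)

# Replace eligible nn.Linear modules with int8 weight-only quantized versions in-place.
quantize_(model, int8_weight_only())
```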
## Configuring Quantization in Axolotl
Quantization is configured using the `quantization` key in your configuration file:
```yaml
base_model: # The path to the model to quantize.
quantization:
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are uintX for X in [1, 2, 3, 4, 5, 6, 7], int4, or int8.
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8".
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization.
  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.
output_dir: # The path to the output directory.
```
Once quantization is complete, your quantized model will be saved in the `{output_dir}/quantized` directory.
You may also use the `quantize` command to quantize a model which has been trained with QAT. To do this, pass the existing QAT configuration file which you used to train the model:
```yaml
# qat.yml
qat:
  activation_dtype: int8
  weight_dtype: int8
  group_size: 256
  quantize_embedding: true
output_dir: # The path to the output directory used during training where the final checkpoint has been saved.
```
```bash
axolotl quantize qat.yml
```
This ensures that the model is quantized with the same configuration that was used to train it.
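For background, torchao's QAT workflow inserts "fake" quantization ops during training and later converts them into genuinely quantized layers, which is why the conversion step must see the same dtypes and group size as training did. Below is a rough sketch of that prepare → train → convert flow using one of torchao's QAT quantizers; it is illustrative only, not Axolotl's internal implementation, the dtypes do not match the example config above, and the module path may differ between torchao versions (older releases expose it under `torchao.quantization.prototype.qat`).

```python
# Rough sketch of torchao's QAT prepare -> train -> convert flow (illustrative only).
import torch
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

# Toy stand-in for the model being trained.
model = torch.nn.Sequential(torch.nn.Linear(256, 256))

# The same quantizer settings must be used for both prepare() and convert().
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=256)

# prepare() swaps linear layers for fake-quantized versions used during training.
model = qat_quantizer.prepare(model)

# ... training loop runs here, with fake quantization applied in forward passes ...

# convert() replaces the fake-quantized layers with genuinely quantized ones.
model = qat_quantizer.convert(model)
```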