Quantization with torchao

Quantization is a technique for lowering the memory footprint of your model, potentially at the cost of some accuracy or model performance. We support quantizing your model using the torchao library. Both post-training quantization (PTQ) and quantization-aware training (QAT) workflows are available.

Note

We do not currently support other quantization formats such as GGUF, GPTQ, or EXL2.

Configuring Quantization in Axolotl

Quantization is configured using the quantization key in your configuration file.

base_model: # The path to the model to quantize.
quantization:
  weight_dtype: # Optional[str] = "int8". Fake quantization layout to use for weight quantization. Valid options are "uintX" for X in [1, 2, 3, 4, 5, 6, 7], "int4", or "int8".
  activation_dtype: # Optional[str] = "int8". Fake quantization layout to use for activation quantization. Valid options are "int4" and "int8".
  group_size: # Optional[int] = 32. The number of elements in each group for per-group fake quantization.
  quantize_embedding: # Optional[bool] = False. Whether to quantize the embedding layer.

output_dir:  # The path to the output directory.

Once quantization is complete, your quantized model will be saved in the {output_dir}/quantized directory.
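
For example, a minimal PTQ configuration and the corresponding command might look like the following. The model path and output path here are placeholders for illustration, not defaults shipped with Axolotl:

# quantize.yml
base_model: ./outputs/my-finetune       # placeholder: path to the trained model to quantize
quantization:
  weight_dtype: int8
  activation_dtype: int8
  group_size: 32
  quantize_embedding: false

output_dir: ./outputs/my-finetune-int8  # the quantized model will be saved to ./outputs/my-finetune-int8/quantized

axolotl quantize quantize.yml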

You may also use the quantize command to quantize a model that has been trained with QAT. To do this, reuse the QAT configuration file you used to train the model:

# qat.yml
qat:
  activation_dtype: int8
  weight_dtype: int8
  group_size: 256
  quantize_embedding: true

output_dir: # The path to the output directory used during training, where the final checkpoint has been saved.

axolotl quantize qat.yml

This ensures that the model is quantized with the same configuration that was used to train it.
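
Putting it together, the QAT workflow with the Axolotl CLI typically looks like the following (qat.yml is just an example filename):

# Train with fake quantization applied according to the qat block
axolotl train qat.yml

# Produce the real quantized model from the final checkpoint, using the
# same dtypes and group size that were used during training
axolotl quantize qat.yml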