monkeypatch.moe_quant
Loading-time quantization for MoE expert weights stored as 3D nn.Parameter tensors.
Classes
| Name | Description |
|---|---|
| Bnb8bitParametrization | Dequantizes int8 row-wise quantized data on access. |
Bnb8bitParametrization
monkeypatch.moe_quant.Bnb8bitParametrization(row_stats)
Dequantizes int8 row-wise quantized data on access.
Methods
| Name | Description |
|---|---|
| forward | Flatten 3D+ to 2D for BnB’s dequant, then reshape back. |
forward
monkeypatch.moe_quant.Bnb8bitParametrization.forward(quantized_param)
Flatten 3D+ to 2D for BnB’s dequant, then reshape back.
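The flatten, dequantize, reshape sequence can be illustrated with plain PyTorch. This is a minimal sketch, not the module's code: it assumes symmetric absmax scaling by 127 with one scale per row, and the helper names `quantize_rowwise_int8` and `dequantize_nd` are hypothetical. BnB's own kernels are not called here.

```python
import torch


def quantize_rowwise_int8(weight_2d):
    """Row-wise absmax int8 quantization of a 2D tensor (illustrative only)."""
    # One scale per row; the clamp avoids division by zero for all-zero rows.
    row_stats = weight_2d.abs().amax(dim=1).clamp(min=1e-8)
    q = torch.round(weight_2d * (127.0 / row_stats.unsqueeze(1))).to(torch.int8)
    return q, row_stats


def dequantize_nd(quantized, row_stats):
    """Flatten 3D+ int8 data to 2D, dequantize per row, restore the shape."""
    shape = quantized.shape
    q2d = quantized.reshape(-1, shape[-1]).to(torch.float32)
    out = q2d * (row_stats.unsqueeze(1) / 127.0)
    return out.reshape(shape)


torch.manual_seed(0)
# Assumed layout: (num_experts, out_features, in_features).
experts = torch.randn(4, 8, 16)

q2d, row_stats = quantize_rowwise_int8(experts.reshape(-1, experts.shape[-1]))
quantized = q2d.reshape(experts.shape)  # stored as a 3D int8 tensor

restored = dequantize_nd(quantized, row_stats)
max_err = (restored - experts).abs().max().item()
```

Because every row of the flattened view keeps its own scale, the 3D tensor has `num_experts * out_features` entries in `row_stats`, and the rounding error per element is bounded by half a quantization step of its row.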
Functions
| Name | Description |
|---|---|
| get_moe_quantized_count | Return the number of expert parameters quantized during loading. |
| patch_moe_quantization_on_load | Patch transformers’ weight loading to quantize MoE expert params on-the-fly. |
| patch_peft_target_parameters_matching | Fix PEFT’s _inject_parameters for target_parameters on quantized MoE experts. |
| replace_parameter_8bit | Replace a module parameter with an 8-bit quantized version using parametrization. |
get_moe_quantized_count
monkeypatch.moe_quant.get_moe_quantized_count()
Return the number of expert parameters quantized during loading.
patch_moe_quantization_on_load
monkeypatch.moe_quant.patch_moe_quantization_on_load(cfg)
Patch transformers’ weight loading to quantize MoE expert params on-the-fly.
patch_peft_target_parameters_matching
monkeypatch.moe_quant.patch_peft_target_parameters_matching()
Fix PEFT’s _inject_parameters for target_parameters on quantized MoE experts.
- Expands short suffixes to full module paths for parametrized modules.
- Iterates params in definition order (not alphabetical order) so saved adapters are compatible with standard PEFT, vLLM, etc.
- Skips ParametrizationList synthetic paths to prevent PEFT from mistakenly targeting quantized expert params via name-suffix matching.
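The three rules above can be modeled on plain strings. The sketch below is a toy illustration only: it does not call PEFT, `expand_targets` is a hypothetical helper, and the example paths are assumed names for an MoE layer whose expert parameters have been parametrized.

```python
def expand_targets(paths, targets):
    """Expand short suffixes to full paths, keeping the given (definition) order."""
    expanded = []
    for path in paths:  # iterate as given; deliberately no sorted()
        if "parametrizations" in path.split("."):
            # Synthetic ParametrizationList path: its name ends with the
            # parameter name, so naive suffix matching would pick it up.
            continue
        if any(path == t or path.endswith("." + t) for t in targets):
            expanded.append(path)
    return expanded


# Assumed paths, listed in definition order.
paths = [
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.parametrizations.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
    "model.layers.0.mlp.experts.parametrizations.down_proj",
]

result = expand_targets(paths, ["gate_up_proj", "down_proj"])
```

Here `result` keeps `gate_up_proj` ahead of `down_proj`, whereas alphabetical sorting would reverse them. That ordering difference is the kind of mismatch the second rule guards against.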
replace_parameter_8bit
monkeypatch.moe_quant.replace_parameter_8bit(module, param_name)
Replace a module parameter with an 8-bit quantized version using parametrization.
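The general pattern of swapping a parameter for int8 storage plus a dequantizing parametrization can be sketched with `torch.nn.utils.parametrize`. Everything below is an assumption for illustration: `RowwiseInt8Dequant`, `replace_parameter_int8_sketch`, and the `Experts` module are hypothetical stand-ins, and the quantization is a plain absmax scheme rather than BnB's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as P


class RowwiseInt8Dequant(nn.Module):
    """Hypothetical stand-in for a dequantizing parametrization."""

    def __init__(self, row_stats):
        super().__init__()
        self.register_buffer("row_stats", row_stats)

    def forward(self, quantized_param):
        shape = quantized_param.shape
        q2d = quantized_param.reshape(-1, shape[-1]).to(torch.float32)
        return (q2d * (self.row_stats.unsqueeze(1) / 127.0)).reshape(shape)


def replace_parameter_int8_sketch(module, param_name):
    """Store `param_name` as int8 and dequantize it on every attribute access."""
    weight = getattr(module, param_name).detach()
    w2d = weight.reshape(-1, weight.shape[-1]).float()
    row_stats = w2d.abs().amax(dim=1).clamp(min=1e-8)
    q = torch.round(w2d * (127.0 / row_stats.unsqueeze(1))).to(torch.int8)
    # Integer parameters cannot require gradients.
    setattr(module, param_name, nn.Parameter(q.reshape(weight.shape), requires_grad=False))
    # unsafe=True skips the dtype/shape consistency check, which would
    # otherwise reject a parametrization that maps int8 to float32.
    P.register_parametrization(module, param_name, RowwiseInt8Dequant(row_stats), unsafe=True)


class Experts(nn.Module):
    """Hypothetical module holding a 3D expert weight."""

    def __init__(self):
        super().__init__()
        self.gate_up_proj = nn.Parameter(torch.randn(4, 8, 16))


torch.manual_seed(0)
mod = Experts()
reference = mod.gate_up_proj.detach().clone()

replace_parameter_int8_sketch(mod, "gate_up_proj")

stored = mod.parametrizations.gate_up_proj.original  # int8 storage
dequantized = mod.gate_up_proj                        # float32, computed on access
param_names = [name for name, _ in mod.named_parameters()]
```

After registration the only named parameter is `parametrizations.gate_up_proj.original`, and `mod.gate_up_proj` becomes a property that runs the parametrization each time it is read.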