utils.mistral.mistral_tokenizer
utils.mistral.mistral_tokenizer
Wrapper for MistralTokenizer from mistral-common
Classes
| Name | Description |
|---|---|
| HFMistralTokenizer | Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer |
HFMistralTokenizer
utils.mistral.mistral_tokenizer.HFMistralTokenizer(name_or_path, **kwargs)Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer and exposes HuggingFace API for special tokens.
Attributes
| Name | Description |
|---|---|
| chat_template | Chat template is not supported. Dummy method to satisfy HuggingFace API. |
Methods
| Name | Description |
|---|---|
| apply_chat_template | Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg |
| decode | Decode token_ids into str. |
| from_pretrained | Patched fn to pass name_or_path and remove extra kwargs. |
| save_pretrained | Patches to remove save_jinja_files from being passed onwards. |
apply_chat_template
utils.mistral.mistral_tokenizer.HFMistralTokenizer.apply_chat_template(
conversation,
chat_template=None,
add_generation_prompt=False,
**kwargs,
)Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg
decode
utils.mistral.mistral_tokenizer.HFMistralTokenizer.decode(token_ids, **kwargs)Decode token_ids into str.
This overrides upstream.decode to convert int to list[int]
from_pretrained
utils.mistral.mistral_tokenizer.HFMistralTokenizer.from_pretrained(
pretrained_model_name_or_path,
*init_inputs,
mode=ValidationMode.test,
cache_dir=None,
force_download=False,
local_files_only=False,
token=None,
revision='main',
model_max_length=VERY_LARGE_INTEGER,
padding_side='left',
truncation_side='right',
model_input_names=None,
clean_up_tokenization_spaces=False,
**kwargs,
)Patched fn to pass name_or_path and remove extra kwargs.
Instantiate a MistralCommonBackend from a predefined
tokenizer.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pretrained_model_name_or_path | str or os.PathLike |
Can be either: - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. - A path to a directory containing the tokenizer config, for instance saved using the [MistralCommonBackend.tokenization_mistral_common.save_pretrained] method, e.g., ./my_model_directory/. |
required |
| mode | ValidationMode, optional, defaults to ValidationMode.test |
Validation mode for the MistralTokenizer tokenizer. |
ValidationMode.test |
| cache_dir | str or os.PathLike, optional |
Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used. | None |
| force_download | bool, optional, defaults to False |
Whether or not to force the (re-)download the vocabulary files and override the cached versions if they exist. | False |
| token | str or bool, optional |
The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running hf auth login (stored in ~/.huggingface). |
None |
| local_files_only | bool, optional, defaults to False |
Whether or not to only rely on local files and not to attempt to download any files. | False |
| revision | str, optional, defaults to \"main\" |
The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git. |
'main' |
| max_length | int, optional |
Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated. |
required |
| padding_side | str, optional, defaults to \"left\" |
The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name. | 'left' |
| truncation_side | str, optional, defaults to \"right\" |
The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’]. | 'right' |
| model_input_names | List\[string\], optional |
The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name. |
None |
| clean_up_tokenization_spaces | bool, optional, defaults to False |
Whether or not the model should cleanup the spaces that were added when splitting the input text during the tokenization process. | False |
| kwargs | additional keyword arguments, optional | Not supported by MistralCommonBackend.from_pretrained. Will raise an error if used. |
{} |
save_pretrained
utils.mistral.mistral_tokenizer.HFMistralTokenizer.save_pretrained(
*args,
**kwargs,
)Patches to remove save_jinja_files from being passed onwards.