utils.mistral.mistral_tokenizer

utils.mistral.mistral_tokenizer

Wrapper for MistralTokenizer from mistral-common

Classes

Name	Description
HFMistralTokenizer	Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer

HFMistralTokenizer

utils.mistral.mistral_tokenizer.HFMistralTokenizer(name_or_path, **kwargs)

Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer and exposes HuggingFace API for special tokens.

Attributes

Name	Description
chat_template	Chat template is not supported. Dummy method to satisfy HuggingFace API.

Methods

Name	Description
apply_chat_template	Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg
decode	Decode token_ids into str.
from_pretrained	Patched fn to pass `name_or_path` and remove extra kwargs.
save_pretrained	Patches to remove save_jinja_files from being passed onwards.

apply_chat_template

utils.mistral.mistral_tokenizer.HFMistralTokenizer.apply_chat_template(
    conversation,
    chat_template=None,
    add_generation_prompt=False,
    **kwargs,
)

Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg

decode

utils.mistral.mistral_tokenizer.HFMistralTokenizer.decode(token_ids, **kwargs)

Decode token_ids into str.

This overrides upstream.decode to convert int to list[int]

from_pretrained

utils.mistral.mistral_tokenizer.HFMistralTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    *init_inputs,
    mode=ValidationMode.test,
    cache_dir=None,
    force_download=False,
    local_files_only=False,
    token=None,
    revision='main',
    model_max_length=VERY_LARGE_INTEGER,
    padding_side='left',
    truncation_side='right',
    model_input_names=None,
    clean_up_tokenization_spaces=False,
    **kwargs,
)

Patched fn to pass name_or_path and remove extra kwargs.

Instantiate a MistralCommonBackend from a predefined tokenizer.

Parameters

Name	Type	Description	Default
pretrained_model_name_or_path	`str` or `os.PathLike`	Can be either: - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. - A path to a directory containing the tokenizer config, for instance saved using the [`MistralCommonBackend.tokenization_mistral_common.save_pretrained`] method, e.g., `./my_model_directory/`.	required
mode	`ValidationMode`, optional, defaults to `ValidationMode.test`	Validation mode for the `MistralTokenizer` tokenizer.	`ValidationMode.test`
cache_dir	`str` or `os.PathLike`, optional	Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.	`None`
force_download	`bool`, optional, defaults to `False`	Whether or not to force the (re-)download the vocabulary files and override the cached versions if they exist.	`False`
token	`str` or bool, optional	The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `hf auth login` (stored in `~/.huggingface`).	`None`
local_files_only	`bool`, optional, defaults to `False`	Whether or not to only rely on local files and not to attempt to download any files.	`False`
revision	`str`, optional, defaults to `\"main\"`	The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git.	`'main'`
max_length	`int`, optional	Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to `None`, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.	required
padding_side	`str`, optional, defaults to `\"left\"`	The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.	`'left'`
truncation_side	`str`, optional, defaults to `\"right\"`	The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’].	`'right'`
model_input_names	`List\[string\]`, optional	The list of inputs accepted by the forward pass of the model (like `"token_type_ids"` or `"attention_mask"`). Default value is picked from the class attribute of the same name.	`None`
clean_up_tokenization_spaces	`bool`, optional, defaults to `False`	Whether or not the model should cleanup the spaces that were added when splitting the input text during the tokenization process.	`False`
kwargs	additional keyword arguments, optional	Not supported by `MistralCommonBackend.from_pretrained`. Will raise an error if used.	`{}`

save_pretrained

utils.mistral.mistral_tokenizer.HFMistralTokenizer.save_pretrained(
    *args,
    **kwargs,
)

Patches to remove save_jinja_files from being passed onwards.