utils.mistral.mistral_tokenizer

utils.mistral.mistral_tokenizer

Wrapper for MistralTokenizer from mistral-common

Classes

Name Description
HFMistralTokenizer Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer

HFMistralTokenizer

utils.mistral.mistral_tokenizer.HFMistralTokenizer(name_or_path, **kwargs)

Wraps mistral_common.tokens.tokenizers.mistral.MistralTokenizer and exposes HuggingFace API for special tokens.

Attributes

Name Description
chat_template Chat template is not supported. Dummy method to satisfy HuggingFace API.

Methods

Name Description
apply_chat_template Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg
decode Decode token_ids into str.
from_pretrained Patched fn to pass name_or_path and remove extra kwargs.
save_pretrained Patches to remove save_jinja_files from being passed onwards.
apply_chat_template
utils.mistral.mistral_tokenizer.HFMistralTokenizer.apply_chat_template(
    conversation,
    chat_template=None,
    add_generation_prompt=False,
    **kwargs,
)

Patched fn to handle setting test mode, remove chat_template and add_generation_prompt kwarg

decode
utils.mistral.mistral_tokenizer.HFMistralTokenizer.decode(token_ids, **kwargs)

Decode token_ids into str.

This overrides upstream.decode to convert int to list[int]

from_pretrained
utils.mistral.mistral_tokenizer.HFMistralTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    *init_inputs,
    mode=ValidationMode.test,
    cache_dir=None,
    force_download=False,
    local_files_only=False,
    token=None,
    revision='main',
    model_max_length=VERY_LARGE_INTEGER,
    padding_side='left',
    truncation_side='right',
    model_input_names=None,
    clean_up_tokenization_spaces=False,
    **kwargs,
)

Patched fn to pass name_or_path and remove extra kwargs.

Instantiate a MistralCommonBackend from a predefined tokenizer.

Parameters
Name Type Description Default
pretrained_model_name_or_path str or os.PathLike Can be either: - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. - A path to a directory containing the tokenizer config, for instance saved using the [MistralCommonBackend.tokenization_mistral_common.save_pretrained] method, e.g., ./my_model_directory/. required
mode ValidationMode, optional, defaults to ValidationMode.test Validation mode for the MistralTokenizer tokenizer. ValidationMode.test
cache_dir str or os.PathLike, optional Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used. None
force_download bool, optional, defaults to False Whether or not to force the (re-)download the vocabulary files and override the cached versions if they exist. False
token str or bool, optional The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running hf auth login (stored in ~/.huggingface). None
local_files_only bool, optional, defaults to False Whether or not to only rely on local files and not to attempt to download any files. False
revision str, optional, defaults to \"main\" The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git. 'main'
max_length int, optional Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated. required
padding_side str, optional, defaults to \"left\" The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name. 'left'
truncation_side str, optional, defaults to \"right\" The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’]. 'right'
model_input_names List\[string\], optional The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name. None
clean_up_tokenization_spaces bool, optional, defaults to False Whether or not the model should cleanup the spaces that were added when splitting the input text during the tokenization process. False
kwargs additional keyword arguments, optional Not supported by MistralCommonBackend.from_pretrained. Will raise an error if used. {}
save_pretrained
utils.mistral.mistral_tokenizer.HFMistralTokenizer.save_pretrained(
    *args,
    **kwargs,
)

Patches to remove save_jinja_files from being passed onwards.