loaders.tokenizer

Tokenizer loading functionality and associated utilities.

Functions

Name	Description
load_tokenizer	Load and configure the tokenizer based on the provided config.
modify_tokenizer_files	Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer.

loaders.tokenizer.load_tokenizer(cfg)

Load and configure the tokenizer based on the provided config.

loaders.tokenizer.modify_tokenizer_files(
    tokenizer_path,
    token_mappings,
    output_dir,
)

Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer.

This only works with reserved tokens that were added to the tokenizer, not tokens already part of the vocab.

Parameters

Name	Type	Description	Default
tokenizer_path	str	Path or name of the original tokenizer	required
token_mappings	dict[int, str]	Dict mapping {token_id (int): new_token_string}	required
output_dir	str	Directory to save the modified tokenizer	required

Returns

Name	Type	Description
	str	Path to the modified tokenizer directory

Ref: https://github.com/huggingface/transformers/issues/27974#issuecomment-1854188941
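As a rough illustration of the kind of file edit that modify_tokenizer_files describes, the sketch below renames added tokens directly in a saved tokenizer's JSON files. It is not the library's implementation. The helper name rename_added_tokens is hypothetical, and the file layout it relies on (an "added_tokens_decoder" map in tokenizer_config.json, and an "added_tokens" list plus "model.vocab" in tokenizer.json) is an assumption about how Hugging Face fast tokenizers are typically serialized. The demo uses a tiny hand-written tokenizer directory and the made-up token strings <|reserved_0|> and <|tool_call|>.

```python
import json
import shutil
import tempfile
from pathlib import Path


def rename_added_tokens(tokenizer_dir: str, token_mappings: dict[int, str], output_dir: str) -> str:
    """Hypothetical helper: copy a saved tokenizer and rename added tokens by id."""
    out = Path(output_dir)
    shutil.copytree(tokenizer_dir, out, dirs_exist_ok=True)

    # Assumed layout: tokenizer_config.json keys "added_tokens_decoder" by the id as a string.
    config_path = out / "tokenizer_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        decoder = config.get("added_tokens_decoder", {})
        for token_id, new_string in token_mappings.items():
            if str(token_id) in decoder:
                decoder[str(token_id)]["content"] = new_string
        config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

    # Assumed layout: tokenizer.json lists added tokens as {"id": ..., "content": ...}.
    tok_path = out / "tokenizer.json"
    if tok_path.exists():
        data = json.loads(tok_path.read_text(encoding="utf-8"))
        vocab = data.get("model", {}).get("vocab", {})
        for entry in data.get("added_tokens", []):
            if entry["id"] in token_mappings:
                old, new = entry["content"], token_mappings[entry["id"]]
                entry["content"] = new
                # Keep the vocab in sync if the old string is listed there under the same id.
                if isinstance(vocab, dict) and vocab.get(old) == entry["id"]:
                    del vocab[old]
                    vocab[new] = entry["id"]
        tok_path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    return str(out)


# Demo on a minimal, hand-written tokenizer directory (not a real model's files).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "original"
    src.mkdir()
    (src / "tokenizer_config.json").write_text(
        json.dumps({"added_tokens_decoder": {"5": {"content": "<|reserved_0|>", "special": True}}}),
        encoding="utf-8",
    )
    (src / "tokenizer.json").write_text(
        json.dumps({
            "added_tokens": [{"id": 5, "content": "<|reserved_0|>", "special": True}],
            "model": {"vocab": {"hello": 0, "<|reserved_0|>": 5}},
        }),
        encoding="utf-8",
    )

    out_dir = rename_added_tokens(str(src), {5: "<|tool_call|>"}, str(Path(tmp) / "modified"))
    patched = json.loads((Path(out_dir) / "tokenizer.json").read_text(encoding="utf-8"))
    patched_config = json.loads((Path(out_dir) / "tokenizer_config.json").read_text(encoding="utf-8"))
```

Because the rename is keyed on token id and only touches entries found in the added-token lists, ids that belong to ordinary vocabulary entries are left unchanged, which mirrors the restriction noted above.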