loaders.tokenizer

Tokenizer loading functionality and associated utilities.

Functions

Name                    Description
load_tokenizer          Load and configure the tokenizer based on the provided config.
modify_tokenizer_files  Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer.

load_tokenizer

loaders.tokenizer.load_tokenizer(cfg)

Load and configure the tokenizer based on the provided config.

modify_tokenizer_files

loaders.tokenizer.modify_tokenizer_files(
    tokenizer_path,
    token_mappings,
    output_dir,
)

Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer.

This works only for reserved tokens that were added to the tokenizer (its added_tokens entries), not for tokens that are already part of the vocab.

Parameters

Name            Type            Description                                          Default
tokenizer_path  str             Path or name of the original tokenizer               required
token_mappings  dict[int, str]  Dict mapping {token_id (int): new_token_string}      required
output_dir      str             Directory to save the modified tokenizer             required

Returns

Type  Description
str   Path to the modified tokenizer directory

Ref: https://github.com/huggingface/transformers/issues/27974#issuecomment-1854188941
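
To make the behaviour described above more concrete, here is a minimal sketch of the same idea using only the standard library. It is not the implementation of modify_tokenizer_files: the function name rename_added_tokens is hypothetical, and the sketch assumes the tokenizer has already been saved to a local directory in the usual Hugging Face layout (a tokenizer.json with an "added_tokens" list and a tokenizer_config.json with an "added_tokens_decoder" map). Raising an error for IDs that are not added tokens is also an assumption, chosen to illustrate the restriction to reserved tokens.

```python
import json
import shutil
from pathlib import Path


def rename_added_tokens(tokenizer_dir, token_mappings, output_dir):
    """Hypothetical stand-in (not the library's code): copy a saved tokenizer
    directory and rename added tokens by ID in the copy."""
    src, out = Path(tokenizer_dir), Path(output_dir)
    shutil.copytree(src, out, dirs_exist_ok=True)

    # tokenizer.json lists added tokens as {"id": ..., "content": ...} entries.
    tok_file = out / "tokenizer.json"
    if tok_file.exists():
        data = json.loads(tok_file.read_text(encoding="utf-8"))
        found = set()
        for entry in data.get("added_tokens", []):
            if entry["id"] in token_mappings:
                entry["content"] = token_mappings[entry["id"]]
                found.add(entry["id"])
        # Assumption: IDs outside added_tokens (regular vocab) are rejected.
        missing = set(token_mappings) - found
        if missing:
            raise ValueError(f"IDs are not added tokens: {sorted(missing)}")
        tok_file.write_text(
            json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
        )

    # tokenizer_config.json keys added_tokens_decoder by the ID as a string.
    cfg_file = out / "tokenizer_config.json"
    if cfg_file.exists():
        cfg = json.loads(cfg_file.read_text(encoding="utf-8"))
        decoder = cfg.get("added_tokens_decoder", {})
        for token_id, new_text in token_mappings.items():
            if str(token_id) in decoder:
                decoder[str(token_id)]["content"] = new_text
        cfg_file.write_text(
            json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8"
        )

    return str(out)
```

As with modify_tokenizer_files, the sketch takes a dict[int, str] mapping and returns the output directory as a string, so the result could be passed on as a tokenizer path. The original files are left untouched because all edits happen in the copy.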