loaders.tokenizer
Tokenizer loading functionality and associated utils
Functions
Name | Description |
---|---|
load_tokenizer | Load and configure the tokenizer based on the provided config. |
modify_tokenizer_files | Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer. |
load_tokenizer
loaders.tokenizer.load_tokenizer(cfg)
Load and configure the tokenizer based on the provided config.
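A minimal usage sketch, hedged: the config schema is defined elsewhere in the library, so the plain-dict config and the keys shown here (`base_model`, `special_tokens`) are illustrative assumptions rather than the documented interface.

```python
# Illustrative only: the real config object and its keys are defined by the
# surrounding library; `base_model` and `special_tokens` are assumed names.
# The import path mirrors the module path shown above and may need the
# package prefix used by your installation.
from loaders.tokenizer import load_tokenizer

cfg = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",  # tokenizer source (assumed key)
    "special_tokens": {"pad_token": "<pad>"},          # extra special tokens (assumed key)
}

tokenizer = load_tokenizer(cfg)
print(tokenizer("Hello, world!")["input_ids"])
```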
modify_tokenizer_files
loaders.tokenizer.modify_tokenizer_files(tokenizer_path, token_mappings, output_dir)
Modify tokenizer files to replace added_tokens strings, save to output directory, and return the path to the modified tokenizer.
This only works with reserved tokens that were added to the tokenizer, not tokens already part of the vocab.
Parameters
Name | Type | Description | Default |
---|---|---|---|
tokenizer_path | str | Path or name of the original tokenizer | required |
token_mappings | dict[int, str] | Dict mapping {token_id (int): new_token_string} | required |
output_dir | str | Directory to save the modified tokenizer | required |
Returns
Type | Description |
---|---|
str | Path to the modified tokenizer directory |
Ref: https://github.com/huggingface/transformers/issues/27974#issuecomment-1854188941
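A hedged usage sketch of the call above: the token IDs and replacement strings are made-up examples, and the base tokenizer name is only illustrative. In practice the IDs must correspond to tokens that were added to the tokenizer (e.g. reserved special tokens), not base-vocabulary tokens.

```python
# Illustrative usage: remap two reserved/added placeholder tokens to new
# strings, then load the rewritten tokenizer files from the output directory.
# The token IDs below are assumptions; use IDs that exist as *added* tokens
# in your tokenizer.
from transformers import AutoTokenizer
from loaders.tokenizer import modify_tokenizer_files

token_mappings = {
    128011: "<|tool_call|>",    # token_id -> new token string (assumed ID)
    128012: "<|tool_result|>",  # assumed ID
}

modified_path = modify_tokenizer_files(
    tokenizer_path="meta-llama/Llama-3.1-8B-Instruct",  # original tokenizer (example)
    token_mappings=token_mappings,
    output_dir="./modified_tokenizer",
)

# Verify the remapped strings by loading the modified tokenizer.
tokenizer = AutoTokenizer.from_pretrained(modified_path)
print(tokenizer.convert_ids_to_tokens([128011, 128012]))
```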