tokenizer : special token handling #3538
Merged
Changes from 1 commit (11 commits total):

- b592c70 Rewrite special token handling from #1931 (staviq)
- fc634d8 shorten param name, add st verification by type (staviq)
- 29e6b46 use offsets instead of copy by substr (staviq)
- eac5f54 Merge branch 'master' into specialtokens (staviq)
- f7b1205 formatting, remove copying iterator on delete (staviq)
- 04ac055 Merge branch 'master' into HEAD (ggerganov)
- 5c6b2be llama : normalize code-style (ggerganov)
- 0f1c569 swift fix (staviq)
- 5974d61 print pfx/sfx if verb, main: split pfx input sfx (staviq)
- 1c28116 dont add space when using special tokens (staviq)
- fc82541 minor : comment + spacing (ggerganov)
Viewing changes from commit fc634d87a8904b0844fbb71eb045d0f26d8bfd94: shorten param name, add st verification by type
Feel like we have a redundancy here. We have `token_data.type`, which is supposed to tell us if a token is special or not. In which cases wouldn't this work?

I guess we can have this piece of code here as a sanity check that the special tokens we have read from the vocabulary are indeed special, but ideally, AFAIU, we shouldn't need this code in order to function correctly, correct?
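A minimal sketch of what relying on the type alone would look like (the enum and struct names below are illustrative stand-ins for a vocab table of (text, score, type) entries, not the exact definitions in the tree):

```cpp
// Illustrative sketch only: collect the ids whose vocab type is anything
// other than NORMAL, i.e. the tokens that the type field alone would
// already flag as special. Names are assumptions, not llama.cpp's API.
#include <cstdint>
#include <string>
#include <vector>

enum token_type : int32_t {
    TOKEN_TYPE_NORMAL       = 1,
    TOKEN_TYPE_UNKNOWN      = 2,
    TOKEN_TYPE_CONTROL      = 3,
    TOKEN_TYPE_USER_DEFINED = 4,
    TOKEN_TYPE_BYTE         = 6,
};

struct token_data {
    std::string text;
    float       score;
    token_type  type;
};

// ids of all tokens that the vocab itself declares non-NORMAL
static std::vector<int32_t> special_ids_by_type(const std::vector<token_data> & id_to_token) {
    std::vector<int32_t> ids;
    for (int32_t id = 0; id < (int32_t) id_to_token.size(); ++id) {
        if (id_to_token[id].type != TOKEN_TYPE_NORMAL) {
            ids.push_back(id);
        }
    }
    return ids;
}
```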
Yes. My approach is meant to be temporary (or a fallback solution), until I know for sure that special tokens will always be marked as such in the vocab.

@goerch wasn't certain about token types in the BPE PR, so I opted for finding special tokens manually for now.
Shouldn't we get rid of #3502 first? This looks pretty basic (and terrible :) to me.
I'm open to critique, but you have to clarify what you mean by "this" and "terrible" :) Otherwise I'm not sure how to proceed here.
OK, I see. We can probably combine both the tokens marked as special and those that are unmatchable by other tokens, and print a warning if there is a mismatch between the two sets.

Let's see if @goerch has anything else to add, and then I would like to make a pass over the implementation to bring the coding style in line. After that we can merge.
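Roughly something like this (just a sketch of the "combine and warn" idea, assuming the two sets of ids have already been collected; not the code that ended up in the PR):

```cpp
// Sketch: take the union of the ids marked special by their vocab type
// and the ids found to be unmatchable by the tokenizer, and warn when
// the two sets disagree for a given id.
#include <cstdio>
#include <set>

static std::set<int> merge_special_sets(const std::set<int> & by_type,
                                        const std::set<int> & by_matching) {
    std::set<int> merged = by_type;
    merged.insert(by_matching.begin(), by_matching.end());

    for (const int id : merged) {
        const bool in_type  = by_type.count(id)     > 0;
        const bool in_match = by_matching.count(id) > 0;
        if (in_type != in_match) {
            fprintf(stderr, "warning: token %d is special by %s only\n",
                    id, in_type ? "type" : "matching");
        }
    }
    return merged;
}
```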
Just to be sure about our nomenclature: do you mean special tokens == added tokens? I'm mostly leaning towards the nomenclature used in HF's `tokenizer.json`, which contains `added_tokens` with the `is_special` attribute. From @jploski I learned that these added tokens are probably very similar to `sentence_piece` `USER_DEFINED` tokens.

And one more question: do we agree that `vocab.json`, `merges.txt`, `added_tokens.json` and `special_tokens_map.json` are probably older HF (or GPT-2) serialization formats, and that we should find all of this information in `tokenizer.json` for newer models (paging @apage43 too, because he knows a lot about this stuff)? W.r.t. these serialization formats, we also seem to have some reports indicating that we don't support both of them equally well.

Edit: just adding a hunch: is it possible that `tokenizer.json` is the serialization format invented for the fast HF tokenizers?
Oh my, you better grab a chair because you are in for an adventure with this question :)
A really condensed version of what I learned so far:
I couldn't find any clear definition or explanation, but from what I found, there seems to be only one way all the pieces of this puzzle fall together to form a clear picture.

A vocab naturally forms a tree-like structure (several trees actually, since there is no common root).

The tokenizer does its thing by grabbing small portions of the input text and trying to steal neighboring portions to form a valid token, rinse and repeat, until nothing can be borrowed from either the left or the right to merge into a valid token.

So basically, the tokenizer climbs that token tree from single bytes to full tokens, stepping only on valid branches (tokens).

Applying this idea backwards, by going through the entire vocab token by token, trying to split each token's text representation in two and checking whether it still forms two valid tokens, you can clearly see that some of the tokens in the vocab are not part of that tree family and will never be matched by the tokenizer.
Which happens to match perfectly with normal tokens being marked as `LLAMA_TOKEN_TYPE_NORMAL` in the vocab, and tokens that are not part of the token tree family being marked as some type other than `LLAMA_TOKEN_TYPE_NORMAL`.

Therefore, I'm using the term "special token" to refer to tokens which are "hidden" from the user and unmatchable by the tokenizer.
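A rough sketch of that backwards check, with made-up helper names and assuming the vocab is available as a plain list of token strings (not the PR's actual code):

```cpp
// Sketch of the "split in two" check described above: a token whose text
// can never be split into two other valid vocab entries is unreachable
// by the merging tokenizer, so it is treated as a candidate special token.
#include <string>
#include <unordered_set>
#include <vector>

static std::vector<int> find_unmatchable_tokens(const std::vector<std::string> & id_to_text) {
    std::unordered_set<std::string> vocab(id_to_text.begin(), id_to_text.end());

    std::vector<int> unmatchable;
    for (int id = 0; id < (int) id_to_text.size(); ++id) {
        const std::string & text = id_to_text[id];
        if (text.size() < 2) {
            continue; // single-byte tokens are the leaves of the tree
        }
        bool splittable = false;
        for (size_t i = 1; i < text.size() && !splittable; ++i) {
            splittable = vocab.count(text.substr(0, i)) > 0 &&
                         vocab.count(text.substr(i))    > 0;
        }
        if (!splittable) {
            unmatchable.push_back(id); // never producible by a merge
        }
    }
    return unmatchable;
}
```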
From the practical point of view, that subset of "special" tokens contains exactly what one would expect: `<s>`, `</s>`, `<unk>`, and in the case of Mistral OpenOrca `<|im_start|>` and `<|im_end|>`, i.e. tokens which are used to control the flow of inference. (EDIT: plus byte tokens `<0xXX>`)

And so, from the point of view of actual inference, the tokenizer has to be extended the same way regardless of the actual token type, because only normal vs. not-normal matters at that point.
Additional distinction between `USER_DEFINED`, `CONTROL`, etc. can and has to be done the same way, with a tokenizer "pre-matcher", at which point per-token-type decisions or actions are trivial to implement if needed.
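A sketch of what such a pre-matcher could look like (illustrative only; `tokenize_normal` is a stand-in for the regular tokenizer call, not an actual llama.cpp function):

```cpp
// Sketch: split the raw text on the literal text of the special tokens,
// emit the special pieces as single token ids, and send only the plain
// fragments through the normal tokenizer.
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

static std::vector<int> tokenize_with_specials(
        const std::string & text,
        const std::unordered_map<std::string, int> & special_to_id,
        const std::function<std::vector<int>(const std::string &)> & tokenize_normal) {
    std::vector<int> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // find the earliest (and, on ties, longest) special token from `pos`
        size_t best_pos = std::string::npos;
        size_t best_len = 0;
        int    best_id  = -1;
        for (const auto & [st, id] : special_to_id) {
            const size_t found = text.find(st, pos);
            if (found == std::string::npos) {
                continue;
            }
            if (found < best_pos || (found == best_pos && st.size() > best_len)) {
                best_pos = found;
                best_len = st.size();
                best_id  = id;
            }
        }
        if (best_pos == std::string::npos) {
            break; // no more special tokens in the remaining text
        }
        // tokenize the plain text before the special token, then emit its id
        if (best_pos > pos) {
            const auto plain = tokenize_normal(text.substr(pos, best_pos - pos));
            out.insert(out.end(), plain.begin(), plain.end());
        }
        out.push_back(best_id);
        pos = best_pos + best_len;
    }
    if (pos < text.size()) {
        const auto plain = tokenize_normal(text.substr(pos));
        out.insert(out.end(), plain.begin(), plain.end());
    }
    return out;
}
```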
Thanks for taking the time! My mental image is slightly different: as I understand it, the initial training of the tokenizer leads to an initial vocabulary, and later model refinement tries to extend this basic vocabulary with added tokens without retraining the tokenizer on the original dataset. These added tokens should be intercepted before the original tokenizer interpretation. In my mental image, special tokens somehow describe the effect these added tokens have on detokenization, i.e. whether they should be displayed or not, for example. But I might be completely off here.
Edit: here I think of `<s>`, `</s>`, `<unk>` as original CONTROL tokens and `<|im_start|>`, `<|im_end|>` as USER_DEFINED tokens which shouldn't be visible after detokenization.
I simply do not know if there is any significance to added "normal" tokens (added in fine-tuning, etc.).

From what I understand, in the tokenizer specifically (ignoring the detokenizer), normal tokens added that way would simply be tokenized one step further into a longer token instead of being represented by a sequence of smaller tokens.

If such added tokens are not integrated with the tree structure of the vocab, the user would not be able to use them.

If, however, those added tokens are properly coherent with the vocab, the tokenizer would already match them as it is.

Which, to me, sort of seems to defeat the purpose of adding "normal" tokens this way.

I might be wrong about this part, but that still means a "tokenizer preprocessor" of some sort is needed, and this PR fits that role.
I believe they have to be processed before the core tokenizer routine, in something like `prefix_match` (the only exception being the already defined core tokenizer CONTROL tokens). The `mpt` added tokens, for example (from `tokenizer.json`):
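For illustration, `added_tokens` entries in a fast-tokenizer `tokenizer.json` generally take a shape like the one below; the id, content, and field names here are illustrative, and the exact mpt entries should be checked against the real file:

```json
{
  "added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ]
}
```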