Support overriding the model add_dummy_prefix setting
#908
Motivation
SentencePiece makes sure that `decode(encode(x)) == x`, and that is a very useful feature. One other property we may want to have is `decode(encode(x) + encode(y)) == x + y`. This is especially useful when shuffling token streams. In practice, if `--add_dummy_prefix` was set during training (as it is by default), this property does not hold, and it has been the cause of some woes. This PR proposes to tweak the `SentencePieceProcessor` object so that it provides the user a way to override the `add_dummy_prefix` setting burnt into the model. The Python API is suitably modified to expose the new capability.
Implementation
In C++:
- A change to `Normalizer::Normalize` to control whether a prefix space must be inserted.
- A field in `SentencePieceProcessor` that controls the prefix space logic. It is used when calling into the normalizer, and in the `Decode` function to find out whether a leading whitespace must be dropped. The field is accessed and modified by two new getter/setter functions.
In Python:
- The getter and setter for the `SentencePieceProcessor` flag are exposed.
- `EncodeNoDummyPrefix` and `DecodeNoDummyPrefix` make one encode/decode call, making sure that no dummy whitespace prefix is added or deleted.
- Changes to `size_t` annotations.
The implementation may not suit your taste, in which case I would be happy to rework it.
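To make the motivation and the design concrete, here is a minimal pure-Python sketch. It does not use the real sentencepiece library or the code in this PR: `ToyProcessor`, its one-piece-per-character tokenization, and every method name below are hypothetical stand-ins, chosen only to mirror the idea of a model-level setting, a runtime override with a getter/setter, and one-shot "no dummy prefix" calls.

```python
SPACE = "\u2581"  # the meta symbol SentencePiece uses to mark whitespace


class ToyProcessor:
    """Toy stand-in for a processor. NOT the real sentencepiece API."""

    def __init__(self, model_add_dummy_prefix=True):
        # Setting fixed at training time (assumed default: True).
        self._model_add_dummy_prefix = model_add_dummy_prefix
        # Runtime override; None means "follow the model setting".
        self._override = None

    def set_add_dummy_prefix(self, value):
        self._override = value

    def get_add_dummy_prefix(self):
        if self._override is None:
            return self._model_add_dummy_prefix
        return self._override

    def encode(self, text):
        # "Normalization": mark spaces, then optionally add the dummy prefix.
        normalized = text.replace(" ", SPACE)
        if self.get_add_dummy_prefix():
            normalized = SPACE + normalized
        return list(normalized)  # toy tokenization: one piece per character

    def decode(self, pieces):
        text = "".join(pieces).replace(SPACE, " ")
        # Drop the leading whitespace only when a dummy prefix is in effect.
        if self.get_add_dummy_prefix() and text.startswith(" "):
            text = text[1:]
        return text

    def encode_no_dummy_prefix(self, text):
        # One-shot call: disable the prefix, then restore the previous state.
        saved, self._override = self._override, False
        try:
            return self.encode(text)
        finally:
            self._override = saved

    def decode_no_dummy_prefix(self, pieces):
        saved, self._override = self._override, False
        try:
            return self.decode(pieces)
        finally:
            self._override = saved


if __name__ == "__main__":
    x, y = "hello", "world"
    sp = ToyProcessor()

    print(sp.decode(sp.encode(x)) == x)                     # True
    print(sp.decode(sp.encode(x) + sp.encode(y)) == x + y)  # False

    sp.set_add_dummy_prefix(False)
    print(sp.decode(sp.encode(x) + sp.encode(y)) == x + y)  # True
```

With the dummy prefix active, the second stream contributes its own leading `▁`, so the concatenation decodes to `hello world` instead of `helloworld`. Once the override is set, nothing is added on encode or removed on decode, and the concatenation property holds in this toy model.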
Correctness
I was unable to find information on how to run tests, so I only made sure everything was compiling and checked that the new Python API worked as expected.