mpu commented Aug 31, 2023

Motivation

SentencePiece ensures that decode(encode(x)) == x, which is a very useful property. Another property we may want is decode(encode(x) + encode(y)) == x + y, which is especially useful when shuffling token streams. In practice, if --add_dummy_prefix was set during training (as it is by default), this property does not hold, and that has caused some woes. This PR proposes to tweak the SentencePieceProcessor object so that the user can override the add_dummy_prefix setting baked into the model. The Python API is modified accordingly to expose the new capability.
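To make the failing property concrete, here is a minimal sketch using a toy tokenizer rather than the real library. The `encode` and `decode` functions below are hypothetical stand-ins that only mimic the whitespace escaping and dummy-prefix handling; they do not perform real subword segmentation or any other normalization.

```python
SPACE = "\u2581"  # "▁", the meta symbol SentencePiece uses for whitespace


def encode(text, add_dummy_prefix=True):
    """Toy encoder (assumption, not the real API): escape spaces,
    optionally prepend a dummy prefix, then split before each "▁"."""
    normalized = text.replace(" ", SPACE)
    if add_dummy_prefix:
        normalized = SPACE + normalized
    pieces = []
    for ch in normalized:
        if ch == SPACE or not pieces:
            pieces.append(ch)   # start a new piece
        else:
            pieces[-1] += ch    # extend the current piece
    return pieces


def decode(pieces, add_dummy_prefix=True):
    """Toy decoder: join pieces, unescape spaces, drop one leading space
    if a dummy prefix is assumed to have been added."""
    text = "".join(pieces).replace(SPACE, " ")
    if add_dummy_prefix and text.startswith(" "):
        text = text[1:]
    return text


x, y = "hello", "world"

# Round trip holds with the dummy prefix.
print(decode(encode(x)) == x)  # True

# Concatenating token streams inserts a spurious space.
print(repr(decode(encode(x) + encode(y))))  # 'hello world'

# Without the dummy prefix, concatenation behaves as expected.
print(repr(decode(encode(x, False) + encode(y, False), False)))  # 'helloworld'
```

In this toy model, every encoded chunk starts with "▁", so the second chunk's prefix survives concatenation and decodes to a space that was never in the input.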

Implementation

In C++:

  • We add a new argument to Normalizer::Normalize to control whether a prefix space must be inserted.
  • We add a field to SentencePieceProcessor that controls the prefix space logic. It is used when calling into the normalizer, and in the Decode function to find out if we need to drop a leading whitespace or not. The field is accessed and modified by two new getter/setter functions.
  • Whenever a model is loaded, we reset the field to the setting found in the normalizer spec protobuf.

In Python:

  • The getter and setter functions for the SentencePieceProcessor flag are exposed.
  • We add two new functions, EncodeNoDummyPrefix and DecodeNoDummyPrefix, which perform a single encode/decode call while ensuring that no dummy whitespace prefix is added or removed.
  • For the SWIG-generated C++ to compile locally, I had to add a couple of size_t annotations.
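The overall shape of the change described above can be sketched in Python as follows. This is only an illustration of the design, not the PR's code: `ToyProcessor` and all of its method names are hypothetical, tokenization is reduced to one piece per character, and it assumes the NoDummyPrefix helpers restore the previous setting after the call.

```python
from contextlib import contextmanager

SPACE = "\u2581"  # "▁", the whitespace meta symbol


class ToyProcessor:
    """Hypothetical stand-in for SentencePieceProcessor (not the real API)."""

    def __init__(self):
        self._add_dummy_prefix = True  # runtime field controlling the prefix logic

    def load(self, spec_add_dummy_prefix):
        # Loading a model resets the field to the value stored in the model spec.
        self._add_dummy_prefix = bool(spec_add_dummy_prefix)

    def get_add_dummy_prefix(self):
        return self._add_dummy_prefix

    def set_add_dummy_prefix(self, value):
        self._add_dummy_prefix = bool(value)

    def encode(self, text):
        # The "normalizer" consults the field to decide whether to add a prefix.
        normalized = text.replace(" ", SPACE)
        if self._add_dummy_prefix:
            normalized = SPACE + normalized
        return list(normalized)  # one piece per character, for simplicity

    def decode(self, pieces):
        # Decoding consults the same field to decide whether to drop a leading space.
        text = "".join(pieces).replace(SPACE, " ")
        if self._add_dummy_prefix and text.startswith(" "):
            text = text[1:]
        return text

    @contextmanager
    def _override(self, value):
        # Temporarily change the field, restoring it even if the call raises.
        saved = self._add_dummy_prefix
        self._add_dummy_prefix = value
        try:
            yield
        finally:
            self._add_dummy_prefix = saved

    def encode_no_dummy_prefix(self, text):
        with self._override(False):
            return self.encode(text)

    def decode_no_dummy_prefix(self, pieces):
        with self._override(False):
            return self.decode(pieces)


sp = ToyProcessor()
sp.load(True)
pieces = sp.encode_no_dummy_prefix("ab") + sp.encode_no_dummy_prefix("cd")
print(sp.decode_no_dummy_prefix(pieces))  # abcd
print(sp.get_add_dummy_prefix())          # True (setting restored)
```

Keeping the override in a single field that both encoding and decoding read is what keeps the two directions consistent; resetting it on load avoids carrying a stale override over to a different model.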

The implementation may not suit your taste, in which case I would be happy to rework it.

Correctness

I was unable to find information on how to run tests, so I only made sure everything was compiling and checked that the new Python API worked as expected.

Screenshot 2023-08-31 at 12 57 30

mpu (Author) commented Oct 10, 2023

ping!

taku910 (Collaborator) commented Jan 16, 2024

Thank you for the contribution.

The amount of code to change one parameter of normalization seemed too large and not very general. We have added a method that allows you to change another parameter.

https://github.com/google/sentencepiece/blob/master/python/test/sentencepiece_test.py#L860

In any case, rewriting the immutable parameters on the fly is not recommended.

taku910 closed this Jan 16, 2024