A simple sentencepiece encoder and decoder.
Note: This is not a new sentencepiece toolkit; it takes Google's sentencepiece model as input and encodes strings to ids/pieces or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.
```bash
pip install simple-sentencepiece
```
The usage is very similar to sentencepiece; it provides the same encode and decode interfaces.
```python
from ssentencepiece import Ssentencepiece

# bpe.vocab comes from a trained BPE model; see Google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")
# or use one of the default models bundled with this package (see the table below)
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")

# encode to ids (accepts a str or a list of strs)
# a list of strs is encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# encode to string pieces (accepts a str or a list of strs)
# a list of strs is encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# decode (accepts a list of ids or a list of lists of ids)
# a list of lists of ids is decoded in parallel
res = ssp.decode([1, 2, 3, 4, 5])
res = ssp.decode([[1, 2, 3, 4], [4, 5, 6, 7]])

# get the vocabulary size
res = ssp.vocab_size()

# piece to id (accepts a str or a list of strs)
id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])

# id to piece (accepts an int or a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])
```
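Putting these interfaces together, here is a minimal round-trip sketch. It assumes the bundled gigaspeech-500 model and that decoding is lossless for this input (the usual behavior for a sentencepiece vocabulary that covers plain uppercase text):

```python
from ssentencepiece import Ssentencepiece

# assumes the bundled gigaspeech-500 model (see the table below)
ssp = Ssentencepiece("gigaspeech-500")

text = "HELLO WORLD"
ids = ssp.encode(text)                    # list of ints, one per piece
pieces = ssp.encode(text, out_type=str)   # subword pieces; "▁" marks a word boundary

# for a lossless vocabulary, decoding recovers the original text
print(ssp.decode(ids) == text)

# ids and pieces map to each other in both directions
print(ssp.id_to_piece(ids) == pieces)
print(ssp.piece_to_id(pieces) == ids)
```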
The default models bundled with the package:

| Model Name | Description | Link |
|---|---|---|
| alphabet-33 | <blk>, <unk>, <sos>, <eos>, <pad>, ', ▁ and the 26 English letters. | alphabet-33 |
| librispeech-500 | 500 unigram pieces trained on LibriSpeech. | librispeech-500 |
| librispeech-5000 | 5000 unigram pieces trained on LibriSpeech. | librispeech-5000 |
| gigaspeech-500 | 500 unigram pieces trained on GigaSpeech. | gigaspeech-500 |
| gigaspeech-2000 | 2000 unigram pieces trained on GigaSpeech. | gigaspeech-2000 |
| gigaspeech-5000 | 5000 unigram pieces trained on GigaSpeech. | gigaspeech-5000 |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-3876 |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-6876 |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-8481 |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-5776 |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-8776 |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-10381 |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-yue-9761 |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-yue-11661 |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in the SenseVoice ASR model; supports Chinese, Japanese, Cantonese, English, and Korean. | chn_jpn_yue_eng_ko_spectok.bpe |
Note: The counts 3500, 6500, and 8105 come from 通用规范汉字表 (the Table of General Standard Chinese Characters).
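As a quick illustration of the zh-en models, the sketch below encodes mixed Chinese/English text with the bundled zh-en-10381 model. The exact segmentation depends on the trained vocabulary; since the vocab lists individual Chinese characters, Chinese text is expected to split per character, with the 256 fallback bytes covering anything outside the character set:

```python
from ssentencepiece import Ssentencepiece

# assumes the bundled zh-en-10381 model
# (8105 Chinese characters + 256 fallback bytes + 2000 English unigram pieces)
ssp = Ssentencepiece("zh-en-10381")

# Chinese is expected to segment per character, English into unigram pieces;
# out-of-vocabulary characters fall back to byte pieces
pieces = ssp.encode("你好 HELLO WORLD", out_type=str)
print(pieces)

# batch input (a list of strs) is encoded in parallel
ids = ssp.encode(["你好 HELLO WORLD", "再见 GOODBYE"])
print(ids)
```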