🙂 Fast state-of-the-art tokenizers for Ruby
Add this line to your application’s Gemfile:
gem "tokenizers"

Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.ids
encoded.tokens

Decode
tokenizer.decode(encoded.ids)

Load a tokenizer from files
tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test
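
Once the build and tests pass, a quick way to try the encode and decode calls together is a small round-trip helper. The sketch below is only an illustration: `round_trip` is a hypothetical name and not part of the gem. It assumes only what the usage section shows, namely that the tokenizer responds to `encode` (returning an object with `ids` and `tokens`) and `decode`.

```ruby
# Hypothetical helper, not part of the tokenizers gem.
# `tokenizer` is assumed to respond to `encode(text)` (returning an object
# with `ids` and `tokens`) and to `decode(ids)`, as in the usage section.
def round_trip(tokenizer, text)
  encoded = tokenizer.encode(text)
  {
    ids: encoded.ids,                        # integer token ids
    tokens: encoded.tokens,                  # string pieces the text was split into
    decoded: tokenizer.decode(encoded.ids)   # text rebuilt from the ids
  }
end

# Example call (needs the gem installed and network access to fetch the model):
#   require "tokenizers"
#   tokenizer = Tokenizers.from_pretrained("bert-base-cased")
#   round_trip(tokenizer, "I can feel the magic, can you?")
```

Because the helper only relies on `encode` and `decode`, it should work the same way with a tokenizer loaded from files.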