Important Links: PyPI Library | PyPI Lite Library (tokeniser-py-lite) | Lite Library GitHub (tokeniser-py-lite) | Demo (HF Spaces) | Complete repo (unchunked) - HF | Complete repo (chunked) - GitHub | Important Files - GitHub
A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme. This tokeniser is based on a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.
- Tokeniser built on a vocabulary of 131,072 tokens
- Two versions of vocab:
  - 0.5B: Validation-only data
  - 1B: Validation + Test data
- Token vocab built via a custom algorithm — no Byte Pair Encoding (BPE)
- Tokenisation logic includes:
- Token lookup from pre-generated token map
- Dynamic programming-based segmentation for out-of-vocab tokens
- One-hot encoding (NumPy or PyTorch)
- Visualisation utilities for tokens and token IDs
- Lightweight JSON format for token maps & token count maps
- Ready for integration into any LLM pre-tokenisation pipeline
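The dynamic-programming segmentation mentioned above can be sketched as a word-break-style DP that splits an out-of-vocab string into the fewest in-vocab pieces, falling back to single characters. This is a minimal illustration, not the library's actual implementation; the `segment` function and the toy vocabulary are hypothetical stand-ins for the real token map.

```python
def segment(text: str, vocab: set) -> list:
    """Split `text` into the fewest vocab pieces; single chars as fallback."""
    n = len(text)
    # best[i] = (token_count, segmentation) for the prefix text[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                cand = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n][1]

# Toy vocabulary standing in for the real 131,072-token map
toy_vocab = {"token", "iser", "tok", "en", "is", "er"}
print(segment("tokeniser", toy_vocab))  # → ['token', 'iser']
```

Because every single character is accepted as a fallback piece, the DP always finds a segmentation, and it prefers longer in-vocab pieces whenever they reduce the total token count.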
Note: Due to GitHub's LFS file-size constraints, the files chunked to under 2 GB are stored on Hugging Face; files chunked to under 100 MB are available on GitHub.
```
pip install tokeniser-py
```

```python
from tokeniser import Tokeniser

t = Tokeniser()
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)
```

Use `t.one_hot_tokens(token_ids)` for NumPy-based one-hot encoding, or pass `op='torch'` for PyTorch.
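To illustrate what the one-hot step produces, here is a minimal NumPy sketch. The `one_hot` helper below is hypothetical, standing in for the library's `t.one_hot_tokens`; it only shows the shape and layout of the output.

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    """Return a (len(token_ids), vocab_size) one-hot matrix."""
    out = np.zeros((len(token_ids), vocab_size), dtype=np.uint8)
    # Set a 1 at each row's token-ID column via advanced indexing
    out[np.arange(len(token_ids)), token_ids] = 1
    return out

# Toy example: 3 tokens over a vocabulary of size 4
m = one_hot([2, 0, 1], vocab_size=4)
print(m)
```

Each row has exactly one `1`, at the column given by that token's ID; for the real vocabulary the row width would be 131,072.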
All token maps and token counts are generated from the SlimPajama dataset by Cerebras.
- `ordered_tokenizer_1b_val_test_data.json` — Ordered tokens (1B data)
- `unordered_tokenizer_1b_val_test_data.json` — Unordered tokens (1B)
- `count_tokenizer_1b_val_test_data.json` — Token counts (1B)
- (Similar structure for the 0.5B val-only version)
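As a sketch of working with the lightweight JSON format, a toy map can be written and read back with the standard library. The token-to-ID layout shown here is an assumption about the files' structure, and the toy map stands in for the real 131,072-entry vocabularies.

```python
import json
import os
import tempfile

# Toy stand-in for a token map; layout (token -> integer ID) is assumed
toy_map = {"the": 0, "tok": 1, "en": 2}

path = os.path.join(tempfile.gettempdir(), "toy_token_map.json")
with open(path, "w") as f:
    json.dump(toy_map, f)

# Reloading gives back a plain dict ready for lookup during tokenisation
with open(path) as f:
    token_map = json.load(f)

print(token_map["tok"])
```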
This tokeniser was built from scratch, before studying existing algorithms like BPE. It is designed with the intent to understand, innovate, and compare with existing solutions from first principles.
Some parts may overlap with BPE/WordPiece in spirit — but the core algorithm was independently designed.
Feel free to contribute anything via GitHub.
MIT License