Important Links: PyPI Library | PyPI Lite Library (tokeniser-py-lite) | Lite Library GitHub (tokeniser-py-lite) | Demo (HF Spaces) | Complete repo (unchunked) - HF | Complete repo (chunked) - GitHub | Important Files - GitHub
A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme. This tokeniser is based on a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.
- Tokeniser built on a vocabulary of 131,072 tokens
- Two versions of vocab:
  - 0.5B: Validation-only data
  - 1B: Validation + Test data
- Token vocab built via a custom algorithm — no Byte Pair Encoding (BPE)
- Tokenisation logic includes:
- Token lookup from pre-generated token map
- Dynamic programming-based segmentation for out-of-vocab tokens
- One-hot encoding (NumPy or PyTorch)
- Visualisation utilities for tokens and token IDs
- Lightweight JSON format for token maps & token count maps
- Ready for integration into any LLM pre-tokenisation pipeline
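The dynamic-programming segmentation mentioned above can be sketched as a word-break-style DP that splits an out-of-vocab string into the fewest in-vocab pieces, falling back to single characters. This is a minimal illustration, not the library's actual implementation; the `segment` function and the toy vocabulary are hypothetical stand-ins for the real token map.

```python
def segment(text: str, vocab: set) -> list:
    """Split `text` into the fewest vocab pieces; single chars as fallback."""
    n = len(text)
    # best[i] = (token_count, segmentation) for the prefix text[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                cand = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n][1]

# Toy vocabulary standing in for the real 131,072-token map
toy_vocab = {"token", "iser", "tok", "en", "is", "er"}
print(segment("tokeniser", toy_vocab))  # → ['token', 'iser']
```

Because every single character is accepted as a fallback piece, the DP always finds a segmentation, and it prefers longer in-vocab pieces whenever they reduce the total token count.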
Note: Due to GitHub's LFS file-size constraints, the files chunked to under 2 GB are stored on Hugging Face; files chunked to under 100 MB are available on GitHub.
```
pip install tokeniser-py
```

```python
from tokeniser import Tokeniser

t = Tokeniser()
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)
```

Use `t.one_hot_tokens(token_ids)` for NumPy-based one-hot encoding, or pass `op='torch'` for PyTorch.
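To illustrate what the one-hot step produces, here is a minimal NumPy sketch. The `one_hot` helper below is hypothetical, standing in for the library's `t.one_hot_tokens`; it only shows the shape and layout of the output.

```python
import numpy as np

def one_hot(token_ids, vocab_size):
    """Return a (len(token_ids), vocab_size) one-hot matrix."""
    out = np.zeros((len(token_ids), vocab_size), dtype=np.uint8)
    # Set a 1 at each row's token-ID column via advanced indexing
    out[np.arange(len(token_ids)), token_ids] = 1
    return out

# Toy example: 3 tokens over a vocabulary of size 4
m = one_hot([2, 0, 1], vocab_size=4)
print(m)
```

Each row has exactly one `1`, at the column given by that token's ID; for the real vocabulary the row width would be 131,072.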
All token maps and token counts are generated from the SlimPajama dataset by Cerebras.
- `ordered_tokenizer_1b_val_test_data.json` — Ordered tokens (1B data)
- `unordered_tokenizer_1b_val_test_data.json` — Unordered tokens (1B)
- `count_tokenizer_1b_val_test_data.json` — Token counts (1B)
- (Similar structure for the 0.5B val-only version)
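As a sketch of working with the lightweight JSON format, a toy map can be written and read back with the standard library. The token-to-ID layout shown here is an assumption about the files' structure, and the toy map stands in for the real 131,072-entry vocabularies.

```python
import json
import os
import tempfile

# Toy stand-in for a token map; layout (token -> integer ID) is assumed
toy_map = {"the": 0, "tok": 1, "en": 2}

path = os.path.join(tempfile.gettempdir(), "toy_token_map.json")
with open(path, "w") as f:
    json.dump(toy_map, f)

# Reloading gives back a plain dict ready for lookup during tokenisation
with open(path) as f:
    token_map = json.load(f)

print(token_map["tok"])
```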
This tokeniser was built from scratch, before studying existing algorithms like BPE. It is designed with the intent to understand, innovate, and compare with existing solutions from first principles.
Some parts may overlap with BPE/WordPiece in spirit — but the core algorithm was independently designed.
Feel free to contribute anything via GitHub.
MIT License