A self-contained Qwen3 ByteLevel-BPE tokenizer implementation for the FunASR-Nano speech recognition model.
The FunASR-Nano tokenizer (FunASRNanoTokenizer) is a standalone implementation that:
- Has no dependency on tokenizers-cpp or HuggingFace tokenizers library
- Implements the Qwen3 ByteLevel-BPE tokenization algorithm
- Supports AddedTokens via Trie-based longest-match
- Provides streaming decoding for real-time ASR output with callbacks
- Works on multiple platforms: Desktop (Windows/Linux/macOS), Android, HarmonyOS
The tokenizer now supports streaming text output with callbacks for real-time ASR applications:
```cpp
#include "sherpa-onnx/csrc/funasr-nano-tokenizer.h"

// Callback receives text chunks as tokens are decoded
auto callback = [](const std::string &text, int64_t token_id, bool is_final) -> bool {
  std::cout << text << std::flush;  // Stream output in real time
  return true;  // Return false to stop early
};

// Decode with streaming callback
tokenizer.DecodeWithCallback(token_ids, callback);

// Or use StreamingDecodeState for manual control
StreamingDecodeState state;
for (int64_t token_id : token_ids) {
  std::string text = tokenizer.DecodeTokenStreaming(token_id, &state);
  if (!text.empty()) {
    std::cout << text << std::flush;
  }
}

// Flush any remaining bytes
std::cout << state.FlushPendingBytes();
```

The tokenizer requires three files in a directory:
```
tokenizer_dir/
├── tokenizer.json   # Contains added_tokens definitions
├── vocab.json       # Token-to-ID mapping {"token": id, ...}
└── merges.txt       # BPE merge rules (one pair per line)
```
For FunASR-Nano, these files are located in the Qwen3-0.6B subdirectory of the model package.
```
FunASRNanoTokenizer
├── Public Methods
│   ├── Encode(text) -> vector<int64_t>
│   ├── Decode(token_ids) -> string
│   ├── DecodeTokenStreaming(token_id, state) -> string
│   ├── DecodeWithCallback(token_ids, callback) -> string
│   ├── GetTokenStringStreaming(token_id, pending_bytes) -> string
│   ├── IsSpecialToken(token_id) -> bool
│   ├── IdToToken(token_id) -> string
│   ├── GetEosTokenId() -> int64_t
│   ├── GetPadTokenId() -> int64_t
│   ├── GetImEndTokenId() -> int64_t
│   └── GetImStartTokenId() -> int64_t
│
├── Callback Type
│   └── StreamingTokenCallback = function<bool(text, token_id, is_final)>
│
├── State Class
│   └── StreamingDecodeState
│       ├── PendingBytes() -> string&
│       ├── TokenCount() -> int32_t
│       ├── IncrementTokenCount()
│       ├── Reset()
│       ├── HasPendingBytes() -> bool
│       └── FlushPendingBytes() -> string
│
├── Data Structures
│   ├── AddedToken {content, id, single_word, lstrip, rstrip, normalized, special}
│   └── TrieNode {next: map<byte, node_index>, token_index}
│
└── Internal State
    ├── token2id_: map<string, int32_t>       # Token to ID mapping
    ├── id2token_: vector<string>             # ID to token mapping
    ├── merges_rank_: map<string, int32_t>    # BPE merge priorities
    ├── bpe_cache_: map<string, vector>       # BPE encoding cache
    ├── byte_to_unicode_[256]: string         # ByteLevel encoding table
    ├── unicode_to_byte_: map<string, uint8_t># ByteLevel decoding table
    ├── added_tokens_: vector<AddedToken>     # Special tokens
    ├── special_ids_: set<int32_t>            # Set of special token IDs
    └── trie_: vector<TrieNode>               # Trie for added token matching
```
Encoding pipeline:

```
Input Text
     │
     ▼
┌─────────────────────────────────────┐
│ 1. AddedToken Matching (Trie)       │  Match special tokens like <|im_start|>
│    - Longest-match using Trie       │
│    - Check single_word boundary     │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 2. Pre-tokenization (Qwen3 Pattern) │  Split text into words/subwords
│    - Handle contractions ('s, 't)   │
│    - Group letters, numbers, punct  │
│    - Handle whitespace/newlines     │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 3. ByteLevel Encoding               │  Map bytes to unicode characters
│    - Each byte → unicode char       │
│    - Handles all 256 byte values    │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 4. BPE Merging                      │  Apply byte-pair encoding
│    - Split into UTF-8 characters    │
│    - Iteratively merge best pairs   │
│    - Uses merge rank from merges.txt│
│    - Results cached for performance │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 5. Token ID Lookup                  │  Convert tokens to IDs
│    - Look up each BPE token         │
│    - Skip unknown tokens            │
└─────────────────────────────────────┘
     │
     ▼
Output: vector<int64_t>
```
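Step 4 above (iterative lowest-rank pair merging) can be sketched as follows. This is a minimal illustration, not the library's actual code: the real tokenizer keys `merges_rank_` by strings loaded from `merges.txt` and caches results in `bpe_cache_`, and the rank table shown here is hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rank table: (left, right) -> merge priority (lower = earlier)
using Rank = std::map<std::pair<std::string, std::string>, int>;

// Repeatedly merge the adjacent pair with the best (lowest) rank until
// no applicable merge remains.
std::vector<std::string> BpeMerge(std::vector<std::string> parts,
                                  const Rank &rank) {
  while (parts.size() > 1) {
    int best_rank = INT_MAX;
    size_t best_pos = 0;
    // Find the adjacent pair with the lowest merge rank
    for (size_t i = 0; i + 1 < parts.size(); ++i) {
      auto it = rank.find({parts[i], parts[i + 1]});
      if (it != rank.end() && it->second < best_rank) {
        best_rank = it->second;
        best_pos = i;
      }
    }
    if (best_rank == INT_MAX) break;  // no more applicable merges
    parts[best_pos] += parts[best_pos + 1];
    parts.erase(parts.begin() + best_pos + 1);
  }
  return parts;
}
```

With ranks such as `("h","e")=0, ("he","l")=1, ("hel","l")=2, ("hell","o")=3`, the character sequence `h e l l o` collapses to the single token `hello`; unknown pairs are simply left unmerged.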
Decoding pipeline:

```
Input: vector<int64_t>
     │
     ▼
┌─────────────────────────────────────┐
│ 1. Token Lookup                     │  Convert IDs to tokens
│    - Skip special token IDs         │
│    - id2token_ lookup               │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 2. Concatenate Tokens               │  Join all token strings
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 3. ByteLevel Decoding               │  Map unicode back to bytes
│    - unicode_to_byte_ lookup        │
│    - Reconstruct original bytes     │
└─────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────┐
│ 4. Post-processing                  │  Clean up output
│    - Remove special token strings   │
│    - Trim whitespace                │
└─────────────────────────────────────┘
     │
     ▼
Output: string (UTF-8)
```
For real-time ASR, `GetTokenStringStreaming()` handles partial UTF-8 sequences:

```cpp
std::string pending_bytes;  // Maintains state between calls

// For each token from the ASR model:
std::string text = tokenizer.GetTokenStringStreaming(token_id, &pending_bytes);
// Returns a valid UTF-8 prefix; incomplete bytes stay in pending_bytes
```

Key features:

- Accumulates bytes until a valid UTF-8 sequence is formed
- Handles incomplete multi-byte characters across token boundaries
- Replaces invalid sequences with U+FFFD (the replacement character)
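The core of this behavior is splitting a byte buffer into a complete-UTF-8 prefix and a pending tail. A minimal sketch of that split (helper names here are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence starting with lead byte b
static size_t Utf8SeqLen(unsigned char b) {
  if (b < 0x80) return 1;            // ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 2-byte sequence
  if ((b & 0xF0) == 0xE0) return 3;  // 3-byte sequence
  if ((b & 0xF8) == 0xF0) return 4;  // 4-byte sequence
  return 1;  // invalid lead byte; pass through (real code may emit U+FFFD)
}

// Remove and return the longest prefix of *pending made of complete
// UTF-8 sequences; an incomplete trailing sequence stays in *pending.
static std::string ConsumeCompletePrefix(std::string *pending) {
  size_t i = 0;
  while (i < pending->size()) {
    size_t len = Utf8SeqLen(static_cast<unsigned char>((*pending)[i]));
    if (i + len > pending->size()) break;  // sequence continues in next token
    i += len;
  }
  std::string out = pending->substr(0, i);
  pending->erase(0, i);
  return out;
}
```

For example, if one token ends with the first two bytes of a three-byte character (such as 你, `E4 BD A0`), those two bytes are held back and emitted only after the next token supplies the final byte.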
The ByteLevel algorithm maps all 256 possible byte values to printable Unicode characters:
| Byte Range | Unicode Mapping |
|---|---|
| 33-126 (printable ASCII) | Same codepoint |
| 161-172 (Latin-1 printable) | Same codepoint |
| 174-255 (Latin-1 printable) | Same codepoint |
| 0-32, 127-160, 173 | 256 + offset (mapped to higher Unicode) |
This ensures all byte sequences can be represented as valid Unicode strings in the vocabulary.
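The table can be built directly from the ranges above. This sketch mirrors the standard GPT-2-style byte-to-unicode construction; the function names are illustrative:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Encode a codepoint as UTF-8 (all mapped codepoints fit in <= 3 bytes)
static std::string CodepointToUtf8(uint32_t cp) {
  std::string s;
  if (cp < 0x80) {
    s += static_cast<char>(cp);
  } else if (cp < 0x800) {
    s += static_cast<char>(0xC0 | (cp >> 6));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    s += static_cast<char>(0xE0 | (cp >> 12));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return s;
}

// Printable bytes map to themselves; the rest map to 256 + running offset
static std::array<std::string, 256> BuildByteToUnicode() {
  std::array<std::string, 256> table;
  int offset = 0;  // counts bytes remapped above U+0100
  for (int b = 0; b < 256; ++b) {
    bool printable = (b >= 33 && b <= 126) || (b >= 161 && b <= 172) ||
                     (b >= 174 && b <= 255);
    table[b] = CodepointToUtf8(printable ? b : 256 + offset++);
  }
  return table;
}
```

Under this construction the space byte (32) maps to U+0120 ("Ġ"), which is why entries like `Ġhello` appear in `vocab.json`.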
The tokenizer implements a manual approximation of the Qwen3 pre-tokenizer regex pattern, avoiding `std::regex` because standard libraries lack `\p{L}`/`\p{N}` support.
Supported Patterns:
- English contractions: `'s`, `'t`, `'m`, `'d`, `'re`, `'ve`, `'ll`
- Letter sequences (including CJK, Japanese, Korean)
- Individual digits
- Punctuation sequences
- Whitespace handling (space, tab, newlines)
Unicode Support:
- CJK Unified Ideographs (U+4E00-U+9FFF)
- CJK Extension A (U+3400-U+4DBF)
- Hiragana/Katakana (U+3040-U+30FF)
- Hangul Syllables (U+AC00-U+D7AF)
- Latin Extended (U+00C0-U+02AF)
- Fullwidth digits (U+FF10-U+FF19)
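The ranges above translate into a simple codepoint classifier. The predicate below is an illustrative helper covering the CJK/Japanese/Korean ranges, not the library's exact code:

```cpp
#include <cassert>
#include <cstdint>

// True if cp falls in the CJK/Japanese/Korean ranges listed above
static bool IsCjkJapaneseKorean(uint32_t cp) {
  return (cp >= 0x4E00 && cp <= 0x9FFF) ||  // CJK Unified Ideographs
         (cp >= 0x3400 && cp <= 0x4DBF) ||  // CJK Extension A
         (cp >= 0x3040 && cp <= 0x30FF) ||  // Hiragana/Katakana
         (cp >= 0xAC00 && cp <= 0xD7AF);    // Hangul Syllables
}
```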
The tokenizer recognizes these special tokens from Qwen3:
| Token | Default ID | Purpose |
|---|---|---|
| `<\|im_end\|>` | 151645 | End-of-message marker |
| `<\|im_start\|>` | (from vocab) | Start-of-message marker |
| `<\|endoftext\|>` | (from vocab) | End of text (EOS) |
| `<\|pad\|>` | (from vocab) | Padding token |
These tokens are:
- Matched first during encoding (via Trie)
- Excluded from decoded output
- Tracked in the `special_ids_` set
Special tokens are stored in a Trie data structure for efficient longest-match:
```
Root
├─ '<' ─ '|' ─ 'i' ─ 'm' ─ '_' ─ 'e' ─ 'n' ─ 'd' ─ '|' ─ '>' [token_index=0]
│                          └─ 's' ─ 't' ─ 'a' ─ 'r' ─ 't' ─ '|' ─ '>' [token_index=1]
└─ ...
```
Properties:
- Byte-level matching (not character-level)
- Returns longest match found
- Supports the `single_word` constraint (word boundary checking)
Desktop (Windows/Linux/macOS):

```cpp
FunASRNanoTokenizer tokenizer("/path/to/tokenizer_dir");
```

Android (asset manager):

```cpp
FunASRNanoTokenizer tokenizer(asset_manager, "tokenizer_dir");
```

HarmonyOS (native resource manager):

```cpp
FunASRNanoTokenizer tokenizer(native_resource_manager, "tokenizer_dir");
```

Complete example:

```cpp
#include "sherpa-onnx/csrc/funasr-nano-tokenizer.h"

// Initialize
sherpa_onnx::FunASRNanoTokenizer tokenizer("./Qwen3-0.6B");

// Encode text to token IDs
std::string text = "Hello, 你好!";
std::vector<int64_t> ids = tokenizer.Encode(text);

// Decode token IDs back to text
std::string decoded = tokenizer.Decode(ids);

// Streaming decode (for real-time ASR)
std::string pending;
for (int64_t id : ids) {
  std::string chunk = tokenizer.GetTokenStringStreaming(id, &pending);
  std::cout << chunk;  // Output as soon as valid UTF-8 is available
}
```

Performance notes:

- BPE cache: computed BPE results are cached in `bpe_cache_` to avoid recomputation for repeated words.
- Trie matching: O(m) complexity for matching added tokens, where m is the token length in bytes.
- Precomputed tables: `byte_to_unicode_` and `unicode_to_byte_` are built once during initialization.
- Reserved memory: vectors use `reserve()` to minimize reallocations.
A lightweight JSON parser that handles:
- String parsing with escape sequences (`\n`, `\uXXXX`, surrogate pairs)
- Integer parsing
- Boolean parsing
- Object/array skipping

UTF-8 helper functions:

- `Utf8Next()`: parses the next UTF-8 character
- `AppendUtf8()`: encodes a codepoint to UTF-8
- `ConsumeValidUtf8Prefix()`: extracts a valid UTF-8 prefix for streaming
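One detail the embedded parser must get right is combining a `\uXXXX` surrogate pair into a single codepoint for characters above U+FFFF. A sketch of that arithmetic (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Combine a JSON \uXXXX surrogate pair into one codepoint.
// high must be in [0xD800, 0xDBFF], low in [0xDC00, 0xDFFF].
static uint32_t CombineSurrogatePair(uint32_t high, uint32_t low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, the escape pair `\uD83D\uDE00` decodes to U+1F600 (😀).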
`vocab.json`:

```json
{
  "token1": 0,
  "token2": 1,
  "Ġhello": 1234,
  ...
}
```

`merges.txt`:

```
#version: 0.2
Ġ t
Ġ a
Ġt he
...
```

Each line contains `left_token right_token` (space-separated); the rank is the line number.
`tokenizer.json`:

```json
{
  "added_tokens": [
    {
      "id": 151643,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    ...
  ]
}
```

The tokenizer uses `SHERPA_ONNX_LOGE` for error logging and `SHERPA_ONNX_EXIT(-1)` for fatal errors:
- Missing required files
- Parse failures
- Uninitialized tokenizer access
The tokenizer is not thread-safe because of the mutable `bpe_cache_`. For multi-threaded use:

- Create separate tokenizer instances per thread, OR
- Add external synchronization around `Encode()` calls
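The second option can be as simple as a mutex-guarded wrapper. This is an illustrative sketch, not part of the library: `Tokenizer` stands for any type with an `Encode(const std::string &)` method, such as `FunASRNanoTokenizer`.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Serializes Encode() calls on one shared tokenizer instance,
// protecting the mutable BPE cache inside it.
template <typename Tokenizer>
class LockedEncoder {
 public:
  explicit LockedEncoder(Tokenizer *tok) : tok_(tok) {}

  std::vector<int64_t> Encode(const std::string &text) {
    std::lock_guard<std::mutex> lock(mu_);  // guards the mutable bpe_cache_
    return tok_->Encode(text);
  }

 private:
  std::mutex mu_;
  Tokenizer *tok_;  // not owned
};
```

Per-thread instances usually scale better, since each thread then has its own BPE cache and no lock contention.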
The StreamingDecodeState class manages state for streaming token decoding:
```cpp
class StreamingDecodeState {
 public:
  StreamingDecodeState() = default;

  // Access the pending bytes buffer
  std::string &PendingBytes();
  const std::string &PendingBytes() const;

  // Get the count of tokens decoded so far
  int32_t TokenCount() const;

  // Increment the token counter
  void IncrementTokenCount();

  // Reset state for a new sequence
  void Reset();

  // Check whether there are pending bytes
  bool HasPendingBytes() const;

  // Flush remaining bytes as replacement characters (U+FFFD)
  std::string FlushPendingBytes();

 private:
  std::string pending_bytes_;
  int32_t token_count_ = 0;
};
```

Usage:

```cpp
StreamingDecodeState state;

// Process each token
for (int64_t token_id : generated_tokens) {
  std::string text = tokenizer.DecodeTokenStreaming(token_id, &state);
  if (!text.empty()) {
    std::cout << text << std::flush;
  }
}

// Flush any remaining incomplete UTF-8 sequences
std::string remaining = state.FlushPendingBytes();
if (!remaining.empty()) {
  std::cout << remaining;
}
```

The callback type used for streaming decoding:
```cpp
using StreamingTokenCallback =
    std::function<bool(const std::string &text, int64_t token_id, bool is_final)>;
```

| Parameter | Type | Description |
|---|---|---|
| `text` | `const std::string &` | Decoded text chunk from this token |
| `token_id` | `int64_t` | The token ID that was decoded |
| `is_final` | `bool` | True if this is the last token in the sequence |
- Return `true` to continue decoding
- Return `false` to stop decoding early
```cpp
auto callback = [](const std::string &text, int64_t token_id, bool is_final) -> bool {
  std::cout << text << std::flush;
  if (is_final) {
    std::cout << std::endl;
  }
  // Stop if we encounter a specific token
  if (token_id == STOP_TOKEN_ID) {
    return false;  // Stop early
  }
  return true;  // Continue
};

std::string full_text = tokenizer.DecodeWithCallback(token_ids, callback);
```

Download the FunASR-Nano model package:

```shell
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-funasr-nano-int8-2025-12-30.tar.bz2
tar xvf sherpa-onnx-funasr-nano-int8-2025-12-30.tar.bz2
```

The tokenizer files are in `sherpa-onnx-funasr-nano-int8-2025-12-30/Qwen3-0.6B/`.