This repository contains the code for mSTR, a two-stage multilingual Scene Text Recognition (STR) system designed to scale across scripts without brittle script routing or an exploding per-script decoder.
At a high level:
- Stage-1 (mCLIP): trains a CLIP-style character image ↔ character text dual-encoder on synthetic character crops to learn a shared multilingual character embedding manifold. Training is primarily contrastive (InfoNCE) and is optionally regularized with ArcFace and Subtype supervision to reduce cross-script overlap and near-neighbor confusions.
- Stage-2 (mSequenceSpotter + mRefiner):
  - mSequenceSpotter: predicts a sequence of character embeddings for a word image and retrieves symbols by cosine similarity against a frozen prototype bank (from mCLIP).
  - mRefiner: performs lightweight autoregressive correction using (i) prefix memory from Spotter top-K hypotheses and (ii) a multi-scale visual memory to preserve spatial evidence for subtle marks (diacritics, conjunct cues, etc.).
The repo is organized to support:
- training on synthetic + fine-tuning on real data,
- 1-language / 3-language / 6-language joint settings,
- evaluation with WRR/CRR and script detection accuracy.
Contents:
- Repository layout
- Setup
- Data
- Models
- Training
- Evaluation
- Reproducing paper settings
- Adding a new language/script
- Notes for reviewers
- Citation
- License
The exact filenames may vary slightly across branches; use this section as a map of intent.
Typical structure:
```
.
├── configs/                 # YAML/JSON configs for training/eval
│   ├── train_config.*       # stage-wise training configs
│   ├── eval_config.*        # evaluation configs (datasets, ckpts, decoding)
│   └── langs/               # per-language charset JSONs (letters/diacritics/digits/punct/common)
├── model/
│   ├── mCLIP/               # dual encoder (image encoder + text encoder)
│   ├── mSeqSpotter.py       # word-image embedding decoder + retrieval head
│   └── mRefiner.py          # refinement decoder using prefix + visual memory
├── util/                    # helpers: normalization, encoding, prototype bank, metrics, logging
├── outputs/                 # logs, checkpoints, predictions, visualizations (created at runtime)
└── README.md
```
- Python 3.9+ (recommended)
- PyTorch (CUDA recommended)
- Common packages:
  `numpy`, `opencv-python`, `Pillow`, `tqdm`, `pyyaml`, etc.
- LMDB support: `lmdb`
```shell
# (Recommended) create env
python -m venv .venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows

# install deps
pip install -r requirements.txt
```

Training is GPU-heavy. The experiments were designed to keep the Stage-1 encoder relatively light to enable large batches for contrastive learning.
Reference training setup used:
- GPU: GeForce RTX 2080 Ti (DataParallel used where applicable)
- Optimizer: Adam
- LR schedule: linear warmup + cosine decay
- Batch sizes (reference): Stage-1 256, Stage-2 512 (adjust in configs based on VRAM)
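The warmup-plus-cosine schedule above can be written as a small function. This is a minimal sketch, assuming warmup to the base LR followed by decay to zero; the actual schedule and its hyperparameters live in the training configs:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp: 0 -> base_lr over warmup_steps
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this would be wrapped in an optimizer scheduler (e.g., `torch.optim.lr_scheduler.LambdaLR`), but the shape of the curve is just this function.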
Train/val in LMDB format
Training and validation are expected in LMDB format.
Recommended convention (per language):
```
LMDB/
  bn/
    train_lmdb/   # LMDB directory
    val_lmdb/
  en/
    train_lmdb/
    val_lmdb/
  ...
```
The evaluation pipeline expects a per-language folder layout with an accompanying manifest file describing each image.

Example record (as used in internal runs):

```json
{
  "id": "1a4ada6d90f6dc00014085a496bdd4d11aaa4d45",
  "path_rel": "en/cute80/images/cute80_english_47.png",
  "filename": "cute80_english_47.png",
  "gt_text": "BMW",
  "lang": "en",
  "dataset": "cute80"
}
```

Some datasets require normalization; this repo standardizes text by NFKD normalization and whitespace removal to keep evaluation consistent across scripts.
Stage-1 mCLIP is trained on synthetic character images rendered with varied fonts and augmentations.
Typical generation recipe:
- per language: ~585K synthetic character samples and ~1.5M synthetic word samples for pretraining
- fonts: multiple fonts with coverage checks per script
- augmentations: ~9–10 augmentations (blur, noise, perspective, illumination, background blend, compression, etc.)
- prompts/labels: a prompted format such as:
  - `<lang=en> <type=0> <char=A>`
  - `<lang=hi> <type=1> <char=ा>` (example subtype for a diacritic)
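Building such prompt strings is mechanical. A sketch following the template above (`lang`, `type`, and `char` field names mirror the examples; the exact template used in training may differ):

```python
def build_prompt(lang: str, subtype: int, char: str) -> str:
    """Format one character label into the prompted template shown above."""
    return f"<lang={lang}> <type={subtype}> <char={char}>"
```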
For generating synthetic character and word crops (fonts, rendering, and augmentations), we use the external generator repo:
Please follow that repository’s instructions to produce the synthetic character and word dataset(s), then point this repo’s training config to the generated output directories.
Charsets are typically stored as JSON per language with fields like:
- `letters`, `diacritics`, `digits`, `punctuation`, `common`
- optional script-specific markers (e.g., virama / nukta where relevant)
These are used for:
- building the prototype bank (mCLIP text embeddings),
- enforcing language-aware constraints in decoding (optional),
- subtype supervision (letter vs diacritic vs punctuation/digit).
A CLIP-style dual encoder trained at character level:
- Image encoder: ResNet-like backbone (kept light to support large batches)
- Text encoder: Transformer encoder over prompted tokens (QuickGELU/LayerNorm typical)
- Embedding space: L2-normalized; similarity is cosine
Losses commonly used:
- InfoNCE (symmetric image↔text contrastive)
- ArcFace (optional): angular margin separation per class (or per language-aware class sets)
- Subtype loss (optional): encourages separation between letters/diacritics/digits/punctuation
Symmetric contrastive objective:

$$
\begin{aligned}
\text{logits}_{i2t} &= s \cdot (E_{img} E_{txt}^{T}) \\
\text{logits}_{t2i} &= \text{logits}_{i2t}^{T} \\
\mathcal{L} &= \text{CE}(\text{logits}_{i2t}, y_{gt}) + \text{CE}(\text{logits}_{t2i}, y_{gt})
\end{aligned}
$$
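The symmetric objective fits in a few lines. A framework-agnostic numpy sketch, assuming L2-normalized embeddings where row `i` of each matrix is a matched image/text pair (the training code uses the PyTorch equivalent):

```python
import numpy as np

def symmetric_infonce(e_img, e_txt, scale=100.0):
    """Symmetric InfoNCE: cross-entropy over image->text and text->image logits.

    e_img, e_txt: (B, D) L2-normalized embeddings; row i of each is a matched pair.
    scale is the logit temperature s.
    """
    logits_i2t = scale * e_img @ e_txt.T          # (B, B) scaled cosine similarities
    logits_t2i = logits_i2t.T
    targets = np.arange(len(e_img))               # diagonal entries are the positives

    def cross_entropy(logits):
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(logits)), targets].mean()

    return cross_entropy(logits_i2t) + cross_entropy(logits_t2i)
```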
The outcome of Stage-1 is a frozen prototype bank: text embeddings for every character token (across all languages).
A word-image recognizer that predicts a sequence of embedding vectors (not class logits directly).
Key idea:
- decode a per-step embedding `e_t`
- compute cosine similarity to the frozen prototype bank `P` (shape: `[num_chars, D]`)
- retrieve the predicted token by `argmax(sim(e_t, P))`

Important note:
- decoding uses cosine similarity scores directly; no exp/softmax is required to pick the token (argmax over similarities).

Also supports:
- top-K hypotheses per step (used as a prefix memory input to the refiner)
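Retrieval against the prototype bank reduces to a normalized matrix product plus argmax/top-K. A minimal numpy sketch (names are illustrative, not the repo's API):

```python
import numpy as np

def retrieve_tokens(embeddings, prototypes, topk=1):
    """Cosine-similarity retrieval of token indices for each decoded step.

    embeddings: (T, D) per-step embeddings; prototypes: (num_chars, D).
    Returns (T, topk) indices, best match first.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = e @ p.T                                    # (T, num_chars) cosine scores
    order = np.argsort(-sims, axis=1)                 # no softmax needed: ranking only
    return order[:, :topk]
```

With `topk=1` this is the Spotter's greedy decode; with `topk=K` the same call yields the per-step hypotheses fed to the refiner's prefix memory.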
A lightweight autoregressive refiner that corrects Spotter outputs by fusing:
- Prefix memory: constructed from Spotter top-K token hypotheses
- Visual memory: multi-scale feature memory (e.g., FPN-like) to preserve spatial evidence for diacritics and fine marks
This stage is trained after the Spotter. Since it depends on Spotter top-K behavior, training commonly uses:
- either real Spotter outputs, or
- a simulator that mimics Spotter’s error distribution during refiner training (to decouple training stages cleanly).
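A Spotter-output simulator of the kind described here can be as simple as injecting noise into the ground-truth sequence to produce top-K hypotheses. An illustrative sketch only; a real simulator would model the Spotter's actual error distribution rather than uniform substitutions:

```python
import random

def simulate_topk(gt_tokens, vocab, k=3, sub_rate=0.15, rng=None):
    """Produce k noisy hypotheses per step: the GT token plus random distractors,
    with the GT occasionally displaced from rank 1 to mimic Spotter errors."""
    rng = rng or random.Random(0)
    hyps = []
    for tok in gt_tokens:
        distractors = [t for t in vocab if t != tok]
        cands = rng.sample(distractors, k - 1)
        # With probability sub_rate, demote the GT token to a random rank
        pos = rng.randrange(k) if rng.random() < sub_rate else 0
        cands.insert(pos, tok)
        hyps.append(cands[:k])
    return hyps
```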
Entry-point scripts vary; check the specific scripts and `configs/` in your branch.
Typical flow:
- Generate synthetic character data (per language)
- Train mCLIP with InfoNCE (+ optional ArcFace + subtype)
Example:

```shell
python main.py
```

Output:
- checkpoint(s) under `outputs/exp_name/...`

You then need to generate the prototype bank (text embeddings for all characters).
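Conceptually, building the prototype bank means encoding every charset token with the frozen mCLIP text encoder and L2-normalizing the results. A hedged sketch with a stand-in encoder (`encode_text` is a hypothetical callable; substitute the trained mCLIP text encoder):

```python
import numpy as np

def build_prototype_bank(tokens, encode_text):
    """Stack L2-normalized text embeddings for every character token.

    tokens: list of character tokens across all languages.
    encode_text: token -> (D,) embedding (e.g., the frozen mCLIP text encoder).
    """
    bank = np.stack([encode_text(t) for t in tokens])            # (num_chars, D)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)    # unit-norm rows
    return bank
```

The resulting `(num_chars, D)` matrix is frozen and shared by both Stage-2 models.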
Uses word images + GT sequences, but predicts embeddings and retrieves via prototype bank.
```shell
python train_mSeqSpotter.py
```

After training and fine-tuning on the real dataset, generate the manifest files for evaluation of mSeqSpotter and mRefiner.
Trained after Spotter; uses Spotter top-K and visual memory.
```shell
python train_mRefiner.py
```

Evaluate WRR/CRR per dataset per language (and optionally script accuracy if enabled):

```shell
python eval_mRefiner.py
```

Outputs typically include:
- per-dataset metrics (WRR/CRR)
- per-language breakdowns
Reference training strategy used in our runs:
- Pretrain on synthetic: ~1.5M samples per language
- Finetune on real: ~100K samples per language
- Batch sizes: Stage-1 256, Stage-2 512 (GPU dependent)
- Optimizer: Adam
- Schedule: linear warmup + cosine decay
- Normalization: all texts are NFKD normalized and whitespace removed
For multi-stage training feasibility under VRAM limits:
- Stage-2 is trained in two parts: Spotter → Refiner
- Refiner training may use a Spotter-output simulator; inference uses real Spotter top-K.
High-level checklist:
- Add charset JSON: `configs/charset/<lang>.json`
- Ensure font coverage: add fonts and run the coverage check (if provided)
- Generate synthetic characters for the new script
- Retrain Stage-1 to include the new charset (the prototype bank must include the new characters)
- Retrain Stage-2 in the joint setting (or fine-tune if supported)
If you use ArcFace with language-specific margins, update margins carefully to avoid over-separating scripts in the shared space.
- This repo is meant to be self-contained for training/evaluation (synthetic + real data adapters), but datasets are not redistributed.
- Key design choice: word recognition is performed in a shared embedding space and retrieves from a frozen multilingual prototype bank, avoiding script-specific classifier heads.
If you use this code, please cite:
@inproceedings{mstr2026,
  title  = {mSTR: Multilingual Scene Text Recognition in a Shared Character Embedding Space},
  author = {Harsh Lunia and Ajoy Mondal and C V Jawahar},
  year   = {2026}
}

For questions/issues:
- Open a GitHub Issue with:
  - command/config used
  - checkpoint name
  - minimal log snippet
  - dataset + language