This repository contains the code for mSTR, a two-stage multilingual Scene Text Recognition (STR) system designed to scale across scripts without brittle script routing or an exploding per-script decoder.
At a high level:
- Stage-1 (mCLIP): trains a CLIP-style character image ↔ character text dual-encoder on synthetic character crops to learn a shared multilingual character embedding manifold. Training is primarily contrastive (InfoNCE) and is optionally regularized with ArcFace and Subtype supervision to reduce cross-script overlap and near-neighbor confusions.
- Stage-2 (mSequenceSpotter + mRefiner):
  - mSequenceSpotter: predicts a sequence of character embeddings for a word image and retrieves symbols by cosine similarity against a frozen prototype bank (from mCLIP).
  - mRefiner: performs lightweight autoregressive correction using (i) prefix memory from Spotter top-K hypotheses and (ii) a multi-scale visual memory to preserve spatial evidence for subtle marks (diacritics, conjunct cues, etc.).
The repo is organized to support:
- training on synthetic + fine-tuning on real data,
- 1-language / 3-language / 6-language joint settings,
- evaluation with WRR/CRR and script detection accuracy.
Contents:
- Repository layout
- Setup
- Data
- Models
- Training
- Evaluation
- Reproducing paper settings
- Adding a new language/script
- Notes for reviewers
- Citation
- License
The exact filenames may vary slightly across branches; use this section as a map of intent.
Typical structure:
```
.
├── configs/                 # YAML/JSON configs for training/eval
│   ├── train_config.*       # stage-wise training configs
│   ├── eval_config.*        # evaluation configs (datasets, ckpts, decoding)
│   └── langs/               # per-language charset JSONs (letters/diacritics/digits/punct/common)
├── model/
│   ├── mCLIP/               # dual encoder (image encoder + text encoder)
│   ├── mSeqSpotter.py       # word-image embedding decoder + retrieval head
│   └── mRefiner.py          # refinement decoder using prefix + visual memory
├── util/                    # helpers: normalization, encoding, prototype bank, metrics, logging
├── outputs/                 # logs, checkpoints, predictions, visualizations (created at runtime)
└── README.md
```
- Python 3.9+ (recommended)
- PyTorch (CUDA recommended)
- Common packages:
  `numpy`, `opencv-python`, `Pillow`, `tqdm`, `pyyaml`, etc.
- LMDB support: `lmdb`
```shell
# (Recommended) create env
python -m venv .venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows

# install deps
pip install -r requirements.txt
```

Training is GPU-heavy. The experiments were designed to keep the Stage-1 encoder relatively light to enable large batches for contrastive learning.
Reference training setup used:
- GPU: GeForce RTX 2080 Ti (DataParallel used where applicable)
- Optimizer: Adam
- LR schedule: linear warmup + cosine decay
- Batch sizes (reference): Stage-1 256, Stage-2 512 (adjust in configs based on VRAM)
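The warmup-plus-cosine schedule above can be written as a small function. This is a minimal sketch, assuming warmup to the base LR followed by decay to zero; the actual schedule and its hyperparameters live in the training configs:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp: 0 -> base_lr over warmup_steps
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this would be wrapped in an optimizer scheduler (e.g., `torch.optim.lr_scheduler.LambdaLR`), but the shape of the curve is just this function.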
Train/val in LMDB format
Training and validation are expected in LMDB format.
Recommended convention (per language):
```
LMDB/
  bn/
    train_lmdb/   # LMDB directory
    val_lmdb/
  en/
    train_lmdb/
    val_lmdb/
  ...
```
The evaluation pipeline expects a per-language folder layout with an accompanying manifest file describing each image.

Example record (as used in internal runs):

```json
{
  "id": "1a4ada6d90f6dc00014085a496bdd4d11aaa4d45",
  "path_rel": "en/cute80/images/cute80_english_47.png",
  "filename": "cute80_english_47.png",
  "gt_text": "BMW",
  "lang": "en",
  "dataset": "cute80"
}
```

Some datasets require normalization; this repo standardizes text by NFKD normalization and whitespace removal to keep evaluation consistent across scripts.
Stage-1 mCLIP is trained on synthetic character images rendered with varied fonts and augmentations.
Typical generation recipe:
- per language: ~585K synthetic character samples and ~1.5M synthetic word samples for pretraining
- fonts: multiple fonts with coverage checks per script
- augmentations: ~9–10 augmentations (blur, noise, perspective, illumination, background blend, compression, etc.)
- prompts/labels: a prompted format such as:
  - `<lang=en> <type=0> <char=A>`
  - `<lang=hi> <type=1> <char=ा>` (example subtype for a diacritic)
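Building such prompt strings is mechanical. A sketch following the template above (`lang`, `type`, and `char` field names mirror the examples; the exact template used in training may differ):

```python
def build_prompt(lang: str, subtype: int, char: str) -> str:
    """Format one character label into the prompted template shown above."""
    return f"<lang={lang}> <type={subtype}> <char={char}>"
```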
For generating synthetic character and word crops (fonts, rendering, and augmentations), we use the external generator repo:
Please follow that repository’s instructions to produce the synthetic character and word dataset(s), then point this repo’s training config to the generated output directories.
Charsets are typically stored as JSON per language with fields like:
- `letters`, `diacritics`, `digits`, `punctuation`, `common`
- optional script-specific markers (e.g., virama / nukta where relevant)
These are used for:
- building the prototype bank (mCLIP text embeddings),
- enforcing language-aware constraints in decoding (optional),
- subtype supervision (letter vs diacritic vs punctuation/digit).
A CLIP-style dual encoder trained at character level:
- Image encoder: ResNet-like backbone (kept light to support large batches)
- Text encoder: Transformer encoder over prompted tokens (QuickGELU/LayerNorm typical)
- Embedding space: L2-normalized; similarity is cosine
Losses commonly used:
- InfoNCE (symmetric image↔text contrastive)
- ArcFace (optional): angular margin separation per class (or per language-aware class sets)
- Subtype loss (optional): encourages separation between letters/diacritics/digits/punctuation
Symmetric contrastive objective:

$$
\begin{aligned}
\text{logits}_{i2t} &= s \cdot (E_{img} E_{txt}^{T}) \\
\text{logits}_{t2i} &= \text{logits}_{i2t}^{T} \\
\mathcal{L} &= \text{CE}(\text{logits}_{i2t}, y_{gt}) + \text{CE}(\text{logits}_{t2i}, y_{gt})
\end{aligned}
$$
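The symmetric objective fits in a few lines. A framework-agnostic numpy sketch, assuming L2-normalized embeddings where row `i` of each matrix is a matched image/text pair (the training code uses the PyTorch equivalent):

```python
import numpy as np

def symmetric_infonce(e_img, e_txt, scale=100.0):
    """Symmetric InfoNCE: cross-entropy over image->text and text->image logits.

    e_img, e_txt: (B, D) L2-normalized embeddings; row i of each is a matched pair.
    scale is the logit temperature s.
    """
    logits_i2t = scale * e_img @ e_txt.T          # (B, B) scaled cosine similarities
    logits_t2i = logits_i2t.T
    targets = np.arange(len(e_img))               # diagonal entries are the positives

    def cross_entropy(logits):
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(logits)), targets].mean()

    return cross_entropy(logits_i2t) + cross_entropy(logits_t2i)
```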
The outcome of Stage-1 is a frozen prototype bank: text embeddings for every character token (across all languages).
A word-image recognizer that predicts a sequence of embedding vectors (not class logits directly).
Key idea:
- decode a per-step embedding `e_t`
- compute cosine similarity to the frozen prototype bank `P` (shape: `[num_chars, D]`)
- retrieve the predicted token by `argmax(sim(e_t, P))`

Important note:
- decoding uses cosine similarity scores directly; no exp/softmax is required to pick the token (argmax over similarities).

Also supports:
- top-K hypotheses per step (used as a prefix memory input to the refiner)
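Retrieval against the prototype bank reduces to a normalized matrix product plus argmax/top-K. A minimal numpy sketch (names are illustrative, not the repo's API):

```python
import numpy as np

def retrieve_tokens(embeddings, prototypes, topk=1):
    """Cosine-similarity retrieval of token indices for each decoded step.

    embeddings: (T, D) per-step embeddings; prototypes: (num_chars, D).
    Returns (T, topk) indices, best match first.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = e @ p.T                                    # (T, num_chars) cosine scores
    order = np.argsort(-sims, axis=1)                 # no softmax needed: ranking only
    return order[:, :topk]
```

With `topk=1` this is the Spotter's greedy decode; with `topk=K` the same call yields the per-step hypotheses fed to the refiner's prefix memory.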
A lightweight autoregressive refiner that corrects Spotter outputs by fusing:
- Prefix memory: constructed from Spotter top-K token hypotheses
- Visual memory: multi-scale feature memory (e.g., FPN-like) to preserve spatial evidence for diacritics and fine marks
This stage is trained after the Spotter. Since it depends on Spotter top-K behavior, training commonly uses:
- either real Spotter outputs, or
- a simulator that mimics Spotter’s error distribution during refiner training (to decouple training stages cleanly).
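A Spotter-output simulator of the kind described here can be as simple as injecting noise into the ground-truth sequence to produce top-K hypotheses. An illustrative sketch only; a real simulator would model the Spotter's actual error distribution rather than uniform substitutions:

```python
import random

def simulate_topk(gt_tokens, vocab, k=3, sub_rate=0.15, rng=None):
    """Produce k noisy hypotheses per step: the GT token plus random distractors,
    with the GT occasionally displaced from rank 1 to mimic Spotter errors."""
    rng = rng or random.Random(0)
    hyps = []
    for tok in gt_tokens:
        distractors = [t for t in vocab if t != tok]
        cands = rng.sample(distractors, k - 1)
        # With probability sub_rate, demote the GT token to a random rank
        pos = rng.randrange(k) if rng.random() < sub_rate else 0
        cands.insert(pos, tok)
        hyps.append(cands[:k])
    return hyps
```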
Entry-point scripts vary; check the specific scripts and `configs/` in your branch.
Typical flow:
- Generate synthetic character data (per language)
- Train mCLIP with InfoNCE (+ optional ArcFace + subtype)
Example:

```shell
python main.py
```

Output:
- checkpoint(s) under `outputs/exp_name/...`

You then need to generate the prototype bank (text embeddings for all characters).
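Conceptually, building the prototype bank means encoding every charset token with the frozen mCLIP text encoder and L2-normalizing the results. A hedged sketch with a stand-in encoder (`encode_text` is a hypothetical callable; substitute the trained mCLIP text encoder):

```python
import numpy as np

def build_prototype_bank(tokens, encode_text):
    """Stack L2-normalized text embeddings for every character token.

    tokens: list of character tokens across all languages.
    encode_text: token -> (D,) embedding (e.g., the frozen mCLIP text encoder).
    """
    bank = np.stack([encode_text(t) for t in tokens])            # (num_chars, D)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)    # unit-norm rows
    return bank
```

The resulting `(num_chars, D)` matrix is frozen and shared by both Stage-2 models.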
Uses word images + GT sequences, but predicts embeddings and retrieves via prototype bank.
```shell
python train_mSeqSpotter.py
```

After training and fine-tuning on the real dataset, generate the manifest files for evaluation of mSeqSpotter and mRefiner.
Trained after Spotter; uses Spotter top-K and visual memory.
```shell
python train_mRefiner.py
```

Evaluate WRR/CRR per dataset per language (and optionally script accuracy if enabled):

```shell
python eval_mRefiner.py
```

Outputs typically include:
- per-dataset metrics (WRR/CRR)
- per-language breakdowns
Reference training strategy used in our runs:
- Pretrain on synthetic: ~1.5M samples per language
- Finetune on real: ~100K samples per language
- Batch sizes: Stage-1 256, Stage-2 512 (GPU dependent)
- Optimizer: Adam
- Schedule: linear warmup + cosine decay
- Normalization: all texts are NFKD normalized and whitespace removed
For multi-stage training feasibility under VRAM limits:
- Stage-2 is trained in two parts: Spotter → Refiner
- Refiner training may use a Spotter-output simulator; inference uses real Spotter top-K.
High-level checklist:
- Add charset JSON: `configs/charset/<lang>.json`
- Ensure font coverage: add fonts and run the coverage check (if provided)
- Generate synthetic characters for the new script
- Retrain Stage-1 to include the new charset (the prototype bank must include the new characters)
- Retrain Stage-2 in the joint setting (or fine-tune if supported)
If you use ArcFace with language-specific margins, update margins carefully to avoid over-separating scripts in the shared space.
- This repo is meant to be self-contained for training/evaluation (synthetic + real data adapters), but datasets are not redistributed.
- Key design choice: word recognition is performed in a shared embedding space and retrieves from a frozen multilingual prototype bank, avoiding script-specific classifier heads.
If you use this code, please cite:
@inproceedings{mstr2026,
  title  = {mSTR: Multilingual Scene Text Recognition in a Shared Character Embedding Space},
  author = {Harsh Lunia and Ajoy Mondal and C V Jawahar},
  year   = {2026}
}

For questions/issues:
- Open a GitHub Issue with:
  - command/config used
  - checkpoint name
  - minimal log snippet
  - dataset + language