
mSTR: Scalable Multilingual Scene Text Recognition with Positive Language-Scaling

This repository contains the code for mSTR, a two-stage multilingual Scene Text Recognition (STR) system designed to scale across scripts without brittle script routing or an ever-growing set of per-script decoder heads.

At a high level:

  • Stage-1 (mCLIP): trains a CLIP-style character image ↔ character text dual-encoder on synthetic character crops to learn a shared multilingual character embedding manifold. Training is primarily contrastive (InfoNCE) and is optionally regularized with ArcFace and Subtype supervision to reduce cross-script overlap and near-neighbor confusions.
  • Stage-2 (mSequenceSpotter + mRefiner):
    • mSequenceSpotter: predicts a sequence of character embeddings for a word image and retrieves symbols by cosine similarity against a frozen prototype bank (from mCLIP).
    • mRefiner: performs lightweight autoregressive correction using (i) prefix memory from Spotter top-K hypotheses and (ii) a multi-scale visual memory to preserve spatial evidence for subtle marks (diacritics, conjunct cues, etc.).

The repo is organized to support:

  • training on synthetic + fine-tuning on real data,
  • 1-language / 3-language / 6-language joint settings,
  • evaluation with WRR/CRR and script detection accuracy.

Repository layout

The exact filenames may vary slightly across branches; use this section as a map of intent.

Typical structure:

.
├── configs/                 # YAML/JSON configs for training/eval
│   ├── train_config.*       # stage-wise training configs
│   ├── eval_config.*        # evaluation configs (datasets, ckpts, decoding)
│   └── langs/               # per-language charset JSONs (letters/diacritics/digits/punct/common)
├── model/
│   ├── mCLIP/               # dual encoder (image encoder + text encoder)
│   ├── mSeqSpotter.py       # word-image embedding decoder + retrieval head
│   └── mRefiner.py          # refinement decoder using prefix + visual memory
├── util/                    # helpers: normalization, encoding, prototype bank, metrics, logging
├── outputs/                 # logs, checkpoints, predictions, visualizations (created at runtime)
└── README.md

Setup

Requirements

  • Python 3.9+ (recommended)
  • PyTorch (CUDA recommended)
  • Common packages: numpy, opencv-python, Pillow, tqdm, pyyaml, etc.
  • LMDB support: lmdb

Install

# (Recommended) create env
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# install deps
pip install -r requirements.txt

Hardware

Training is GPU-heavy. The experiments were designed to keep the Stage-1 encoder relatively light to enable large batches for contrastive learning.

Reference training setup used:

  • GPU: GeForce RTX 2080 Ti (DataParallel used where applicable)
  • Optimizer: Adam
  • LR schedule: linear warmup + cosine decay
  • Batch sizes (reference): Stage-1 256, Stage-2 512 (adjust in configs based on VRAM)
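The warmup-plus-cosine schedule above can be sketched as a plain function; the default base_lr, warmup_steps, and total_steps here are illustrative placeholders, not the paper's exact hyperparameters:

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=1000, total_steps=100_000):
    """Linear warmup followed by cosine decay (illustrative defaults)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps              # linear ramp-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0
```

With `base_lr=1.0` this can be passed directly as the multiplier lambda to `torch.optim.lr_scheduler.LambdaLR`.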

Data

Train/val in LMDB format

Training and validation are expected in LMDB format.

Recommended convention (per language):

LMDB/
  bn/
      train_lmdb/   # LMDB directory
      val_lmdb/
  en/
      train_lmdb/
      val_lmdb/
  ...
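A minimal sketch of reading one sample from such an LMDB, assuming the zero-padded `image-%09d` / `label-%09d` key convention that is common in STR LMDBs (not confirmed by this repo; check the dataset-building code on your branch):

```python
def lmdb_keys(index):
    # Assumed key convention (common in STR LMDBs): zero-padded 9-digit keys.
    return f"image-{index:09d}".encode(), f"label-{index:09d}".encode()

def read_sample(lmdb_dir, index):
    import lmdb  # pip install lmdb; imported lazily so lmdb_keys stays dependency-free
    env = lmdb.open(lmdb_dir, readonly=True, lock=False, readahead=False)
    img_key, lbl_key = lmdb_keys(index)
    with env.begin() as txn:
        img_bytes = txn.get(img_key)            # raw encoded image (decode with OpenCV/Pillow)
        label = txn.get(lbl_key).decode("utf-8")
    env.close()
    return img_bytes, label
```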

Datasets expected format

The evaluation pipeline expects a per-language folder layout with an accompanying manifest file that describes each image.

Example entry (as used in internal runs):

{
  "id": "1a4ada6d90f6dc00014085a496bdd4d11aaa4d45",
  "path_rel": "en/cute80/images/cute80_english_47.png",
  "filename": "cute80_english_47.png",
  "gt_text": "BMW",
  "lang": "en",
  "dataset": "cute80"
}

Some datasets require normalization; this repo standardizes text by NFKD normalization and whitespace removal to keep evaluation consistent across scripts.
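The standardization step described above amounts to a small helper; this sketch uses Python's standard unicodedata module:

```python
import unicodedata

def normalize_text(s):
    # NFKD-normalize, then drop all whitespace, mirroring the repo's
    # evaluation-time text standardization across scripts.
    s = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in s if not ch.isspace())
```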


Synthetic Data generation

Stage-1 mCLIP is trained on synthetic character images rendered with varied fonts and augmentations.

Typical generation recipe:

  • per language: ~585K synthetic character samples and ~1.5M synthetic word samples for pretraining

  • fonts: multiple fonts with coverage checks per script

  • augmentations: ~9–10 augmentations (blur, noise, perspective, illumination, background blend, compression, etc.)

  • prompts/labels: a prompted format such as:

    • <lang=en> <type=0> <char=A>
    • <lang=hi> <type=1> <char=ा> (example subtype for diacritic)
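Building such a prompted label string is straightforward; this tiny helper is an assumption about how the pieces are concatenated, based only on the examples above:

```python
def char_prompt(lang, subtype, char):
    # Prompted label format for Stage-1 text-encoder inputs,
    # e.g. "<lang=en> <type=0> <char=A>".
    return f"<lang={lang}> <type={subtype}> <char={char}>"
```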

Synthetic character and word crops (fonts, rendering, and augmentations) are produced with an external generator repository. Please follow that repository's instructions to produce the synthetic character and word dataset(s), then point this repo's training config to the generated output directories.


Charsets and language specs

Charsets are typically stored as JSON per language with fields like:

  • letters, diacritics, digits, punctuation, common
  • optional script-specific markers (e.g., virama / nukta where relevant)

These are used for:

  • building the prototype bank (mCLIP text embeddings),
  • enforcing language-aware constraints in decoding (optional),
  • subtype supervision (letter vs diacritic vs punctuation/digit).
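A sketch of flattening such a charset JSON into an ordered token list with subtype ids, as needed for the prototype bank and subtype supervision (the field names follow the list above; the subtype numbering here is an assumption):

```python
import json

def build_vocab(charset):
    """Flatten a per-language charset dict into an ordered token list plus a
    subtype id per token (letters=0, diacritics=1, digits=2, punctuation=3, common=4)."""
    order = ["letters", "diacritics", "digits", "punctuation", "common"]
    tokens, subtypes = [], []
    for sub_id, field in enumerate(order):
        for ch in charset.get(field, []):
            tokens.append(ch)
            subtypes.append(sub_id)
    return tokens, subtypes

# Hypothetical charset matching the fields listed above; real files live in configs/langs/.
example = {"letters": ["a", "b"], "diacritics": [], "digits": ["0", "1"],
           "punctuation": ["."], "common": [" "]}
```

In practice the dict would come from `json.load(open("configs/langs/<lang>.json"))`.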

Models

Stage-1: mCLIP

A CLIP-style dual encoder trained at character level:

  • Image encoder: ResNet-like backbone (kept light to support large batches)
  • Text encoder: Transformer encoder over prompted tokens (QuickGELU/LayerNorm typical)
  • Embedding space: L2-normalized; similarity is cosine

Losses commonly used:

  • InfoNCE (symmetric image↔text contrastive)
  • ArcFace (optional): angular margin separation per class (or per language-aware class sets)
  • Subtype loss (optional): encourages separation between letters/diacritics/digits/punctuation

Symmetric contrastive objective:

$$
\begin{aligned}
\text{logits}_{i2t} &= s \cdot (E_{img} E_{txt}^{\top}) \\
\text{logits}_{t2i} &= \text{logits}_{i2t}^{\top} \\
\mathcal{L} &= \text{CE}(\text{logits}_{i2t}, y_{gt}) + \text{CE}(\text{logits}_{t2i}, y_{gt})
\end{aligned}
$$
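A minimal NumPy sketch of this symmetric objective, assuming L2-normalized embeddings and a fixed logit scale s (the training code would use PyTorch with a learnable scale):

```python
import numpy as np

def symmetric_infonce(e_img, e_txt, scale=100.0):
    """Symmetric InfoNCE: row i of e_img / e_txt ([B, D], L2-normalized)
    is a matched image/text pair; matched pairs sit on the diagonal."""
    logits_i2t = scale * (e_img @ e_txt.T)   # [B, B] cosine logits
    logits_t2i = logits_i2t.T
    y = np.arange(len(e_img))                # ground-truth: pair i matches i

    def ce(logits, targets):
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(targets)), targets].mean()

    return ce(logits_i2t, y) + ce(logits_t2i, y)
```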

The outcome of Stage-1 is a frozen prototype bank: text embeddings for every character token (across all languages).


Stage-2: mSequenceSpotter

A word-image recognizer that predicts a sequence of embedding vectors (not class logits directly).

Key idea:

  • decode per-step embedding e_t
  • compute cosine similarity to frozen prototypes P (shape: [num_chars, D])
  • retrieve predicted token by argmax(sim(e_t, P))

Important note:

  • decoding uses cosine similarity scores directly; no exp/softmax is required to pick the token (argmax over similarities).

Also supports:

  • top-K hypotheses per step (used as a prefix memory input to the refiner)
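The retrieval step above can be sketched in a few lines of NumPy; note that picking tokens needs only raw cosine similarities, no softmax:

```python
import numpy as np

def retrieve_topk(e_seq, prototypes, k=3):
    """Retrieve characters for predicted embeddings by cosine similarity.

    e_seq:      [T, D] per-step embeddings from the Spotter.
    prototypes: [num_chars, D] frozen mCLIP text embeddings.
    Returns top-k prototype indices per step ([T, k]) and the argmax path ([T]).
    """
    e = e_seq / np.linalg.norm(e_seq, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = e @ p.T                            # [T, num_chars] cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]   # no exp/softmax needed for ranking
    return topk, topk[:, 0]
```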

Stage-2: mRefiner

A lightweight autoregressive refiner that corrects Spotter outputs by fusing:

  1. Prefix memory: constructed from Spotter top-K token hypotheses
  2. Visual memory: multi-scale feature memory (e.g., FPN-like) to preserve spatial evidence for diacritics and fine marks

This stage is trained after the Spotter. Since it depends on Spotter top-K behavior, training commonly uses:

  • either real Spotter outputs, or
  • a simulator that mimics Spotter’s error distribution during refiner training (to decouple training stages cleanly).
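A toy version of such an error simulator is shown below; the rates and sampling scheme are illustrative stand-ins (a real simulator would fit substitution/deletion/insertion rates to the Spotter's measured error distribution):

```python
import random

def simulate_spotter(gt_tokens, vocab, p_sub=0.1, p_del=0.02, p_ins=0.02, seed=None):
    """Corrupt a ground-truth token sequence to mimic Spotter errors."""
    rng = random.Random(seed)
    out = []
    for tok in gt_tokens:
        r = rng.random()
        if r < p_del:
            continue                        # deletion: drop the character
        if r < p_del + p_sub:
            out.append(rng.choice(vocab))   # substitution: random character
        else:
            out.append(tok)
        if rng.random() < p_ins:
            out.append(rng.choice(vocab))   # spurious insertion
    return out
```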

Training

Entry-point scripts vary; check specific scripts and configs/ in your branch.

1) Train Stage-1 (mCLIP)

Typical flow:

  1. Generate synthetic character data (per language)
  2. Train mCLIP with InfoNCE (+ optional ArcFace + subtype)

Example:

python main.py

Output:

  • checkpoint(s) under outputs/exp_name/...

After Stage-1 training, generate the prototype bank (text embeddings for all characters).
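Prototype-bank generation can be sketched as follows; `encode_text` is a hypothetical stand-in for the trained mCLIP text encoder (one prompted token string in, one embedding out):

```python
import numpy as np

def build_prototype_bank(tokens, encode_text):
    """One L2-normalized text embedding per character token, across all languages.

    encode_text: hypothetical callable, str -> 1-D embedding (the frozen
    mCLIP text encoder in the real pipeline).
    """
    bank = np.stack([np.asarray(encode_text(t), dtype=np.float64) for t in tokens])
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)  # cosine retrieval expects unit norms
    return bank  # shape [num_chars, D]
```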


2) Train Stage-2a (mSequenceSpotter)

Uses word images + GT sequences, but predicts embeddings and retrieves via prototype bank.

python train_mSeqSpotter.py

After training and fine-tuning on the real dataset, generate the manifest files for evaluating mSeqSpotter and mRefiner.


3) Train Stage-2b (mRefiner)

Trained after Spotter; uses Spotter top-K and visual memory.

python train_mRefiner.py

Evaluation

Evaluate WRR/CRR per dataset per language (and optionally script accuracy if enabled):

python eval_mRefiner.py 

Outputs typically include:

  • per-dataset metrics (WRR/CRR)
  • per-language breakdowns

Reproducing paper settings

Reference training strategy used in our runs:

  • Pretrain on synthetic: ~1.5M samples per language
  • Finetune on real: ~100K samples per language
  • Batch sizes: Stage-1 256, Stage-2 512 (GPU dependent)
  • Optimizer: Adam
  • Schedule: linear warmup + cosine decay
  • Normalization: all texts are NFKD normalized and whitespace removed

For multi-stage training feasibility under VRAM limits:

  • Stage-2 is trained in two parts: Spotter → Refiner
  • Refiner training may use a Spotter-output simulator; inference uses real Spotter top-K.

Adding a new language/script

High-level checklist:

  1. Add charset JSON: configs/charset/<lang>.json

  2. Ensure font coverage:

    • add fonts and run coverage check (if provided)
  3. Generate synthetic characters for the new script

  4. Retrain Stage-1 to include new charset (prototype bank must include new chars)

  5. Retrain Stage-2 in joint setting (or fine-tune if supported)

If you use ArcFace with language-specific margins, update margins carefully to avoid over-separating scripts in the shared space.


Notes for reviewers

  • This repo is meant to be self-contained for training/evaluation (synthetic + real data adapters), but datasets are not redistributed.
  • Key design choice: word recognition is performed in a shared embedding space and retrieves from a frozen multilingual prototype bank, avoiding script-specific classifier heads.

Citation

If you use this code, please cite:

@inproceedings{mstr2026,
  title     = {mSTR: Multilingual Scene Text Recognition in a Shared Character Embedding Space},
  author    = {Harsh Lunia and Ajoy Mondal and C. V. Jawahar},
  year      = {2026}
}

Contact

For questions/issues:

  • Open a GitHub Issue with:

    • command/config used
    • checkpoint name
    • minimal log snippet
    • dataset + language
