
tumme/ai-ocr-document


Intelligent Document OCR & Extraction

An advanced OCR and data extraction application that leverages Vision Language Models (VLMs) to intelligently parse and structure data from various document types. Built with Next.js on the frontend and Python FastAPI on the backend.

🚀 Features

  • Multi-Document Support: Specialized extraction schemas for:
    • 📄 Invoices: Extracts vendor, date, line items, and totals.
    • 🛂 Passports: Extracts MRZ data, personal details, and validity dates.
    • 🆔 Thai Citizen IDs: Extracts Thai/English names, ID number, and address.
    • 🏢 DBD Certificates: Parses Thai business registration details.
  • Auto-Classification: Intelligent routing that automatically detects the document type and applies the correct extraction logic.
  • Vision-First Approach: Uses Vision Language Models (like Qwen2-VL) to "see" documents, handling complex layouts, tables, and rotated pages better than traditional OCR.
  • Hybrid Processing Pipeline: Combines PyMuPDF for high-fidelity PDF rendering with a VLM for semantic understanding.
  • Real-time Statistics: View OCR processing time, AI token usage, and total latency.
  • Robust JSON Output: Returns valid JSON by running model output through json-repair, for reliable integration.
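The auto-classification routing above can be sketched as a simple dispatch table. This is an illustrative sketch only: the schema field names and the `classify` hook are hypothetical, not the project's actual API.

```python
# Hypothetical sketch of auto-classification routing: resolve the document
# type (via a classifier when "auto"), then look up its extraction schema.
# Schema contents and function names are illustrative, not the real code.

EXTRACTION_SCHEMAS = {
    "invoice":    ["vendor", "date", "line_items", "totals"],
    "passport":   ["mrz", "personal_details", "validity_dates"],
    "citizen_id": ["name_th", "name_en", "id_number", "address"],
    "dbd":        ["registration_number", "company_name", "directors"],
}

def route_document(doc_type: str, classify) -> list[str]:
    """Resolve 'auto' via a classifier callback, then return the field schema."""
    if doc_type == "auto":
        doc_type = classify()  # e.g. a VLM call that labels the page
    if doc_type not in EXTRACTION_SCHEMAS:
        raise ValueError(f"Unsupported document type: {doc_type}")
    return EXTRACTION_SCHEMAS[doc_type]

# Example: a stub classifier that always answers "invoice"
fields = route_document("auto", classify=lambda: "invoice")
print(fields)  # ['vendor', 'date', 'line_items', 'totals']
```

The same table backs the explicit `doc_type` options exposed by the API, so "Auto" and a user-selected type share one code path.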

🛠 Tech Stack

Frontend

  • Framework: Next.js 15 (React 19)
  • Styling: Tailwind CSS v4
  • Language: TypeScript
  • UI Components: Lucide React, Framer Motion

Backend

  • Framework: FastAPI (Python)
  • Image Processing: PyMuPDF (fitz), Pillow (PIL)
  • AI Integration: OpenAI SDK (compatible with LM Studio)
  • Utilities: json_repair for resilient parsing

AI Infrastructure

  • Local Inference: Designed to work with LM Studio.
  • Recommended Model: qwen/qwen3-vl-2b (or other Vision-capable models).
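Because LM Studio exposes an OpenAI-compatible endpoint, the backend can use the standard OpenAI SDK with only a `base_url` change. A configuration sketch (not executed here; the commented request shape is an assumption, and the API key is an arbitrary placeholder since LM Studio does not validate it):

```python
# Configuration sketch: pointing the OpenAI SDK at a local LM Studio server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default endpoint
    api_key="lm-studio",                  # placeholder; any string works
)

# A vision request would then pass the rendered page as a data URL, e.g.:
# client.chat.completions.create(
#     model="qwen/qwen3-vl-2b",
#     messages=[{"role": "user", "content": [
#         {"type": "text", "text": "Extract the invoice fields as JSON."},
#         {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
#     ]}],
# )
```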

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Node.js and npm (for the Next.js frontend)
  • Python 3 (for the FastAPI backend)
  • LM Studio (for local model inference)

⚙️ Installation & Setup

1. Setup AI Server (LM Studio)

  1. Download and install LM Studio.
  2. Search for and download a Vision model (e.g., Qwen2-VL-2B-Instruct).
  3. Go to the Developer Server tab (double-arrow icon).
  4. Select your downloaded model from the top dropdown.
  5. Start the server. Ensure it is listening on http://localhost:1234.

2. Setup Backend

Open a terminal and navigate to the ocr-backend directory:

cd ocr-backend

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install dependencies
pip install fastapi "uvicorn[standard]" python-multipart pymupdf pytesseract pillow openai json_repair

# Start the Backend Server
python main.py

The backend will start at http://localhost:8000.

3. Setup Frontend

Open a new terminal window and navigate to the root directory:

# Install dependencies
npm install

# Start the Development Server
npm run dev

The application will be available at http://localhost:3000.

📖 Usage

  1. Open your browser to http://localhost:3000.
  2. Select Document Type: Choose a specific type (Invoice, Passport, etc.) or leave it as "Auto" for automatic detection.
  3. Upload File: Drag and drop a PDF or Image file (PNG, JPG).
  4. View Results:
    • Markdown: A human-readable summary of the extracted data.
    • JSON: The structured data object ready for API consumption.
    • Stats: Performance metrics for the current job.

🔌 API Documentation

POST /process

Uploads a file for OCR and extraction.

Parameters:

  • file: The document file (PDF, JPG, PNG).
  • doc_type: The type of document to process. Options: auto, invoice, passport, citizen_id, dbd.

Response: Returns a JSON object containing:

  • markdown: Formatted summary string.
  • json: Structured data object corresponding to the document type.
  • stats: Processing timing and token usage information.
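A client consuming this response might look like the sketch below. The payload is a made-up sample that follows the documented field names (`markdown`, `json`, `stats`); the exact keys inside `stats` and all values are assumptions for illustration.

```python
# Parsing a /process response. The sample payload follows the documented
# top-level fields; the keys inside "stats" are assumed, not documented.
import json

sample = '''{
  "markdown": "## Invoice\\n- Vendor: ACME Co\\n- Total: 199.50",
  "json": {"vendor": "ACME Co", "total": 199.50},
  "stats": {"ocr_ms": 820, "tokens": 1435, "total_ms": 2100}
}'''

result = json.loads(sample)
print(result["json"]["vendor"])   # structured field, ready for API consumption
print(result["stats"]["tokens"])  # token usage reported for the job
```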

📂 Project Structure

├── ocr-backend/       # Python FastAPI Server
│   ├── main.py        # Core application logic & prompt definitions
│   └── ...
├── src/               # Next.js Frontend Source
│   ├── app/           # App Router pages and API routes
│   ├── components/    # React UI components
│   └── ...
├── public/            # Static assets
└── package.json       # Frontend dependencies

📜 License

This project is open-source and available under the MIT License.
