
tumme/ai-ocr-document


Intelligent Document OCR & Extraction

An advanced OCR and data extraction application that leverages Vision Language Models (VLMs) to intelligently parse and structure data from various document types. Built with Next.js on the frontend and Python FastAPI on the backend.

🚀 Features

  • Multi-Document Support: Specialized extraction schemas for:
    • 📄 Invoices: Extracts vendor, date, line items, and totals.
    • 🛂 Passports: Extracts MRZ data, personal details, and validity dates.
    • 🆔 Thai Citizen IDs: Extracts Thai/English names, ID number, and address.
    • 🏢 DBD Certificates: Parses Thai business registration details.
  • Auto-Classification: Intelligent routing that automatically detects the document type and applies the correct extraction logic.
  • Vision-First Approach: Uses Vision Language Models (like Qwen2-VL) to "see" documents, handling complex layouts, tables, and rotated pages better than traditional OCR.
  • Hybrid Processing Pipeline: Combines PyMuPDF for high-fidelity PDF rendering with a VLM for semantic understanding.
  • Real-time Statistics: View OCR processing time, AI token usage, and total latency.
  • Robust JSON Output: Returns valid JSON by running model output through json-repair, for reliable integration.
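The auto-classification routing above can be sketched as a simple dispatch table. This is an illustrative sketch only: the schema field names and the `classify` hook are hypothetical, not the project's actual API.

```python
# Hypothetical sketch of auto-classification routing: resolve the document
# type (via a classifier when "auto"), then look up its extraction schema.
# Schema contents and function names are illustrative, not the real code.

EXTRACTION_SCHEMAS = {
    "invoice":    ["vendor", "date", "line_items", "totals"],
    "passport":   ["mrz", "personal_details", "validity_dates"],
    "citizen_id": ["name_th", "name_en", "id_number", "address"],
    "dbd":        ["registration_number", "company_name", "directors"],
}

def route_document(doc_type: str, classify) -> list[str]:
    """Resolve 'auto' via a classifier callback, then return the field schema."""
    if doc_type == "auto":
        doc_type = classify()  # e.g. a VLM call that labels the page
    if doc_type not in EXTRACTION_SCHEMAS:
        raise ValueError(f"Unsupported document type: {doc_type}")
    return EXTRACTION_SCHEMAS[doc_type]

# Example: a stub classifier that always answers "invoice"
fields = route_document("auto", classify=lambda: "invoice")
print(fields)  # ['vendor', 'date', 'line_items', 'totals']
```

The same table backs the explicit `doc_type` options exposed by the API, so "Auto" and a user-selected type share one code path.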

🛠 Tech Stack

Frontend

  • Framework: Next.js 15 (React 19)
  • Styling: Tailwind CSS v4
  • Language: TypeScript
  • UI Components: Lucide React, Framer Motion

Backend

  • Framework: FastAPI (Python)
  • Image Processing: PyMuPDF (fitz), Pillow (PIL)
  • AI Integration: OpenAI SDK (compatible with LM Studio)
  • Utilities: json_repair for resilient parsing

AI Infrastructure

  • Local Inference: Designed to work with LM Studio.
  • Recommended Model: qwen/qwen3-vl-2b (or other Vision-capable models).
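Because LM Studio exposes an OpenAI-compatible endpoint, the backend can use the standard OpenAI SDK with only a `base_url` change. A configuration sketch (not executed here; the commented request shape is an assumption, and the API key is an arbitrary placeholder since LM Studio does not validate it):

```python
# Configuration sketch: pointing the OpenAI SDK at a local LM Studio server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default endpoint
    api_key="lm-studio",                  # placeholder; any string works
)

# A vision request would then pass the rendered page as a data URL, e.g.:
# client.chat.completions.create(
#     model="qwen/qwen3-vl-2b",
#     messages=[{"role": "user", "content": [
#         {"type": "text", "text": "Extract the invoice fields as JSON."},
#         {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
#     ]}],
# )
```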

📋 Prerequisites

Before you begin, ensure you have the following installed:

  • Node.js and npm (for the Next.js frontend)
  • Python 3 (for the FastAPI backend)
  • LM Studio (for local model inference)

⚙️ Installation & Setup

1. Setup AI Server (LM Studio)

  1. Download and install LM Studio.
  2. Search for and download a Vision model (e.g., Qwen2-VL-2B-Instruct).
  3. Go to the Developer Server tab (double-arrow icon).
  4. Select your downloaded model from the top dropdown.
  5. Start the server. Ensure it is listening on http://localhost:1234.

2. Setup Backend

Open a terminal and navigate to the ocr-backend directory:

cd ocr-backend

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install dependencies
pip install fastapi "uvicorn[standard]" python-multipart pymupdf pytesseract pillow openai json_repair

# Start the Backend Server
python main.py

The backend will start at http://localhost:8000.

3. Setup Frontend

Open a new terminal window and navigate to the root directory:

# Install dependencies
npm install

# Start the Development Server
npm run dev

The application will be available at http://localhost:3000.

📖 Usage

  1. Open your browser to http://localhost:3000.
  2. Select Document Type: Choose a specific type (Invoice, Passport, etc.) or leave it as "Auto" for automatic detection.
  3. Upload File: Drag and drop a PDF or Image file (PNG, JPG).
  4. View Results:
    • Markdown: A human-readable summary of the extracted data.
    • JSON: The structured data object ready for API consumption.
    • Stats: Performance metrics for the current job.

🔌 API Documentation

POST /process

Uploads a file for OCR and extraction.

Parameters:

  • file: The document file (PDF, JPG, PNG).
  • doc_type: The type of document to process. Options: auto, invoice, passport, citizen_id, dbd.

Response: Returns a JSON object containing:

  • markdown: Formatted summary string.
  • json: Structured data object corresponding to the document type.
  • stats: Processing timing and token usage information.
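A client consuming this response might look like the sketch below. The payload is a made-up sample that follows the documented field names (`markdown`, `json`, `stats`); the exact keys inside `stats` and all values are assumptions for illustration.

```python
# Parsing a /process response. The sample payload follows the documented
# top-level fields; the keys inside "stats" are assumed, not documented.
import json

sample = '''{
  "markdown": "## Invoice\\n- Vendor: ACME Co\\n- Total: 199.50",
  "json": {"vendor": "ACME Co", "total": 199.50},
  "stats": {"ocr_ms": 820, "tokens": 1435, "total_ms": 2100}
}'''

result = json.loads(sample)
print(result["json"]["vendor"])   # structured field, ready for API consumption
print(result["stats"]["tokens"])  # token usage reported for the job
```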

📂 Project Structure

├── ocr-backend/       # Python FastAPI Server
│   ├── main.py        # Core application logic & prompt definitions
│   └── ...
├── src/               # Next.js Frontend Source
│   ├── app/           # App Router pages and API routes
│   ├── components/    # React UI components
│   └── ...
├── public/            # Static assets
└── package.json       # Frontend dependencies

📜 License

This project is open-source and available under the MIT License.
