An advanced OCR and data extraction application that leverages Vision Language Models (VLMs) to intelligently parse and structure data from various document types. Built with Next.js for the frontend and Python FastAPI for the backend.
- Multi-Document Support: Specialized extraction schemas for:
- 📄 Invoices: Extracts vendor, date, line items, and totals.
- 🛂 Passports: Extracts MRZ data, personal details, and validity dates.
- 🆔 Thai Citizen IDs: Extracts Thai/English names, ID number, and address.
- 🏢 DBD Certificates: Parses Thai business registration details.
- Auto-Classification: Intelligent routing that automatically detects the document type and applies the correct extraction logic.
- Vision-First Approach: Uses Vision Language Models (like Qwen2-VL) to "see" documents, handling complex layouts, tables, and rotated pages better than traditional OCR.
- Hybrid Processing Pipeline: Combines PyMuPDF for high-fidelity PDF rendering with a VLM for semantic understanding (see the sketch after this list).
- Real-time Statistics: View OCR processing time, AI token usage, and total latency.
- Robust JSON Output: Guarantees valid JSON responses using `json-repair` for reliable integration.
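As a rough illustration of how the vision-first, hybrid pipeline fits together, the sketch below renders a PDF page with PyMuPDF, hands the image to the VLM, and repairs the reply into valid JSON. This is a minimal sketch, not the project's actual code: `call_vlm` is a hypothetical helper standing in for the OpenAI-compatible request to the local model (a concrete sketch of that helper appears further down, after the stack listing), and the prompt text is illustrative.

```python
# Minimal sketch of the hybrid pipeline (illustrative, not the project's actual code).
import base64

import fitz  # PyMuPDF
from json_repair import repair_json


def render_page_as_base64(pdf_path: str, page_index: int = 0, dpi: int = 200) -> str:
    """Render one PDF page to a base64-encoded PNG for the vision model."""
    doc = fitz.open(pdf_path)
    pix = doc[page_index].get_pixmap(dpi=dpi)
    return base64.b64encode(pix.tobytes("png")).decode("ascii")


def extract_invoice(pdf_path: str) -> dict:
    """Render the page, ask the VLM for structured data, and repair the reply into JSON."""
    image_b64 = render_page_as_base64(pdf_path)
    # call_vlm is a hypothetical helper that sends the image plus an extraction
    # prompt to the local vision model and returns its raw text reply.
    raw_reply = call_vlm(image_b64, "Extract vendor, date, line items, and totals as JSON.")
    # json_repair coerces an imperfect model reply into valid JSON.
    return repair_json(raw_reply, return_objects=True)
```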
- Frontend:
  - Framework: Next.js 15 (React 19)
  - Styling: Tailwind CSS v4
  - Language: TypeScript
  - UI Components: Lucide React, Framer Motion
- Backend:
  - Framework: FastAPI (Python)
  - Image Processing: PyMuPDF (fitz), Pillow (PIL)
  - AI Integration: OpenAI SDK (compatible with LM Studio)
  - Utilities: `json_repair` for resilient parsing
- AI / Models:
  - Local Inference: Designed to work with LM Studio (see the sketch after this list).
  - Recommended Model: `qwen/qwen3-vl-2b` (or other vision-capable models).
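Because LM Studio exposes an OpenAI-compatible local server, the backend can reach the model through the standard OpenAI SDK. The snippet below is a hedged sketch of that connection, and the kind of `call_vlm` helper assumed in the pipeline sketch above; it assumes LM Studio's default port 1234 and the model identifier listed above, so adjust both to match your setup.

```python
# Sketch: pointing the OpenAI SDK at LM Studio's local, OpenAI-compatible server.
from openai import OpenAI

# LM Studio does not check the API key, but the SDK requires some value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def call_vlm(image_b64: str, prompt: str, model: str = "qwen/qwen3-vl-2b") -> str:
    """Send one base64-encoded image plus an instruction to the local vision model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content
```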
Before you begin, ensure you have the following installed:
- Download and install LM Studio.
- Search for and download a vision model (e.g., `Qwen2-VL-2B-Instruct`).
- Go to the Developer Server tab (double-arrow icon).
- Select your downloaded model from the top dropdown.
- Start the server. Ensure it is listening on `http://localhost:1234`.
Open a terminal and navigate to the `ocr-backend` directory:

```bash
cd ocr-backend

# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install dependencies
pip install fastapi "uvicorn[standard]" python-multipart pymupdf pytesseract pillow openai json_repair

# Start the backend server
python main.py
```

The backend will start at `http://localhost:8000`.
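Once the server is running you can sanity-check it before starting the frontend. FastAPI serves interactive API docs at `/docs` by default, so a request like the one below (using the `requests` package, installed separately) should return HTTP 200 unless that default has been disabled.

```python
# Quick reachability check against FastAPI's default /docs route (assumes it is not disabled).
import requests

resp = requests.get("http://localhost:8000/docs", timeout=5)
print(resp.status_code)  # expect 200 if the backend started correctly
```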
Open a new terminal window and navigate to the root directory:
```bash
# Install dependencies
npm install

# Start the development server
npm run dev
```

The application will be available at `http://localhost:3000`.
- Open your browser to `http://localhost:3000`.
- Select Document Type: Choose a specific type (Invoice, Passport, etc.) or leave it as "Auto" for automatic detection.
- Upload File: Drag and drop a PDF or image file (PNG, JPG).
- View Results:
  - Markdown: A human-readable summary of the extracted data.
  - JSON: The structured data object, ready for API consumption.
  - Stats: Performance metrics for the current job.
Uploads a file for OCR and extraction.
Parameters:
- `file`: The document file (PDF, JPG, PNG).
- `doc_type`: The type of document to process. Options: `auto`, `invoice`, `passport`, `citizen_id`, `dbd`.
Response: Returns a JSON object containing:
- `markdown`: Formatted summary string.
- `json`: Structured data object corresponding to the document type.
- `stats`: Processing timing and token usage information.
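For programmatic access, a call to the extraction endpoint looks roughly like the sketch below. The route path `/ocr` is an assumption for illustration only; check `main.py` (or the FastAPI docs page) for the actual path and field names.

```python
# Sketch: calling the backend from Python (the /ocr path is assumed; verify in main.py).
import requests

with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/ocr",  # hypothetical route
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"doc_type": "auto"},  # or: invoice, passport, citizen_id, dbd
        timeout=120,
    )
resp.raise_for_status()
result = resp.json()
print(result["markdown"])  # human-readable summary
print(result["json"])      # structured data for the detected document type
print(result["stats"])     # timing and token usage
```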
```
├── ocr-backend/        # Python FastAPI Server
│   ├── main.py         # Core application logic & prompt definitions
│   └── ...
├── src/                # Next.js Frontend Source
│   ├── app/            # App Router pages and API routes
│   ├── components/     # React UI components
│   └── ...
├── public/             # Static assets
└── package.json        # Frontend dependencies
```
This project is open-source and available under the MIT License.