An installable progressive web application (PWA) for running privacy-focused OCR using LM Studio (local) or Google Gemini (cloud).
This app streamlines converting documents (PDFs and images) into structured Markdown and HTML using modern Vision Language Models (VLMs).
- LM Studio (Local): Connects to your local LM Studio instance via its OpenAI-compatible server. Privacy-first: no data leaves your machine.
- Google Gemini (Cloud): Uses Google's generative AI models for high-accuracy OCR. Requires an API key.
- Open Folder: Work directly with files on your local file system.
- Sidecar Output: OCR results are saved as `.json` sidecar files next to your images (e.g., `image.png` -> `image.json`).
- Markdown Export: Automatically generates Markdown files for easy reading and documentation.
- Batch OCR: Process multiple selected files in a queue.
- Smart Skip: "Skip Processed" option prevents re-running OCR on already analyzed files.
- PDF Tools:
  - PDF to Images: Convert PDF pages into individual JPEGs for optimal OCR accuracy.
  - Split Pages: Automatically split double-page scans into single pages.
- Side-by-Side View: Toggle between:
  - Original Image
  - Annotated Image (with bounding boxes for detected text/layout)
  - Markdown/HTML (structured text output)
- Live Updates: Changes in storage are immediately reflected in the viewer.
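The sidecar naming and the "Skip Processed" option described above can be sketched as two small pure functions. This is only an illustration: `sidecarNameFor` and `filesNeedingOcr` are hypothetical names, not taken from the app's source, and the sketch assumes a file counts as processed when its `.json` sidecar appears in the same folder listing.

```typescript
// Hypothetical helpers for illustration; the app's real logic may differ.

/** Replace the final extension with ".json", e.g. "image.png" -> "image.json". */
function sidecarNameFor(fileName: string): string {
  const slash = fileName.lastIndexOf("/");
  const dot = fileName.lastIndexOf(".");
  // Treat the dot as an extension separator only if it sits inside the base name
  // and is not its first character (so ".hidden" keeps its full name).
  const hasExtension = dot > slash + 1;
  return (hasExtension ? fileName.slice(0, dot) : fileName) + ".json";
}

/** Return the non-sidecar files to OCR, optionally skipping those that already have a sidecar. */
function filesNeedingOcr(allNames: string[], skipProcessed: boolean): string[] {
  const documents = allNames.filter((n) => !n.toLowerCase().endsWith(".json"));
  if (!skipProcessed) return documents;
  const existing = new Set(allNames);
  // Assumption: a document is "processed" when its sidecar exists in the listing.
  return documents.filter((n) => !existing.has(sidecarNameFor(n)));
}
```

For a folder containing `a.png`, `a.json`, and `b.jpg`, `filesNeedingOcr` with skipping enabled would return only `b.jpg`.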
- LM Studio: Download and install the desktop app from the LM Studio website.
- A Vision Model: Search for and download a vision-capable model (like `Qwen-VL`, `Llava`, `BakLLaVA`, or `Gemma-3-Vision`).
- Local Server: Start the LM Studio Local Server (usually port `1234`).
- Google API Key: Get one from Google AI Studio.
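One way to confirm the local server prerequisite is met is to ask the OpenAI-compatible endpoint for its model list. The sketch below is an assumption about how such a check could look, not the app's own code. `modelsUrl`, `extractModelIds`, and `listModelIds` are hypothetical names, and the fetch function is injected so the logic can be exercised without a running server.

```typescript
// Minimal shape of the fetch function this sketch needs (the global fetch fits it).
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

/** Build the model-listing URL, tolerating a trailing slash on the base URL. */
function modelsUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "") + "/v1/models";
}

/** Pull model ids out of an OpenAI-style `{ data: [{ id: ... }] }` payload. */
function extractModelIds(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) return [];
  const data = (payload as { data?: unknown }).data;
  if (!Array.isArray(data)) return [];
  return data
    .map((m) => (typeof m === "object" && m !== null ? (m as { id?: unknown }).id : undefined))
    .filter((id): id is string => typeof id === "string");
}

/** Resolve with the model ids the server reports; reject if it is unreachable or errors. */
async function listModelIds(baseUrl: string, fetchFn: FetchLike): Promise<string[]> {
  const res = await fetchFn(modelsUrl(baseUrl));
  if (!res.ok) throw new Error("LM Studio server responded with an error status");
  return extractModelIds(await res.json());
}
```

In a browser or a recent Node.js release, calling `listModelIds("http://localhost:1234", fetch)` should reject if the server is not running.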
- Clone the repository: `git clone <repo-url>`, then `cd lmstudio_ocr_pwa`.
- Install dependencies: `npm install`
- Run locally: `npm run dev`, then open `http://localhost:5173` in your browser.
Click the Settings (⚙️) icon in the toolbar.
- LM Studio: Check that Base URL is `http://localhost:1234` (or your custom port) and ensure your model is loaded in LM Studio.
- Google Gemini: Select "Google Gemini" and paste your API Key.
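To make the Base URL setting concrete, here is a minimal sketch of an OCR request in the OpenAI chat-completions format that LM Studio's server accepts, with the page image embedded as a base64 data URL. The function name `buildOcrRequest`, the prompt text, and the model identifier passed in are all placeholders. The app's actual client in `src/lmStudioClient.ts` may build its requests differently.

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  messages: Array<{ role: "user"; content: ContentPart[] }>;
}

/** Hypothetical helper: assemble the URL and JSON body for one image. */
function buildOcrRequest(
  baseUrl: string,
  model: string,
  imageBase64: string,
  mimeType: string,
): { url: string; body: ChatRequest } {
  // Tolerate a trailing slash on the configured Base URL.
  const url = baseUrl.replace(/\/+$/, "") + "/v1/chat/completions";
  const body: ChatRequest = {
    model,
    messages: [
      {
        role: "user",
        content: [
          // Placeholder prompt; the real prompt is not given in this document.
          { type: "text", text: "Transcribe this page as Markdown." },
          { type: "image_url", image_url: { url: `data:${mimeType};base64,${imageBase64}` } },
        ],
      },
    ],
  };
  return { url, body };
}
```

The result could then be sent with `fetch(url, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify(body) })`.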
- Click Open Folder in the sidebar to select a directory containing your documents.
- The sidebar lists all supported files (`.png`, `.jpg`, `.jpeg`, `.webp`, `.pdf`).
- Click a file to view it.
- Shift+Click or Cmd/Ctrl+Click to select multiple files.
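The file listing above amounts to an extension filter, which might look like the sketch below. `SUPPORTED_EXTENSIONS` and `isSupportedFile` are hypothetical names, and case-insensitive matching is an assumption, since the source does not say how `.PNG` or `.Pdf` would be treated.

```typescript
// Extensions listed in the README; the constant name is illustrative.
const SUPPORTED_EXTENSIONS = [".png", ".jpg", ".jpeg", ".webp", ".pdf"];

/** True when the file name ends with one of the supported extensions (case-insensitive by assumption). */
function isSupportedFile(name: string): boolean {
  const lower = name.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

/** Keep only supported files, sorted by name for a stable sidebar order (also an assumption). */
function listSupported(names: string[]): string[] {
  return names.filter(isSupportedFile).sort((a, b) => a.localeCompare(b));
}
```

In a browser, the names would likely come from iterating a directory handle obtained through the File System Access API, though this document does not confirm which API the app uses.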
- Select one or more files in the sidebar.
- Click Run OCR in the top toolbar.
- The app will process each file using the selected provider.
- Results are saved to disk automatically.
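The batch run described above can be pictured as a simple sequential queue in which one failed file does not stop the rest. This is a sketch under assumptions: `runQueue` and `OcrOutcome` are hypothetical names, and processing strictly one file at a time is a guess at what "queue" means here.

```typescript
type OcrOutcome = { name: string; status: "done" | "failed"; error?: string };

/**
 * Process files one at a time. `ocrOne` stands in for whatever performs
 * OCR on a single file and saves its result.
 */
async function runQueue(
  names: string[],
  ocrOne: (name: string) => Promise<void>,
): Promise<OcrOutcome[]> {
  const outcomes: OcrOutcome[] = [];
  for (const name of names) {
    try {
      await ocrOne(name); // wait for this file before starting the next
      outcomes.push({ name, status: "done" });
    } catch (err) {
      // Record the failure and continue with the remaining files.
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ name, status: "failed", error: message });
    }
  }
  return outcomes;
}
```

Running files sequentially keeps a single local model from being hit with concurrent requests. A cloud provider could probably tolerate some parallelism.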
- Split Pages: If you have scanned double pages (e.g., a book scan), select them and click "Split Pages". The app will generate `_L.jpg` and `_R.jpg` files for the left and right pages.
- PDF to Images: Convert a PDF into a folder of JPEG images for easier processing.
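The geometry behind Split Pages can be sketched as follows. The code assumes the scan is cut at the vertical midline and that the output names reuse the original file's stem. Neither detail is confirmed by this document, and `splitRects` and `splitNames` are hypothetical names.

```typescript
interface Rect { x: number; y: number; width: number; height: number }

/** Crop rectangles for the left and right halves of a double-page scan. */
function splitRects(width: number, height: number): { left: Rect; right: Rect } {
  const mid = Math.floor(width / 2); // assumption: the gutter sits at the horizontal center
  return {
    left: { x: 0, y: 0, width: mid, height },
    // The right half absorbs the extra pixel when the width is odd.
    right: { x: mid, y: 0, width: width - mid, height },
  };
}

/** Output names with the `_L.jpg` / `_R.jpg` suffixes appended to the original stem. */
function splitNames(fileName: string): { left: string; right: string } {
  const dot = fileName.lastIndexOf(".");
  const stem = dot > 0 ? fileName.slice(0, dot) : fileName;
  return { left: `${stem}_L.jpg`, right: `${stem}_R.jpg` };
}
```

In a browser, each rectangle could be passed to a canvas `drawImage` call to crop the half-page before encoding it as JPEG.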
- `src/App.tsx`: Main application logic and state management.
- `src/components/layout/WorkspaceLayout.tsx`: Core layout with Sidebar, Toolbar, and Content.
- `src/components/InstructionPage.tsx`: In-app help guide.
- `src/lmStudioClient.ts` / `src/geminiClient.ts`: API clients for the respective providers.
- `src/storage/ocrFileSystem.ts`: Handles reading/writing results to the user's local file system.
MIT License. Feel free to fork and modify for your own use.