vlln/paper2report

Paper to Report

A stateless microservice that turns a PDF paper (e.g., an arXiv preprint) at a given URL into a Markdown report. The service calls an external API to extract figures from the PDF, then uses a generative AI model to analyze the document's text and figures and produce a structured summary.

The project separates the HTTP layer, the service logic, and shared utilities. This separation keeps it extensible, easy to maintain, and suitable for serverless deployment.

Features

  • Simple API: A single GET endpoint /report accepts a pdf_url for processing.
  • Intelligent Figure Extraction: Integrates with an external service to identify and extract figures, tables, and their captions from the PDF.
  • AI-Powered Analysis: Uses a generative AI model (e.g., Google Gemini) via pydantic-ai to read the PDF and synthesize a report based on its text and figures.
  • Dynamic Prompting: Uses Jinja2 templates for system prompts, so the extracted figure data can be rendered into the prompt for each request.
  • Robust Configuration: Manages all settings (API keys, URLs) via environment variables or a .env file using Pydantic for type safety.
  • Containerized: Comes with a Dockerfile for easy, reproducible builds and deployments on any cloud platform.
  • Scalable by Design: Built with a stateless architecture, making it ideal for serverless environments like Google Cloud Run, AWS Lambda, or Fargate.

Architecture & Workflow

The service is designed with a clean, layered architecture to ensure maintainability and testability.

  1. API Layer (api/routes.py): A Flask Blueprint defines the /report endpoint. It is responsible for parsing and validating the incoming request.
  2. Request Handling: The endpoint receives a pdf_url.
  3. Service Orchestration: The API route coordinates calls to two distinct services:
     a. PDFExtractorService (services/pdf_extractor.py): This service is responsible for all communication with the external PDF figure extraction API. It encapsulates the logic for making the request and parsing the response into strongly-typed Pydantic models.
     b. AIProcessor (services/ai_processor.py): This service handles all interactions with the AI model. It renders the Jinja2 prompt template with the extracted figure data and manages the pydantic-ai Agent to generate the final report.
  4. Response: The generated Markdown content is returned to the client with a text/markdown content type.
  5. Centralized Error Handling (utils/error_handlers.py): Custom exceptions and global handlers ensure that any failures in the workflow result in consistent, meaningful JSON error responses.

Client --(GET /report?pdf_url=...)--> [Flask App]
                                          |
                                          V
                                [API Route (/report)]
                                          |
        +---------------------------------+----------------------------------+
        |                                                                    |
        V                                                                    V
[PDFExtractorService] --(POST)--> [PDF Figures API]                 [AIProcessor]
        |                                                                    |
        <--(Figure Data)-----------------+                                   |
                                          |                                  |
                                          +----(PDF, Figure Data)-------->|
                                                                             |
                                                                             V
                                                                  [Pydantic-AI Agent] --(API Call)--> [AI Provider (Google)]
                                                                             |
                                                                             <--(Markdown Report)----+
                                                                                                      |
                                                                                                      V
Client <--(200 OK: Markdown | 4xx/5xx: JSON Error)---------------------------------------------- [Flask App]
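
The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's code: Figure, generate_report, fake_extract, and fake_model are hypothetical names, the real service uses Pydantic models and a Jinja2 template where this sketch uses a dataclass and an f-string, and the stubs stand in for the external APIs so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Figure:
    """Hypothetical stand-in for the project's Pydantic figure model."""
    label: str
    caption: str


def generate_report(
    pdf_url: str,
    extract_figures: Callable[[str], List[Figure]],
    run_model: Callable[[str, str], str],
) -> str:
    """Run the two services in order and return the Markdown report."""
    # Step 1: ask the extraction service for the figures in the PDF.
    figures = extract_figures(pdf_url)
    # Step 2: render the figure data into the prompt
    # (the service described above uses a Jinja2 template for this).
    figure_lines = "\n".join(f"- {f.label}: {f.caption}" for f in figures)
    prompt = f"Summarize the paper. Figures:\n{figure_lines}"
    # Step 3: hand the PDF location and the prompt to the AI layer.
    return run_model(pdf_url, prompt)


# Stubs so the sketch runs without network access or API keys.
def fake_extract(pdf_url: str) -> List[Figure]:
    return [Figure("Figure 1", "Model architecture")]


def fake_model(pdf_url: str, prompt: str) -> str:
    return f"# Report for {pdf_url}\n\n{prompt}"


report = generate_report("https://example.org/paper.pdf", fake_extract, fake_model)
```

Passing the two services in as arguments keeps the orchestration step testable without touching either external API.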

Getting Started

Prerequisites

  • Python 3.10+
  • Docker (for containerized deployment)
  • An API Key for a supported AI Provider (e.g., Google AI Studio for Gemini)

1. Clone the Repository

git clone <repository_url>
cd <repository_directory>

2. Configure Environment Variables

Copy the example environment file and fill in your credentials.

cp .env.example .env

Now, edit the .env file with your specific configuration:

# .env
PDFFIGURES_API_URL="https://extract-figures-cjadjukuvl.cn-hangzhou.fcapp.run/api/extract"
AI_PROVIDER_API_KEY="YOUR_GEMINI_API_KEY"
AI_PROVIDER_API_BASE="https://generativelanguage.googleapis.com"
AI_MODEL_NAME="gemini-2.5-flash"

3. Install Dependencies

It is recommended to use a virtual environment.

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt

Usage

Running Locally (for Development)

The built-in Flask development server is suitable for local testing only.

flask --app src.app:create_app run --host=0.0.0.0 --port=8000

Running with a Production Server (Gunicorn)

Gunicorn is a battle-tested WSGI server for running Python applications in production.

gunicorn --workers 4 --bind 0.0.0.0:8000 "src.app:create_app()"
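
The same settings can also live in a Gunicorn configuration file, which is itself a Python module. The sketch below assumes a file named gunicorn.conf.py that is not described as part of this repository. The timeout value is an assumption too: Gunicorn's default worker timeout is 30 seconds, and a request that waits on both the figure extraction API and the AI provider may take longer.

```python
# gunicorn.conf.py (hypothetical file, shown for illustration only)

# Same values as the command-line flags above.
bind = "0.0.0.0:8000"
workers = 4

# Assumption: report generation can outlast Gunicorn's 30-second default
# worker timeout, so the limit is raised. Tune it to observed latencies.
timeout = 120
```

With such a file in place, the server could be started with gunicorn -c gunicorn.conf.py "src.app:create_app()".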

Running with Docker

Build and run the Docker container for a standardized deployment environment.

# 1. Build the Docker image
docker build -t pdf-to-markdown-service .

# 2. Run the container, passing the .env file for configuration
docker run -p 8000:8000 --env-file ./.env pdf-to-markdown-service

API Endpoint

Generate Report

  • Endpoint: /report
  • Method: GET
  • Query Parameters:
    • pdf_url (required): The publicly accessible URL of the PDF to be processed.
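
This README does not show the validation rules applied to pdf_url. As a rough sketch of what "missing or invalid" might mean, the hypothetical helper below accepts only absolute http(s) URLs that include a host. The checks actually made in api/routes.py may differ.

```python
from typing import Optional
from urllib.parse import urlparse


def is_valid_pdf_url(value: Optional[str]) -> bool:
    """Return True if value is an absolute http(s) URL with a host."""
    if not value:
        # Covers both a missing parameter (None) and an empty string.
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A route could return 400 Bad Request whenever this check fails, before calling any downstream service.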

Example Request (curl):

curl -X GET "http://localhost:8000/report?pdf_url=https://arxiv.org/pdf/2403.05530.pdf" -o report.md
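
The same request can be made from Python using only the standard library. In this sketch, build_report_url and fetch_report are hypothetical helper names, and the 120-second timeout is an assumed value. Percent-encoding pdf_url keeps any query string inside the PDF address from being misread as parameters of /report.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_report_url(base_url: str, pdf_url: str) -> str:
    """Return the /report URL with pdf_url percent-encoded as a query value."""
    return f"{base_url.rstrip('/')}/report?{urlencode({'pdf_url': pdf_url})}"


def fetch_report(base_url: str, pdf_url: str, timeout: float = 120.0) -> str:
    """Call the service and return the Markdown body.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    with urlopen(build_report_url(base_url, pdf_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


url = build_report_url("http://localhost:8000", "https://arxiv.org/pdf/2403.05530.pdf")
```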

Success Response (200 OK):

The body of the response will be the raw Markdown content.

Error Responses:

  • 400 Bad Request: If the pdf_url parameter is missing or invalid.
  • 502 Bad Gateway: If the downstream PDF extraction service or the AI service fails.

The error response body will be a JSON object:

{
  "error": {
    "type": "PDFExtractionError",
    "message": "Failed to communicate with PDF extraction service: ..."
  }
}
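
One way to produce this envelope consistently is to let each exception class carry its own HTTP status. The sketch below is an assumption about how that could look, not the contents of utils/error_handlers.py: ServiceError, InvalidRequestError, and to_error_response are hypothetical names, and only PDFExtractionError appears in the example response above.

```python
import json
from typing import Tuple


class ServiceError(Exception):
    """Hypothetical base class; subclasses set the HTTP status to return."""
    status_code = 500


class InvalidRequestError(ServiceError):
    """Hypothetical error for a missing or invalid pdf_url."""
    status_code = 400


class PDFExtractionError(ServiceError):
    """Raised when the downstream extraction service fails."""
    status_code = 502


def to_error_response(exc: ServiceError) -> Tuple[str, int]:
    """Serialize an exception into the JSON error envelope and its status."""
    body = {"error": {"type": type(exc).__name__, "message": str(exc)}}
    return json.dumps(body), exc.status_code


body, status = to_error_response(
    PDFExtractionError("Failed to communicate with PDF extraction service: timeout")
)
```

A global handler registered for the base class could then call to_error_response for every failure in the workflow.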

Health Check

  • Endpoint: /health
  • Method: GET
  • Description: A simple endpoint to verify that the service is running. Returns OK with a 200 status code.

Configuration

All configuration is managed via environment variables, documented below.

| Variable | Description | Default Value |
| --- | --- | --- |
| PDFFIGURES_API_URL | The endpoint for the PDF figure extraction service. | https://extract.../api/extract |
| AI_PROVIDER_API_KEY | The API key for your chosen AI provider (e.g., Google). | (Required) |
| AI_PROVIDER_API_BASE | The base URL for the AI provider's API. | https://generativelanguage.googleapis.com |
| AI_MODEL_NAME | The specific model to use for generation. | gemini-1.5-flash-latest |
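
The project loads these values with Pydantic. The sketch below shows the same idea with only the standard library, so Settings and load_settings are hypothetical names rather than the contents of config.py. Because the default extraction URL is abbreviated in the table, the sketch requires PDFFIGURES_API_URL to be set explicitly instead of guessing it.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    pdffigures_api_url: str
    ai_provider_api_key: str
    ai_provider_api_base: str
    ai_model_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from env, failing fast when a required value is absent."""
    # Assumption: both values are treated as required in this sketch, because
    # the default for PDFFIGURES_API_URL is elided in the table above.
    required = ("AI_PROVIDER_API_KEY", "PDFFIGURES_API_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        pdffigures_api_url=env["PDFFIGURES_API_URL"],
        ai_provider_api_key=env["AI_PROVIDER_API_KEY"],
        ai_provider_api_base=env.get(
            "AI_PROVIDER_API_BASE", "https://generativelanguage.googleapis.com"
        ),
        ai_model_name=env.get("AI_MODEL_NAME", "gemini-1.5-flash-latest"),
    )
```

Failing at startup when the API key is missing surfaces a misconfigured deployment before the first request arrives.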

Design Philosophy & Extensibility

This service is engineered with the following first principles in mind:

  • Separation of Concerns: Each component has a single responsibility. The api layer handles HTTP concerns, services contain business logic, and utils provide shared functionality. This makes the codebase easier to understand, test, and refactor.
  • Dependency Inversion: The API routes depend on the interface each service class exposes, not on the details of the external API behind it, so a dependency can be swapped with little effort. For example, to switch from the current PDF extractor to an in-house solution, you would only need to create a new class that conforms to the expected interface and instantiate it in the route, with no changes to the AI service or the rest of the API logic.
  • Configuration as Code: By externalizing all configuration to the environment, the application artifact (the Docker image) is immutable and can be promoted across different environments (development, staging, production) without modification.
  • Strongly-Typed Data: Pydantic models are used for configuration (config.py) and API data structures (services/pdf_extractor.py). Validation catches malformed configuration and malformed API responses at the boundary rather than deep in the call stack, improves developer experience with auto-completion, and serves as clear documentation for data contracts.
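
The dependency inversion point can be illustrated with typing.Protocol. This README does not say whether the project defines a formal interface, so FigureExtractor, RemoteExtractor, InHouseExtractor, and figure_sources are all hypothetical names, and the dictionaries stand in for the Pydantic figure models.

```python
from typing import Dict, List, Protocol


class FigureExtractor(Protocol):
    """Hypothetical interface: anything with this method can be plugged in."""

    def extract(self, pdf_url: str) -> List[Dict[str, str]]: ...


class RemoteExtractor:
    """Stand-in for the client of the external figure extraction API."""

    def extract(self, pdf_url: str) -> List[Dict[str, str]]:
        return [{"source": "remote", "pdf_url": pdf_url}]


class InHouseExtractor:
    """Stand-in for a replacement that runs extraction locally."""

    def extract(self, pdf_url: str) -> List[Dict[str, str]]:
        return [{"source": "in-house", "pdf_url": pdf_url}]


def figure_sources(extractor: FigureExtractor, pdf_url: str) -> List[str]:
    """Caller code depends only on the FigureExtractor interface."""
    return [figure["source"] for figure in extractor.extract(pdf_url)]


remote = figure_sources(RemoteExtractor(), "https://example.org/paper.pdf")
in_house = figure_sources(InHouseExtractor(), "https://example.org/paper.pdf")
```

Neither extractor inherits from FigureExtractor. Matching the method signature is enough for a type checker, so swapping implementations changes only the line that constructs the extractor.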

License

This project is licensed under the Apache 2.0 License.
