A highly scalable, production-ready microservice that transforms a PDF paper (e.g., an arXiv preprint) fetched from a URL into a comprehensive Markdown report. The service uses an external API to extract figures from the PDF and a generative AI model to analyze the document's text and visual elements, producing a structured summary.
This project is architected following the principles of high extensibility and separation of concerns, making it suitable for serverless deployment and easy to maintain and evolve.
- Simple API: A single GET endpoint `/report` accepts a `pdf_url` for processing.
- Intelligent Figure Extraction: Integrates with an external service to identify and extract figures, tables, and their captions from the PDF.
- AI-Powered Analysis: Utilizes a powerful Generative AI model (e.g., Google Gemini) via `pydantic-ai` to read the PDF and synthesize a report based on text and figures.
- Dynamic Prompting: Uses Jinja2 templates for system prompts, allowing for flexible and powerful interaction with the AI model.
- Robust Configuration: Manages all settings (API keys, URLs) via environment variables or a `.env` file using Pydantic for type safety.
- Containerized: Comes with a `Dockerfile` for easy, reproducible builds and deployments on any cloud platform.
- Scalable by Design: Built with a stateless architecture, making it ideal for serverless environments like Google Cloud Run, AWS Lambda, or Fargate.
The service is designed with a clean, layered architecture to ensure maintainability and testability.
- API Layer (`api/routes.py`): A Flask Blueprint defines the `/report` endpoint. It is responsible for parsing and validating the incoming request.
- Request Handling: The endpoint receives a `pdf_url`.
- Service Orchestration: The API route coordinates calls to two distinct services:
  a. `PDFExtractorService` (`services/pdf_extractor.py`): This service is responsible for all communication with the external PDF figure extraction API. It encapsulates the logic for making the request and parsing the response into strongly-typed Pydantic models.
  b. `AIProcessor` (`services/ai_processor.py`): This service handles all interactions with the AI model. It renders the Jinja2 prompt template with the extracted figure data and manages the `pydantic-ai` Agent to generate the final report.
- Response: The generated Markdown content is returned to the client with a `text/markdown` content type.
- Centralized Error Handling (`utils/error_handlers.py`): Custom exceptions and global handlers ensure that any failures in the workflow result in consistent, meaningful JSON error responses.
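The validation and error-handling responsibilities listed above can be illustrated with a small, framework-free sketch. It is not the project's actual `utils/error_handlers.py`: `ServiceError`, `InvalidRequestError`, `validate_pdf_url`, and `to_error_response` are hypothetical names, and only `PDFExtractionError` and the JSON shape are taken from the error example in this README. In the real service, a Flask error handler would presumably play the role of `to_error_response`.

```python
import json
from urllib.parse import urlparse


class ServiceError(Exception):
    """Hypothetical base class for failures that should surface as JSON errors."""
    status_code = 500


class InvalidRequestError(ServiceError):
    status_code = 400  # missing or malformed pdf_url


class PDFExtractionError(ServiceError):
    status_code = 502  # downstream figure-extraction API failed


def validate_pdf_url(raw):
    """Return the URL if it is an absolute http(s) URL, otherwise raise."""
    if not raw:
        raise InvalidRequestError("Missing required query parameter: pdf_url")
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidRequestError(f"Invalid pdf_url: {raw!r}")
    return raw


def to_error_response(exc):
    """Map any exception to a (JSON body, HTTP status) pair."""
    status = exc.status_code if isinstance(exc, ServiceError) else 500
    body = {"error": {"type": type(exc).__name__, "message": str(exc)}}
    return json.dumps(body), status
```

Keeping the status code on the exception class means a single global handler can serve every failure path without per-route branching.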
Client --(GET /report?pdf_url=...)--> [Flask App]
                                           |
                                           V
                                 [API Route (/report)]
                                           |
          +--------------------------------+--------------------+
          |                                                     |
          V                                                     V
[PDFExtractorService] --(POST)--> [PDF Figures API]       [AIProcessor]
                                          |                     |
          <--(Figure Data)----------------+                     |
          |                                                     |
          +----(PDF, Figure Data)------------------------------>|
                                                                |
                                                                V
                                                       [Pydantic-AI Agent] --(API Call)--> [AI Provider (Google)]
                                                                                                     |
                                                                <--(Markdown Report)-----------------+
                                                                |
                                                                V
Client <--(200 OK: Markdown | 5xx: JSON Error)------------ [Flask App]
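The order of calls in the diagram can be sketched in plain Python. This is only an illustration of the flow, assuming stand-in functions in place of `PDFExtractorService` and `AIProcessor`; the `Figure` fields, `generate_report`, `fake_extract`, and `fake_ai` are hypothetical and do not reflect the real extraction API's response or the project's code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Figure:
    # Hypothetical shape: the real extraction API's fields are not given here.
    caption: str
    page: int


def generate_report(
    pdf_url: str,
    extract_figures: Callable[[str], List[Figure]],
    run_ai: Callable[[str, List[Figure]], str],
) -> str:
    """Extract figures first, then hand the PDF URL plus figures to the AI step."""
    figures = extract_figures(pdf_url)
    return run_ai(pdf_url, figures)


# Offline stand-ins so the flow can be run without any network access.
def fake_extract(pdf_url: str) -> List[Figure]:
    return [Figure(caption="Figure 1: Model overview", page=2)]


def fake_ai(pdf_url: str, figures: List[Figure]) -> str:
    lines = [f"# Report for {pdf_url}", ""]
    lines += [f"- {fig.caption} (page {fig.page})" for fig in figures]
    return "\n".join(lines)


report = generate_report(
    "https://arxiv.org/pdf/2403.05530.pdf", fake_extract, fake_ai
)
```

Because the two steps are passed in as arguments, the same function can be exercised in tests with fakes and in production with the real services.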
- Python 3.10+
- Docker (for containerized deployment)
- An API Key for a supported AI Provider (e.g., Google AI Studio for Gemini)
git clone <repository_url>
cd <repository_directory>

Copy the example environment file and fill in your credentials.

cp .env.example .env

Now, edit the `.env` file with your specific configuration:

# .env
PDFFIGURES_API_URL="https://extract-figures-cjadjukuvl.cn-hangzhou.fcapp.run/api/extract"
AI_PROVIDER_API_KEY="YOUR_GEMINI_API_KEY"
AI_PROVIDER_API_BASE="https://generativelanguage.googleapis.com"
AI_MODEL_NAME="gemini-2.5-flash"

It is recommended to use a virtual environment.

python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt

The embedded Flask development server is suitable for local testing.

flask --app src.app:create_app run --host=0.0.0.0 --port=8000

Gunicorn is a battle-tested WSGI server for running Python applications in production.

gunicorn --workers 4 --bind 0.0.0.0:8000 "src.app:create_app()"

Build and run the Docker container for a standardized deployment environment.

# 1. Build the Docker image
docker build -t pdf-to-markdown-service .
# 2. Run the container, passing the .env file for configuration
docker run -p 8000:8000 --env-file ./.env pdf-to-markdown-service

- Endpoint: `/report`
- Method: `GET`
- Query Parameters:
  - `pdf_url` (required): The publicly accessible URL of the PDF to be processed.
Example Request (curl):
curl -X GET "http://localhost:8000/report?pdf_url=https://arxiv.org/pdf/2403.05530.pdf" -o report.md

Success Response (200 OK):
The body of the response will be the raw Markdown content.
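For callers that prefer Python over curl, a minimal standard-library client might look like the sketch below. The helper names `build_report_url` and `fetch_report` and the 300-second timeout are assumptions for illustration; only the `/report` path and the `pdf_url` parameter come from this README. Percent-encoding the parameter matters when the PDF URL carries its own query string.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_report_url(base_url: str, pdf_url: str) -> str:
    """Percent-encode pdf_url so its own '?', '&' or '#' stay inside the parameter."""
    query = urlencode({"pdf_url": pdf_url})
    return f"{base_url.rstrip('/')}/report?{query}"


def fetch_report(base_url: str, pdf_url: str, timeout: float = 300.0) -> str:
    """GET /report and return the Markdown body.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so the JSON
    error body can be read from the exception by the caller if needed.
    """
    with urlopen(build_report_url(base_url, pdf_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the service running locally, `fetch_report("http://localhost:8000", "https://arxiv.org/pdf/2403.05530.pdf")` would return the same Markdown that the curl command writes to `report.md`.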
Error Responses:
- 400 Bad Request: If the `pdf_url` parameter is missing or invalid.
- 502 Bad Gateway: If the downstream PDF extraction service or the AI service fails.
The error response body will be a JSON object:
{
"error": {
"type": "PDFExtractionError",
"message": "Failed to communicate with PDF extraction service: ..."
}
}

- Endpoint: `/health`
- Method: `GET`
- Description: A simple endpoint to verify that the service is running. Returns `OK` with a 200 status code.
All configuration is managed via environment variables, documented below.
| Variable | Description | Default Value |
|---|---|---|
| `PDFFIGURES_API_URL` | The endpoint for the PDF figure extraction service. | `https://extract.../api/extract` |
| `AI_PROVIDER_API_KEY` | The API key for your chosen AI provider (e.g., Google). | (Required) |
| `AI_PROVIDER_API_BASE` | The base URL for the AI provider's API. | `https://generativelanguage.googleapis.com` |
| `AI_MODEL_NAME` | The specific model to use for generation (e.g., `gemini-1.5-flash-latest`). | `gemini-1.5-flash-latest` |
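To make the table concrete, here is a standard-library sketch of loading these variables with fail-fast checks. The project itself uses Pydantic for this, so the `Settings` dataclass and `load_settings` function below are hypothetical stand-ins rather than the contents of `config.py`. Because the table elides the default for `PDFFIGURES_API_URL`, the sketch treats that variable as required instead of guessing a value.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    pdffigures_api_url: str
    ai_provider_api_key: str
    ai_provider_api_base: str
    ai_model_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on missing values."""

    def required(name: str) -> str:
        value = env.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        return value

    return Settings(
        # Assumption: treated as required here because the documented default is elided.
        pdffigures_api_url=required("PDFFIGURES_API_URL"),
        ai_provider_api_key=required("AI_PROVIDER_API_KEY"),
        # Defaults below are copied from the table above.
        ai_provider_api_base=env.get(
            "AI_PROVIDER_API_BASE", "https://generativelanguage.googleapis.com"
        ),
        ai_model_name=env.get("AI_MODEL_NAME", "gemini-1.5-flash-latest"),
    )
```

Passing the mapping as a parameter keeps the loader testable without mutating the real process environment.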
This service is engineered with the following first principles in mind:
- Separation of Concerns: Each component has a single responsibility. The `api` layer handles HTTP concerns, `services` contain business logic, and `utils` provide shared functionality. This makes the codebase easier to understand, test, and refactor.
- Dependency Inversion: The API routes depend on abstractions (the service classes) rather than concrete implementations. This makes it trivial to swap out dependencies. For example, if you wanted to switch from the current PDF extractor to an in-house solution, you would only need to create a new class that conforms to the expected interface and instantiate it in the route, with no changes to the AI service or API logic.
- Configuration as Code: By externalizing all configuration to the environment, the application artifact (the Docker image) is immutable and can be promoted across different environments (development, staging, production) without modification.
- Strongly-Typed Data: Pydantic models are used for configuration (`config.py`) and API data structures (`services/pdf_extractor.py`). This eliminates a whole class of runtime errors, improves developer experience with auto-completion, and serves as clear documentation for data contracts.
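The extractor swap described under Dependency Inversion could be expressed with a structural interface. The sketch below assumes a hypothetical `FigureExtractor` protocol with a single `extract` method returning caption strings; the README does not state the real interface, and both extractor classes are offline stand-ins that perform no network calls.

```python
from typing import List, Protocol


class FigureExtractor(Protocol):
    """Hypothetical interface: the real method name and return type may differ."""

    def extract(self, pdf_url: str) -> List[str]: ...


class RemoteExtractor:
    """Stand-in for a class that would call the external figures API."""

    def extract(self, pdf_url: str) -> List[str]:
        return [f"remote figure from {pdf_url}"]


class InHouseExtractor:
    """Stand-in for an in-house replacement with the same method signature."""

    def extract(self, pdf_url: str) -> List[str]:
        return [f"in-house figure from {pdf_url}"]


def caption_summary(extractor: FigureExtractor, pdf_url: str) -> str:
    """Depends only on the protocol, so either extractor can be passed in."""
    return "; ".join(extractor.extract(pdf_url))
```

Neither extractor inherits from `FigureExtractor`; matching the method signature is enough for a type checker, which keeps the replacement class free of any import from the code it replaces.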
This project is licensed under the Apache 2.0 License.