Document Conversion and RAG System

This project provides a document processing and Retrieval Augmented Generation (RAG) system for extracting, chunking, and querying information from various document formats.

Features

Convert PDF, HTML, Markdown, and image documents
Extract text, tables, and images from documents
Process and chunk documents using various strategies (Character, Recursive, Semantic)
Store document chunks in a vector database (Qdrant)
Query document content using natural language with RAG
Visualize and export document elements
Web interface for document upload and interactive Q&A

Prerequisites

Python 3.10+
Docker and Docker Compose (for containerized deployment)
OpenAI API key (for LLM capabilities)

Installation

Option 1: Local Installation

Clone the repository:

git clone <repository-url>
cd doc-convert

Install dependencies:
```
pip install -r requirements.txt
```

Run Qdrant server (required for vector storage):

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

Set up environment variables: Create a .env file in the project root with:
```
OPENAI_API_KEY=your_openai_api_key
```

Option 2: Docker Deployment

Clone the repository:

git clone <repository-url>
cd doc-convert

Create a .env file with:
```
OPENAI_API_KEY=your_openai_api_key
```
Run the application stack:
```
docker-compose up
```

Usage

Document Conversion

Convert PDF documents to Markdown using the test1.py script:

python test1.py

This will:

Download a sample PDF from arXiv
Convert it to Markdown format
Save the output to ./output/ directory

Figure Export

Extract figures and tables from PDF documents using the figure-export.ipynb notebook:

Open the notebook in Jupyter:
```
jupyter notebook figure-export.ipynb
```
Run the cells to extract and save images from the PDF

RAG Web Interface

Use the Streamlit web application to process documents and ask questions:

Start the application:
```
streamlit run rag_streamlit_app.py
```
Access the web interface at http://localhost:8501
Enter your OpenAI API key (if not set in .env)
Input a document URL to process
Ask questions about the document content

Chunking Strategies

The system supports multiple chunking strategies:

Character Chunking: Divides text based on a fixed number of characters
Recursive Character Chunking: Divides text recursively until a condition is met
Document Specific Chunking: Respects document structure (paragraphs, sections)
Semantic Chunking: Groups text by semantic relationships
Token-based Chunking: Divides text based on token count

Development

Project Structure

rag_streamlit_app.py: Main Streamlit web application
test1.py: PDF to Markdown converter script
test2.py and test2.ipynb: Document chunking and vector search examples
figure-export.ipynb: Extract figures and tables from PDFs
requirements.txt: Python dependencies
Dockerfile and docker-compose.yml: Containerization configuration

Adding New Features

New Document Formats:
- Extend the DocumentConverter class with additional format handlers
- Add the format to allowed_formats in converter initialization
Custom Chunking Strategies:
- Create a new chunker class extending base chunkers
- Implement the chunk method
Improving RAG:
- Modify the LangChain components in setup_langchain_rag function
- Adjust parameters in the retriever or document chain

Troubleshooting

Memory Issues: For large documents, increase Docker memory limits or process documents in smaller batches
PDF Conversion Issues: Check PDF permissions or try alternative URLs
Qdrant Connectivity: Ensure Qdrant server is running and accessible

License

[Specify License]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Document Conversion and RAG System

Features

Prerequisites

Installation

Option 1: Local Installation

Option 2: Docker Deployment

Usage

Document Conversion

Figure Export

RAG Web Interface

Chunking Strategies

Development

Project Structure

Adding New Features

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
figure-export.ipynb		figure-export.ipynb
rag_streamlit_app.py		rag_streamlit_app.py
requirements.txt		requirements.txt
test1.py		test1.py
test2.ipynb		test2.ipynb
test2.py		test2.py

phanngoc/docling-rag

Folders and files

Latest commit

History

Repository files navigation

Document Conversion and RAG System

Features

Prerequisites

Installation

Option 1: Local Installation

Option 2: Docker Deployment

Usage

Document Conversion

Figure Export

RAG Web Interface

Chunking Strategies

Development

Project Structure

Adding New Features

Troubleshooting

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages