This project provides a document processing and Retrieval-Augmented Generation (RAG) system for extracting, chunking, and querying information from various document formats.
- Convert PDF, HTML, Markdown, and image documents
- Extract text, tables, and images from documents
- Process and chunk documents using various strategies (Character, Recursive, Semantic)
- Store document chunks in a vector database (Qdrant)
- Query document content using natural language with RAG
- Visualize and export document elements
- Web interface for document upload and interactive Q&A
- Python 3.10+
- Docker and Docker Compose (for containerized deployment)
- OpenAI API key (for LLM capabilities)
- Clone the repository:
      git clone <repository-url>
      cd doc-convert
- Install dependencies:
      pip install -r requirements.txt
- Run the Qdrant server (required for vector storage):
      docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
- Set up environment variables: create a .env file in the project root with:
      OPENAI_API_KEY=your_openai_api_key
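How the application reads the .env file is not stated here (python-dotenv is a common choice). As a minimal sketch, assuming simple KEY=VALUE lines, the file can be loaded with the standard library alone. The helper name `load_env_file` is hypothetical and not part of this project.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse KEY=VALUE lines and export any that are not already set.

    Hypothetical helper for illustration; the project may load .env differently.
    """
    loaded: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        # Values already present in the environment take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

Calling `load_env_file()` from the project root and then checking `"OPENAI_API_KEY" in os.environ` is a quick way to confirm the key is visible before starting the app.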
- Clone the repository:
      git clone <repository-url>
      cd doc-convert
- Create a .env file with:
      OPENAI_API_KEY=your_openai_api_key
- Run the application stack:
      docker-compose up
Convert PDF documents to Markdown using the test1.py script:
    python test1.py
This will:
- Download a sample PDF from arXiv
- Convert it to Markdown format
- Save the output to the ./output/ directory
Extract figures and tables from PDF documents using the figure-export.ipynb notebook:
- Open the notebook in Jupyter:
      jupyter notebook figure-export.ipynb
- Run the cells to extract and save images from the PDF
Use the Streamlit web application to process documents and ask questions:
- Start the application:
      streamlit run rag_streamlit_app.py
- Access the web interface at http://localhost:8501
- Enter your OpenAI API key (if not set in .env)
- Input a document URL to process
- Ask questions about the document content
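The question-answering step follows the usual RAG pattern: find the chunks most similar to the question, then pass them to the LLM as context. The sketch below illustrates only that pattern. It is not the project's code: a toy word-count embedding stands in for a real embedding model, an in-memory list stands in for Qdrant, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


def build_prompt(question: str, chunks: List[str], k: int = 2) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"- {c}" for _, c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In the real application the vector store performs the similarity search and the prompt is sent to the LLM; only the retrieve-then-prompt flow carries over from this sketch.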
The system supports multiple chunking strategies:
- Character Chunking: Divides text based on a fixed number of characters
- Recursive Character Chunking: Splits text on a prioritized list of separators (for example paragraphs, then lines, then words), recursing until each chunk fits the size limit
- Document Specific Chunking: Respects document structure (paragraphs, sections)
- Semantic Chunking: Groups adjacent text with similar meaning, typically by comparing embeddings
- Token-based Chunking: Divides text based on token count
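To make the first two strategies concrete, here is a minimal standard-library sketch. It assumes a character-based size limit and a default separator order of paragraph, line, then space. The function names are illustrative, and the project's own chunkers may behave differently.

```python
from typing import List, Sequence


def character_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Fixed-size character chunking with optional overlap between chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def recursive_chunks(text: str, size: int,
                     separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones if needed."""
    if len(text) <= size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return character_chunks(text, size)
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= size:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_chunks(piece, size, rest))
    if current:
        chunks.append(current)
    return chunks
```

The recursive variant keeps paragraphs and words intact where it can, so its chunks tend to be more readable than those from a fixed character split.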
- rag_streamlit_app.py: Main Streamlit web application
- test1.py: PDF to Markdown converter script
- test2.py and test2.ipynb: Document chunking and vector search examples
- figure-export.ipynb: Extract figures and tables from PDFs
- requirements.txt: Python dependencies
- Dockerfile and docker-compose.yml: Containerization configuration
- New Document Formats:
  - Extend the DocumentConverter class with additional format handlers
  - Add the format to allowed_formats in converter initialization
- Custom Chunking Strategies:
  - Create a new chunker class extending base chunkers
  - Implement the chunk method
- Improving RAG:
  - Modify the LangChain components in the setup_langchain_rag function
  - Adjust parameters in the retriever or document chain
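A custom chunking strategy might look like the sketch below. The project's actual base chunker classes are not shown in this README, so `BaseChunker` is a hypothetical stand-in that only assumes a `chunk` method taking text and returning a list of strings. `SentenceChunker` is likewise an illustrative example.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class BaseChunker(ABC):
    """Hypothetical base class; the project's real base chunkers may differ."""

    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        """Split text into a list of chunks."""


class SentenceChunker(BaseChunker):
    """Packs whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """

    def __init__(self, max_chars: int = 200) -> None:
        self.max_chars = max_chars

    def chunk(self, text: str) -> List[str]:
        # Split after sentence-ending punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        chunks: List[str] = []
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > self.max_chars:
                chunks.append(current)  # adding this sentence would overflow
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
        return chunks
```

For example, `SentenceChunker(max_chars=500).chunk(document_text)` returns a list of strings that could then be embedded and stored like any other chunks.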
- Memory Issues: For large documents, increase Docker memory limits or process documents in smaller batches
- PDF Conversion Issues: Check PDF permissions or try alternative URLs
- Qdrant Connectivity: Ensure Qdrant server is running and accessible
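For the Qdrant connectivity case, a quick first check is whether anything is listening on the published port. The sketch below assumes the default mapping from the docker run command above (port 6333 on localhost). The helper name `is_port_open` is hypothetical. It confirms only that the TCP port accepts connections, not that Qdrant itself is healthy.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 6333, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

If `is_port_open()` returns False, check that the Qdrant container is running and that the -p 6333:6333 mapping is in place. If it returns True but queries still fail, the issue is more likely in the client configuration.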
[Specify License]