A Flask-based web application that enables chat with a local LLM (via Ollama) enhanced with Retrieval-Augmented Generation (RAG) using ChromaDB vector storage and PDF documents.
## Features

- Local LLM Integration: Uses Ollama to run local language models
- RAG Enhancement: Retrieves relevant context from PDF documents to improve responses
- Vector Storage: Uses ChromaDB for efficient document storage and retrieval
- Streaming Responses: Real-time streaming of LLM responses
- Chat History: Persistent chat history with download functionality
- Modern UI: Clean, responsive web interface
## Prerequisites

- Python 3.8+
- Ollama installed and running locally
  - Download from: https://ollama.ai/
  - Install and start the Ollama service
  - Pull a model (llama3.2 or any other model):

    ```bash
    ollama pull llama3.2
    ```
## Installation

1. Clone or download this repository.

2. Navigate to the RAG directory:

   ```bash
   cd RAG
   ```

3. Install Python dependencies (a sketch of a typical `requirements.txt` follows this list):

   ```bash
   pip install -r requirements.txt
   ```

4. Add your PDF documents:
   - Place your PDF files in the `documents/` folder
   - The app will automatically process them on first run
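The repository ships its own `requirements.txt`; if you ever need to reconstruct it, the Dependencies section below maps roughly to the following PyPI packages (package names are assumptions and versions are left unpinned):

```text
flask
langchain
langchain-community
chromadb
sentence-transformers
PyPDF2
requests
```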
## Usage

1. Start the Flask server:

   ```bash
   python server.py
   ```

2. Open your web browser and go to: http://localhost:5001

3. Start chatting!
   - Type your questions in the chat interface
   - The app will retrieve relevant context from your PDF documents
   - Responses are enhanced with the retrieved information
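Under the hood, the server streams the model's output to the browser as it is generated. The sketch below shows one way a Flask streaming endpoint can do this; the route name, request shape, and hard-coded model are illustrative assumptions rather than the repo's actual code, and RAG retrieval is omitted for brevity (see How It Works below).

```python
# Minimal streaming route: forwards the prompt to Ollama and relays tokens to
# the client as they arrive (RAG retrieval omitted here for brevity).
import json

import requests
from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    prompt = request.json.get("message", "")

    def generate():
        payload = {"model": "llama3.2", "prompt": prompt, "stream": True}
        with requests.post("http://localhost:11434/api/generate",
                           json=payload, stream=True) as resp:
            for line in resp.iter_lines():
                if line:
                    yield json.loads(line).get("response", "")

    return Response(stream_with_context(generate()), mimetype="text/plain")

if __name__ == "__main__":
    app.run(port=5001)
```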
## How It Works

### Document Processing

- PDFs in the `documents/` folder are automatically loaded and chunked
- Text chunks are embedded using HuggingFace's `sentence-transformers/all-MiniLM-L6-v2`
- Embeddings are stored in ChromaDB for fast retrieval
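A minimal sketch of such an ingestion pipeline, assuming PyPDF2 for text extraction and LangChain wrappers for splitting, embeddings, and the Chroma store; the actual function names in `rag_utils.py` may differ, and LangChain import paths vary by version:

```python
# Rough sketch of the ingestion step (function names are illustrative, not the
# repo's actual API).
from pathlib import Path

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

def build_index(doc_dir: str = "documents", persist_dir: str = "chroma_db") -> Chroma:
    # 1. Load every PDF and extract its raw text.
    texts = []
    for pdf_path in sorted(Path(doc_dir).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        texts.append("\n".join(page.extract_text() or "" for page in reader.pages))

    # 2. Split the text into overlapping chunks (defaults: 1000 chars, 100 overlap).
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text("\n\n".join(texts))

    # 3. Embed the chunks and persist them in ChromaDB.
    embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL)
    return Chroma.from_texts(chunks, embeddings, persist_directory=persist_dir)
```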
### Query Processing

- User sends a question
- System retrieves relevant document chunks from ChromaDB
- Context is combined with the user's question
- Enhanced prompt is sent to the Ollama LLM
- Response is streamed back to the user
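Roughly, the query path looks like the following sketch, which assumes the Chroma store built above and Ollama's `/api/generate` streaming endpoint; function and variable names are illustrative, not the repo's actual API:

```python
# Rough sketch of the query path (names are illustrative). Each streamed line
# from Ollama is a small JSON object; yielding its "response" field lets the
# caller relay tokens to the user as they arrive.
import json

import requests

def answer(question: str, vectordb, model: str = "llama3.2", k: int = 4):
    # 1. Retrieve the k most relevant chunks from ChromaDB.
    docs = vectordb.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Combine the retrieved context with the user's question.
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Send the enhanced prompt to Ollama and stream the response back.
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                yield json.loads(line).get("response", "")
```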
## Project Structure

```
RAG/
├── server.py            # Flask web server
├── rag_utils.py         # RAG utilities (ChromaDB, embeddings)
├── index.html           # Web interface
├── requirements.txt     # Python dependencies
├── documents/           # PDF files for RAG
│   ├── document1.pdf
│   ├── document2.pdf
│   └── ...
├── chroma_db/           # ChromaDB storage (auto-created)
└── chat_history.json    # Chat history storage
```
## Configuration

### Changing the Model

Edit `server.py` and modify the model name:

```python
payload = {
    'model': 'your-model-name',  # Change this
    'prompt': prompt,
    'stream': True
}
```

### Adjusting RAG Parameters

In `rag_utils.py`, you can modify:

- `chunk_size`: Size of text chunks (default: 1000)
- `chunk_overlap`: Overlap between chunks (default: 100)
- `k`: Number of retrieved documents (default: 4)
### Changing the Embedding Model

Change the embedding model in `rag_utils.py`:

```python
EMBED_MODEL = "your-embedding-model"
```

## Troubleshooting

### Ollama Connection Issues

- Ensure Ollama is running: `ollama serve`
- Check if your model is available: `ollama list`
- Verify the Ollama API is accessible at http://localhost:11434
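A quick way to confirm the API is reachable from Python is Ollama's `/api/tags` endpoint, which lists locally installed models; the default port is assumed here:

```python
# Check that the Ollama API answers on the default port and list installed models.
import requests

try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is reachable. Installed models:", models)
except requests.RequestException as exc:
    print("Ollama is not reachable:", exc)
```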
### PDF Processing Issues

- Ensure PDFs are readable and not corrupted
- Check file permissions in the `documents/` folder
- Large PDFs may take time to process on first run
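To confirm a PDF is readable before the app ingests it, you can open it with PyPDF2 directly; the file name below is just an example:

```python
# Open a PDF with PyPDF2 and preview the extracted text of the first page.
from PyPDF2 import PdfReader

reader = PdfReader("documents/document1.pdf")  # example file name
print(f"{len(reader.pages)} pages")
print((reader.pages[0].extract_text() or "")[:200])
```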
### Performance Issues

- Reduce `chunk_size` in `rag_utils.py` for large documents
- Use a smaller embedding model if needed
- Consider using a GPU for embeddings if available
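If a GPU is available, the LangChain `HuggingFaceEmbeddings` wrapper can be pointed at it via `model_kwargs`; this sketch assumes the default embedding model and a CUDA device:

```python
# Run the embedding model on GPU (use "cpu" if no CUDA device is available).
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},
)
```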
## Dependencies

- Flask: Web framework
- LangChain: LLM orchestration
- ChromaDB: Vector database
- HuggingFace: Embedding models
- PyPDF2: PDF processing
- Requests: HTTP client
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## License

This project is open source and available under the MIT License.
## Support

For issues and questions:
- Check the troubleshooting section above
- Ensure all dependencies are properly installed
- Verify Ollama is running and accessible