This repository provides a system for extracting information from research papers and answering questions based on their content. It leverages vector embeddings and a Qdrant vector database for efficient information retrieval, combined with a summarization agent to generate concise answers. This system was developed to assist in the process of writing a systematic literature review thesis.
- PDF Processing: Automatically ingests PDF documents from a specified directory.
- Text Embedding: Converts document chunks into numerical vector representations using `sentence-transformers/all-MiniLM-L6-v2`.
- Vector Database (Qdrant): Stores and retrieves document embeddings for fast similarity searches.
- Question Answering: Takes a list of predefined questions and retrieves relevant document snippets.
- Summary Generation: Uses a `SummaryAgent` to synthesize retrieved information into coherent answers.
- Output: Appends generated answers and original questions to an `output.txt` file.
- Database Management: Clears the Qdrant database upon completion.
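
The retrieval flow described in the list above (chunk, embed, store, search by similarity, clear) can be illustrated with a small standard-library sketch. This is not the repository's code: `toy_embed`, `chunk`, and `ToyVectorStore` are hypothetical names, the hashed bag-of-words embedding is a stand-in for `sentence-transformers/all-MiniLM-L6-v2`, and the in-memory list is a stand-in for a Qdrant collection. Only the overall shape of the pipeline is meant to carry over.

```python
from __future__ import annotations

import hashlib
import math
import re

DIM = 64  # toy dimensionality, chosen for illustration only


def toy_embed(text: str) -> list[float]:
    """Hash each word into one of DIM buckets and L2-normalise the counts.

    Assumption: a crude stand-in for a real sentence-embedding model.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


class ToyVectorStore:
    """In-memory stand-in for a vector database collection (hypothetical)."""

    def __init__(self) -> None:
        self.points: list[tuple[list[float], str]] = []

    def add(self, texts: list[str]) -> None:
        """Embed and store each text alongside its vector."""
        for t in texts:
            self.points.append((toy_embed(t), t))

    def search(self, query: str, limit: int = 2) -> list[tuple[float, str]]:
        """Return the `limit` stored texts with the highest cosine similarity.

        Vectors are unit length, so the dot product equals cosine similarity.
        """
        q = toy_embed(query)
        scored = [(sum(a * b for a, b in zip(q, v)), t) for v, t in self.points]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:limit]

    def clear(self) -> None:
        """Drop all stored points, mirroring the clean-up step."""
        self.points.clear()


if __name__ == "__main__":
    store = ToyVectorStore()
    store.add(chunk("Qdrant stores embeddings. PDF documents are split into chunks "
                    "before they are embedded and searched by similarity.", size=6))
    for score, text in store.search("How are PDF documents split?"):
        print(f"{score:.2f}  {text}")
    store.clear()
```

In the real system the retrieved snippets would then be passed to the summarization step; that part is omitted here because it depends on an external model.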
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Clone the repository:

      git clone https://github.com/Peppe-elefante/AI-SLR.git
- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required dependencies:

      pip install -r requirements.txt

  The `requirements.txt` file includes:

      qdrant_client
      fastembed
      google-generativeai
      docling
      dotenv
- Set up environment variables: Create a `.env` file in the root directory of your project with the following entries:

      GOOGLE_API_KEY = your Google API key
      QDRANT_URL = your Qdrant URL
- Place your PDF studies: Inside the folder named `Studi`, place all the studies needed for your SLR.
- Define your questions: Ensure your `utils/questions.py` file contains a list of strings, where each string is a question you want the system to answer. For example:

      # utils/questions.py
      questions = [
          "What is the main finding of the first study?",
          "How does the second study approach data analysis?",
          "What are the limitations mentioned in the third paper?",
      ]
- Run the main script:

      python main.py

After execution, the `output.txt` file in the root directory will contain the generated answers for each question.
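
Before running the script, it can help to confirm that the setup steps above are complete. The following is a minimal sketch of such a check, not part of the repository: `check_setup` and `REQUIRED_ENV_VARS` are hypothetical names, and the function only inspects values passed to it, so it assumes the variables from `.env` have already been loaded into the environment.

```python
from __future__ import annotations

import os
from pathlib import Path

# Variable names taken from the .env step; the check itself is an illustration.
REQUIRED_ENV_VARS = ("GOOGLE_API_KEY", "QDRANT_URL")


def check_setup(env: dict[str, str], pdf_dir: Path, questions: object) -> list[str]:
    """Return a list of setup problems; an empty list means nothing was found."""
    problems: list[str] = []

    # Both variables must be present and non-blank.
    for name in REQUIRED_ENV_VARS:
        if not env.get(name, "").strip():
            problems.append(f"environment variable {name} is missing or empty")

    # The studies folder must exist and hold at least one PDF.
    if not pdf_dir.is_dir():
        problems.append(f"folder {pdf_dir} does not exist")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_dir.iterdir()):
        problems.append(f"folder {pdf_dir} contains no PDF files")

    # The questions must be a non-empty list of non-blank strings.
    if not isinstance(questions, list) or not questions:
        problems.append("questions must be a non-empty list")
    elif not all(isinstance(q, str) and q.strip() for q in questions):
        problems.append("every question must be a non-empty string")

    return problems


if __name__ == "__main__":
    # os.environ only contains variables already exported to this process.
    sample_questions = ["What is the main finding of the first study?"]
    for problem in check_setup(dict(os.environ), Path("Studi"), sample_questions):
        print("setup problem:", problem)
```

Passing the environment, folder, and questions as arguments keeps the check easy to run against a temporary directory without touching the real project files.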