A comprehensive Retrieval-Augmented Generation (RAG) search engine built in Python, following the Boot.dev RAG course. This project demonstrates advanced search techniques including semantic search, keyword search with BM25, hybrid search, and LLM-augmented generation.
- Semantic Search: Uses sentence transformers for vector-based similarity search
- Keyword Search: Implements BM25 scoring with inverted indexes
- Hybrid Search: Combines semantic and keyword search using multiple fusion strategies
- Text Chunking: Fixed-size and semantic chunking for improved retrieval
- Augmented Generation: Integrates with Google's Gemini API for RAG responses
- Evaluation Framework: Metrics for precision, recall, and relevance scoring
- CLI Interface: Command-line tools for each search strategy
- Python >= 3.12
- Google Gemini API key (for augmented generation features)
- Clone the repository:
git clone <repository-url>
cd rag-search-engine
- Install dependencies:
pip install -e .
- Set up environment variables:
# Create a .env file with your Gemini API key
echo "GEMINI_API_KEY=your_api_key_here" > .env
- Download the movies dataset:
# Create data directory and download movies dataset
mkdir -p data
curl -o data/movies.json https://storage.googleapis.com/qvault-webapp-dynamic-assets/course_assets/course-rag-movies.json

# Verify the model
python cli/semantic_search_cli.py verify
# Search for movies
python cli/semantic_search_cli.py search "sci-fi space adventure"
# Chunk text for better semantic retrieval
python cli/semantic_search_cli.py semantic_chunk "Your long text here" --max-chunk-size 3
# Search with chunked embeddings
python cli/semantic_search_cli.py search_chunked "artificial intelligence" --limit 10

# Build inverted index
python cli/keyword_search_cli.py build
# Search with BM25
python cli/keyword_search_cli.py search "action thriller"
# Test BM25 parameters
python cli/keyword_search_cli.py bm25_test "your query" --k1 1.5 --b 0.75

# Weighted fusion of semantic and keyword search
python cli/hybrid_search_cli.py weighted "fantasy adventure" --alpha 0.6
# Reciprocal Rank Fusion (RRF)
python cli/hybrid_search_cli.py rrf "comedy romance" --k 60
# Convex Combination
python cli/hybrid_search_cli.py convex "drama mystery" --alpha 0.4

# RAG - search and generate answer
python cli/augmented_generation_cli.py rag "What are the best sci-fi movies about AI?"
# Summarize multiple documents
python cli/augmented_generation_cli.py summarize "comedy movies from the 90s" --limit 3
# Get citations
python cli/augmented_generation_cli.py citations "action movies with car chases" --limit 5

# Run evaluation against golden dataset
python cli/evaluation_cli.py --limit 5

rag-search-engine/
├── cli/ # Command-line interfaces
│ ├── lib/ # Core search logic
│ │ ├── semantic_search.py # Semantic search implementation
│ │ ├── keyword_search.py # BM25 keyword search
│ │ ├── hybrid_search.py # Hybrid search strategies
│ │ └── search_utils.py # Shared utilities
│ ├── semantic_search_cli.py # Semantic search CLI
│ ├── keyword_search_cli.py # Keyword search CLI
│ ├── hybrid_search_cli.py # Hybrid search CLI
│ ├── augmented_generation_cli.py # RAG CLI
│ ├── evaluation_cli.py # Evaluation CLI
│ └── test_gemini.py # Gemini API testing
├── data/ # Dataset files (gitignored)
├── cache/ # Cached embeddings/indexes (gitignored)
├── pyproject.toml # Python dependencies
└── AGENTS.md # Development guidelines
- Uses the all-MiniLM-L6-v2 sentence transformer model
- Cosine similarity for document ranking
- Supports both full-document and chunked search
- Caches embeddings for performance
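
To make the cosine-similarity ranking mentioned above concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the function names `cosine_similarity` and `rank_by_similarity` and the toy 3-dimensional vectors are assumptions for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, limit=5):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical toy embeddings standing in for real model output.
doc_vecs = {
    "space_opera": [0.9, 0.1, 0.0],
    "rom_com": [0.0, 0.2, 0.9],
    "alien_thriller": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
top = rank_by_similarity(query_vec, doc_vecs, limit=2)
```

Because cosine similarity ignores vector length, documents are ranked by direction in embedding space only, which is why it is a common default for sentence embeddings.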
- BM25 algorithm with configurable parameters (k1, b)
- Inverted index with caching
- Text preprocessing with stopwords and stemming
- Term frequency and document length normalization
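
The BM25 pieces listed above (inverted index, term frequency, document length normalization, k1 and b) fit together roughly as in the sketch below. It uses one common BM25 formulation with a smoothed IDF; the class name `BM25Index` and the whitespace tokenizer are assumptions, and stopword removal and stemming are left out for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified: lowercase + whitespace split (no stopwords, no stemming).
    return text.lower().split()


class BM25Index:
    """Hypothetical minimal inverted index with BM25 scoring."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        # Assumes at least one non-empty document.
        self.avg_len = sum(self.doc_len.values()) / len(self.doc_len)

    def idf(self, term):
        n = len(self.doc_len)
        df = len(self.postings.get(term, {}))
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    def search(self, query, limit=5):
        scores = defaultdict(float)
        for term in tokenize(query):
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                # b controls how strongly long documents are penalized.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                # k1 controls how quickly repeated terms saturate.
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
        return ranked[:limit]


docs = {
    "heist_film": "action thriller heist",
    "love_story": "romantic drama",
    "bus_film": "action bus drama romance comedy family",
}
index = BM25Index(docs)
results = index.search("action thriller")
```

Only documents that share at least one term with the query appear in the results, so `love_story` is never scored for this query.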
- Weighted Fusion: Linear combination of semantic and keyword scores
- Reciprocal Rank Fusion (RRF): Combines rankings with reciprocal scoring
- Convex Combination: Normalized score fusion
- Configurable fusion parameters
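
As a rough illustration of the fusion strategies above, the sketch below shows Reciprocal Rank Fusion and a min-max normalized weighted combination. The function names are hypothetical, and treating `alpha` as the weight on the semantic side is an assumption, not something this README specifies.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank), rank from 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def min_max(scores):
    """Rescale scores to [0, 1] so different scorers become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(semantic, keyword, alpha=0.5):
    """alpha * semantic + (1 - alpha) * keyword on normalized scores.

    Assumption: alpha weights the semantic side; missing documents score 0.
    """
    sem, kw = min_max(semantic), min_max(keyword)
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Rank-based fusion needs only the order of results.
rrf_results = rrf_fuse([["a", "b", "c"], ["b", "d", "a"]], k=60)

# Score-based fusion needs the raw scores from each retriever.
semantic_scores = {"a": 0.9, "b": 0.8, "c": 0.1}   # e.g. cosine similarities
keyword_scores = {"b": 12.0, "d": 7.0, "a": 2.0}   # e.g. BM25 scores
weighted_results = weighted_fuse(semantic_scores, keyword_scores, alpha=0.6)
```

RRF sidesteps the problem that cosine similarities and BM25 scores live on different scales, while the weighted variant keeps score magnitudes at the cost of needing normalization first.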
- Fixed-size chunking: Word-based chunks with overlap
- Semantic chunking: Sentence-based chunks preserving context
- Configurable chunk sizes and overlap
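
The two chunking strategies above can be sketched with a shared sliding window, as below. This is an illustrative version only: the function names, default sizes, and the regex-based sentence splitter are assumptions rather than the project's actual code.

```python
import re


def _windows(items, size, overlap):
    """Slide a window of `size` items forward by `size - overlap` each step."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(" ".join(items[start:start + size]))
        if start + size >= len(items):
            break  # the last window already reached the end of the text
    return chunks


def fixed_size_chunks(text, chunk_size=200, overlap=20):
    """Word-based chunks; consecutive chunks share `overlap` words."""
    return _windows(text.split(), chunk_size, overlap)


def sentence_chunks(text, max_sentences=3, overlap=1):
    """Sentence-based chunks; splits after '.', '!' or '?' followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return _windows(sentences, max_sentences, overlap)


word_chunks = fixed_size_chunks("a b c d e f g", chunk_size=4, overlap=1)
sent_chunks = sentence_chunks("One. Two! Three? Four.", max_sentences=2, overlap=0)
```

The overlap means a sentence or phrase that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.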
- Precision@k: Fraction of the top-k retrieved documents that are relevant
- Recall@k: Fraction of all relevant documents that appear in the top k
- Support for custom evaluation datasets
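
The two metrics above reduce to a few lines of Python. The sketch below assumes `retrieved` is an ordered list of document IDs and `relevant` is the set of IDs marked relevant in the golden dataset; the function names and the convention of dividing precision by `k` (even when fewer than `k` results come back) are assumptions.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing to find; defined as 0 here by convention
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5", "d9"}

p3 = precision_at_k(retrieved, relevant, k=3)  # 1 hit in 3 slots
r3 = recall_at_k(retrieved, relevant, k=3)     # 1 of 3 relevant docs found
p5 = precision_at_k(retrieved, relevant, k=5)  # 2 hits in 5 slots
r5 = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant docs found
```

Raising `k` can only keep recall the same or increase it, while precision may drop as less relevant results fill the extra slots.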
Uses Google's Gemini 2.0 Flash model for:
- Context-aware answer generation
- Document summarization
- Citation-based responses
- Query understanding and expansion
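
For context-aware, citation-based answers, the retrieved documents have to be packed into the prompt sent to the model. The sketch below shows only that prompt-assembly step and makes no Gemini API call; the function name `build_rag_prompt`, the `title` and `description` fields, and the prompt wording are all hypothetical.

```python
def build_rag_prompt(question, documents, max_chars_per_doc=500):
    """Build a prompt that numbers each document so the model can cite it."""
    context_lines = []
    for number, doc in enumerate(documents, start=1):
        # Truncate long descriptions to keep the prompt within a size budget.
        snippet = doc["description"][:max_chars_per_doc]
        context_lines.append(f"[{number}] {doc['title']}: {snippet}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered documents below. "
        "Cite sources by their bracketed number.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical search results; real ones would come from one of the search CLIs.
documents = [
    {"title": "Example Movie A", "description": "A robot learns to feel."},
    {"title": "Example Movie B", "description": "A crew explores deep space."},
]
prompt = build_rag_prompt("Which movie is about AI?", documents)
```

The resulting string would then be passed to whichever LLM client the project uses; numbering the documents is what lets the model's citations be mapped back to search results.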
# Test Gemini API connection
python cli/test_gemini.py
# Test individual components
python -c "from cli.lib.semantic_search import *; verify_model()"

This project follows the Boot.dev "Learn Retrieval Augmented Generation" course, which covers:
- Text preprocessing and normalization
- TF-IDF and inverted indexes
- BM25 keyword search
- Semantic search with embeddings
- Document chunking strategies
- Hybrid search techniques
- LLM integration and augmentation
- Reranking algorithms
- Evaluation metrics
- Multimodal search