RAG Search Engine

A Retrieval-Augmented Generation (RAG) search engine built in Python, following the Boot.dev RAG course. The project demonstrates semantic search, keyword search with BM25, hybrid search, and LLM-augmented generation.

🚀 Features

Semantic Search: Uses sentence transformers for vector-based similarity search
Keyword Search: Implements BM25 scoring with inverted indexes
Hybrid Search: Combines semantic and keyword search using multiple fusion strategies
Text Chunking: Fixed-size and semantic chunking for improved retrieval
Augmented Generation: Integrates with Google's Gemini API for RAG responses
Evaluation Framework: Metrics for precision, recall, and relevance scoring
CLI: Command-line tools for each search strategy

📋 Prerequisites

Python >= 3.12
Google Gemini API key (for augmented generation features)

🛠️ Installation

Clone the repository:

git clone <repository-url>
cd rag-search-engine

Install dependencies:

pip install -e .

Set up environment variables:

# Create a .env file with your Gemini API key
echo "GEMINI_API_KEY=your_api_key_here" > .env

Download the movies dataset:

# Create data directory and download movies dataset
mkdir -p data
curl -o data/movies.json https://storage.googleapis.com/qvault-webapp-dynamic-assets/course_assets/course-rag-movies.json

📖 Usage

Semantic Search

# Verify the model
python cli/semantic_search_cli.py verify

# Search for movies
python cli/semantic_search_cli.py search "sci-fi space adventure"

# Chunk text for better semantic retrieval
python cli/semantic_search_cli.py semantic_chunk "Your long text here" --max-chunk-size 3

# Search with chunked embeddings
python cli/semantic_search_cli.py search_chunked "artificial intelligence" --limit 10

Keyword Search

# Build inverted index
python cli/keyword_search_cli.py build

# Search with BM25
python cli/keyword_search_cli.py search "action thriller"

# Test BM25 parameters
python cli/keyword_search_cli.py bm25_test "your query" --k1 1.5 --b 0.75

Hybrid Search

# Weighted fusion of semantic and keyword search
python cli/hybrid_search_cli.py weighted "fantasy adventure" --alpha 0.6

# Reciprocal Rank Fusion (RRF)
python cli/hybrid_search_cli.py rrf "comedy romance" --k 60

# Convex Combination
python cli/hybrid_search_cli.py convex "drama mystery" --alpha 0.4

Augmented Generation

# RAG - search and generate answer
python cli/augmented_generation_cli.py rag "What are the best sci-fi movies about AI?"

# Summarize multiple documents
python cli/augmented_generation_cli.py summarize "comedy movies from the 90s" --limit 3

# Get citations
python cli/augmented_generation_cli.py citations "action movies with car chases" --limit 5

Evaluation

# Run evaluation against golden dataset
python cli/evaluation_cli.py --limit 5

🏗️ Project Structure

rag-search-engine/
├── cli/                        # Command-line interfaces
│   ├── lib/                   # Core search logic
│   │   ├── semantic_search.py # Semantic search implementation
│   │   ├── keyword_search.py  # BM25 keyword search
│   │   ├── hybrid_search.py   # Hybrid search strategies
│   │   └── search_utils.py    # Shared utilities
│   ├── semantic_search_cli.py # Semantic search CLI
│   ├── keyword_search_cli.py  # Keyword search CLI
│   ├── hybrid_search_cli.py   # Hybrid search CLI
│   ├── augmented_generation_cli.py # RAG CLI
│   ├── evaluation_cli.py     # Evaluation CLI
│   └── test_gemini.py        # Gemini API testing
├── data/                      # Dataset files (gitignored)
├── cache/                     # Cached embeddings/indexes (gitignored)
├── pyproject.toml             # Python dependencies
└── AGENTS.md                 # Development guidelines

🔧 Search Strategies

Semantic Search

Uses all-MiniLM-L6-v2 sentence transformer model
Cosine similarity for document ranking
Supports both full-document and chunked search
Caches embeddings for performance
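
As a rough illustration of the ranking step above, here is a minimal sketch of cosine-similarity ranking over precomputed embeddings. It uses plain Python lists in place of real model output, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical; they are not taken from cli/lib/semantic_search.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, limit=5):
    """Return up to `limit` (doc_id, score) pairs, best match first.

    doc_vecs maps a document id to its embedding (hypothetical layout).
    """
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

In a real pipeline the vectors would come from the sentence transformer model, and the cached embeddings would be loaded once rather than recomputed per query.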

Keyword Search

BM25 algorithm with configurable parameters (k1, b)
Inverted index with caching
Text preprocessing with stopwords and stemming
Term-frequency saturation (k1) and document-length normalization (b)
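
To make the roles of k1 and b concrete, the following is a minimal sketch of one common BM25 variant, scoring already-tokenized documents. It is an assumption that the project uses this exact IDF form; the function name `bm25_scores` and the dict-of-token-lists input are hypothetical, and a real implementation would read term statistics from the inverted index instead of recounting them.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25.

    docs maps doc_id -> list of preprocessed tokens (hypothetical layout).
    """
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # IDF with +1 inside the log so it never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # b controls how strongly long documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            # k1 controls how quickly repeated terms stop adding score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores
```

Setting b to 0 disables length normalization, and a larger k1 lets repeated occurrences of a term keep contributing for longer.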

Hybrid Search

Weighted Fusion: Linear combination of semantic and keyword scores
Reciprocal Rank Fusion (RRF): Combines rankings with reciprocal scoring
Convex Combination: Normalized score fusion
Configurable fusion parameters
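
The fusion strategies above can be sketched roughly as follows. This is an illustration under assumptions, not the code in cli/lib/hybrid_search.py: the function names are hypothetical, `alpha` is assumed to weight the keyword side, and scores are min-max normalized before mixing so the two scales are comparable.

```python
def min_max_normalize(scores):
    """Rescale a doc_id -> score mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(keyword_scores, semantic_scores, alpha=0.5):
    """Mix normalized scores: alpha * keyword + (1 - alpha) * semantic."""
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    doc_ids = set(kw) | set(sem)
    return {
        doc_id: alpha * kw.get(doc_id, 0.0) + (1 - alpha) * sem.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def rrf_fusion(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    Each document earns 1 / (k + rank) per list, with rank starting at 1.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused
```

RRF only looks at rank positions, so it needs no score normalization; the weighted approach keeps score magnitudes but depends on the normalization choice.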

Text Chunking

Fixed-size chunking: Word-based chunks with overlap
Semantic chunking: Sentence-based chunks preserving context
Configurable chunk sizes and overlap
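
Both chunking styles can be viewed as a sliding window, over words in one case and over sentences in the other. The sketch below assumes that chunk size is counted in words for fixed-size chunks and in sentences for semantic chunks, and it uses a simple punctuation-based sentence splitter; the function names are hypothetical.

```python
import re


def _sliding_window(items, size, overlap):
    """Yield consecutive windows of `size` items sharing `overlap` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(items), step):
        yield items[start:start + size]
        if start + size >= len(items):
            break  # the last window already reached the end


def fixed_size_chunks(text, chunk_size=200, overlap=0):
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(w) for w in _sliding_window(words, chunk_size, overlap)]


def sentence_chunks(text, max_chunk_size=4, overlap=0):
    """Split text into chunks of at most `max_chunk_size` sentences."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(s) for s in _sliding_window(sentences, max_chunk_size, overlap)]
```

Overlap repeats the tail of one chunk at the head of the next, which helps when a relevant passage straddles a chunk boundary.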

📊 Evaluation Metrics

Precision@k: Fraction of the top-k retrieved documents that are relevant
Recall@k: Fraction of all relevant documents that appear in the top-k results
Support for custom evaluation datasets
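
As a worked example of the two metrics, here is a minimal sketch that scores one ranked result list against a set of known-relevant ids. The function names are hypothetical, and the layout of the golden dataset is not assumed here; only a ranked list and a relevant set are needed.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved ids that are relevant.

    Divides by the number of results actually returned in the top k;
    some definitions divide by k itself instead.
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p3 = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r3 = recall_at_k(retrieved, relevant, k=3)     # 2 of the 3 relevant ids found
```

Raising k can only keep recall the same or increase it, while precision may fall as less relevant results enter the top k.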

🤖 LLM Integration

Uses Google's Gemini 2.0 Flash model for:

Context-aware answer generation
Document summarization
Citation-based responses
Query understanding and expansion
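
For context-aware and citation-based answers, the retrieved documents are typically placed into the prompt before the question. The sketch below only builds such a prompt string and does not call the Gemini API; the function name `build_rag_prompt`, the prompt wording, and the `title` and `description` document fields are all assumptions made for illustration.

```python
def build_rag_prompt(query, documents, limit=5):
    """Assemble a prompt that grounds the answer in retrieved documents.

    documents is a list of dicts with 'title' and 'description' keys
    (hypothetical field names).
    """
    context_lines = [
        f"[{i}] {doc['title']}: {doc['description']}"
        for i, doc in enumerate(documents[:limit], start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered documents below. "
        "Cite documents by their number, for example [1].\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting string would then be sent to the model; numbering the documents gives the model stable labels to cite.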

🧪 Testing

# Test Gemini API connection
python cli/test_gemini.py

# Test individual components
python -c "from cli.lib.semantic_search import *; verify_model()"

📚 Learning Resources

This project follows the Boot.dev "Learn Retrieval Augmented Generation" course, which covers:

Text preprocessing and normalization
TF-IDF and inverted indexes
BM25 keyword search
Semantic search with embeddings
Document chunking strategies
Hybrid search techniques
LLM integration and augmentation
Reranking algorithms
Evaluation metrics
Multimodal search

🔗 Related Links

Boot.dev RAG Course
