gabyrod7/rag-search-engine

RAG Search Engine

A Retrieval-Augmented Generation (RAG) search engine built in Python, following the Boot.dev RAG course. The project implements semantic search, keyword search with BM25, hybrid search, and LLM-augmented generation.

🚀 Features

  • Semantic Search: Uses sentence transformers for vector-based similarity search
  • Keyword Search: Implements BM25 scoring with inverted indexes
  • Hybrid Search: Combines semantic and keyword search using multiple fusion strategies
  • Text Chunking: Fixed-size and semantic chunking for improved retrieval
  • Augmented Generation: Integrates with Google's Gemini API for RAG responses
  • Evaluation Framework: Metrics for precision, recall, and relevance scoring
  • CLI Interface: Command-line tools for each search strategy

📋 Prerequisites

  • Python >= 3.12
  • Google Gemini API key (for augmented generation features)

🛠️ Installation

  1. Clone the repository:
git clone <repository-url>
cd rag-search-engine
  2. Install dependencies:
pip install -e .
  3. Set up environment variables:
# Create a .env file with your Gemini API key
echo "GEMINI_API_KEY=your_api_key_here" > .env
  4. Download the movies dataset:
# Create data directory and download movies dataset
mkdir -p data
curl -o data/movies.json https://storage.googleapis.com/qvault-webapp-dynamic-assets/course_assets/course-rag-movies.json

📖 Usage

Semantic Search

# Verify the model
python cli/semantic_search_cli.py verify

# Search for movies
python cli/semantic_search_cli.py search "sci-fi space adventure"

# Chunk text for better semantic retrieval
python cli/semantic_search_cli.py semantic_chunk "Your long text here" --max-chunk-size 3

# Search with chunked embeddings
python cli/semantic_search_cli.py search_chunked "artificial intelligence" --limit 10

Keyword Search

# Build inverted index
python cli/keyword_search_cli.py build

# Search with BM25
python cli/keyword_search_cli.py search "action thriller"

# Test BM25 parameters
python cli/keyword_search_cli.py bm25_test "your query" --k1 1.5 --b 0.75

Hybrid Search

# Weighted fusion of semantic and keyword search
python cli/hybrid_search_cli.py weighted "fantasy adventure" --alpha 0.6

# Reciprocal Rank Fusion (RRF)
python cli/hybrid_search_cli.py rrf "comedy romance" --k 60

# Convex Combination
python cli/hybrid_search_cli.py convex "drama mystery" --alpha 0.4

Augmented Generation

# RAG - search and generate answer
python cli/augmented_generation_cli.py rag "What are the best sci-fi movies about AI?"

# Summarize multiple documents
python cli/augmented_generation_cli.py summarize "comedy movies from the 90s" --limit 3

# Get citations
python cli/augmented_generation_cli.py citations "action movies with car chases" --limit 5

Evaluation

# Run evaluation against golden dataset
python cli/evaluation_cli.py --limit 5

🏗️ Project Structure

rag-search-engine/
├── cli/                        # Command-line interfaces
│   ├── lib/                   # Core search logic
│   │   ├── semantic_search.py # Semantic search implementation
│   │   ├── keyword_search.py  # BM25 keyword search
│   │   ├── hybrid_search.py   # Hybrid search strategies
│   │   └── search_utils.py    # Shared utilities
│   ├── semantic_search_cli.py # Semantic search CLI
│   ├── keyword_search_cli.py  # Keyword search CLI
│   ├── hybrid_search_cli.py   # Hybrid search CLI
│   ├── augmented_generation_cli.py # RAG CLI
│   ├── evaluation_cli.py     # Evaluation CLI
│   └── test_gemini.py        # Gemini API testing
├── data/                      # Dataset files (gitignored)
├── cache/                     # Cached embeddings/indexes (gitignored)
├── pyproject.toml             # Python dependencies
└── AGENTS.md                 # Development guidelines

🔧 Search Strategies

Semantic Search

  • Uses all-MiniLM-L6-v2 sentence transformer model
  • Cosine similarity for document ranking
  • Supports both full-document and chunked search
  • Caches embeddings for performance
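
As a rough illustration of the ranking step above, here is a minimal sketch of cosine-similarity ranking in plain Python. It is not the project's code: the function names and the toy 3-dimensional vectors are assumptions made for the example, and real embeddings would come from the sentence transformer model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, limit=5):
    """Return up to `limit` (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical toy vectors standing in for real document embeddings.
doc_vecs = {
    "space_opera": [0.9, 0.1, 0.0],
    "rom_com": [0.0, 0.2, 0.9],
    "heist": [0.4, 0.8, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
ranking = rank_by_similarity(query_vec, doc_vecs, limit=2)
```

Because cosine similarity ignores vector length, only the direction of each embedding affects the ranking.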

Keyword Search

  • BM25 algorithm with configurable parameters (k1, b)
  • Inverted index with caching
  • Text preprocessing with stopwords and stemming
  • Term frequency and document length normalization
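
The sketch below shows one common form of BM25 scoring over an inverted index, to make the roles of k1 and b concrete. It is an illustration under assumptions, not the project's implementation: tokenization is a plain lowercase split (stopword removal and stemming are left out), and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified: lowercase and split on whitespace; no stopwords or stemming.
    return text.lower().split()


def build_index(docs):
    """Build term -> {doc_id: term frequency} plus a doc_id -> length map."""
    index = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, doc_lengths


def bm25_search(query, index, doc_lengths, k1=1.5, b=0.75):
    """Score documents with BM25; k1 controls tf saturation, b controls length normalization."""
    if not doc_lengths:
        return []
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        df = len(postings)
        if df == 0:
            continue
        # Smoothed IDF variant that stays positive for very common terms.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for doc_id, tf in postings.items():
            length_norm = 1 - b + b * doc_lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents.
docs = {
    1: "fast action thriller with car chases",
    2: "quiet romantic drama",
    3: "action comedy",
}
index, doc_lengths = build_index(docs)
results = bm25_search("action thriller", index, doc_lengths)
```

Setting b to 0 disables length normalization entirely, while a larger k1 lets repeated terms keep adding to the score for longer before saturating.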

Hybrid Search

  • Weighted Fusion: Linear combination of semantic and keyword scores
  • Reciprocal Rank Fusion (RRF): Combines rankings with reciprocal scoring
  • Convex Combination: Normalized score fusion
  • Configurable fusion parameters
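
To make the fusion strategies above concrete, here is a small sketch of score-based fusion (min-max normalization followed by an alpha-weighted sum) and Reciprocal Rank Fusion. The function names are hypothetical, and treating alpha as the weight on the keyword score is an assumption; the project may assign it the other way around.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} map to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(keyword_scores, semantic_scores, alpha=0.5):
    """alpha * keyword + (1 - alpha) * semantic, on normalized scores.

    Assumption: alpha weights the keyword side. Documents missing from one
    result list contribute 0.0 for that side.
    """
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    fused = {
        doc_id: alpha * kw.get(doc_id, 0.0) + (1 - alpha) * sem.get(doc_id, 0.0)
        for doc_id in set(kw) | set(sem)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


def rrf_fusion(rankings, k=60):
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a document's score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores from the two retrievers.
keyword_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
semantic_scores = {"a": 0.2, "b": 0.9, "d": 0.5}
weighted = weighted_fusion(keyword_scores, semantic_scores, alpha=0.6)
rrf = rrf_fusion([["a", "b", "c"], ["b", "d", "a"]], k=60)
```

RRF uses only rank positions, so it needs no score normalization; the weighted sum depends on the two score scales being made comparable first.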

Text Chunking

  • Fixed-size chunking: Word-based chunks with overlap
  • Semantic chunking: Sentence-based chunks preserving context
  • Configurable chunk sizes and overlap
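
The two chunking strategies can be sketched as sliding windows over words and over sentences. This is a minimal sketch with hypothetical function names; it assumes the chunk size counts words in the fixed-size case and sentences in the semantic case, and it uses a simple punctuation-based sentence splitter.

```python
import re


def _windows(items, size, overlap):
    """Slide a window of `size` items forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(items), step):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end of the input
    return windows


def fixed_size_chunks(text, chunk_size=200, overlap=20):
    """Word-based chunks where consecutive chunks share `overlap` words."""
    return [" ".join(w) for w in _windows(text.split(), chunk_size, overlap)]


def sentence_chunks(text, max_chunk_size=3, overlap=0):
    """Sentence-based chunks of at most `max_chunk_size` sentences each."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(w) for w in _windows(sentences, max_chunk_size, overlap)]


word_chunks = fixed_size_chunks("a b c d e f g", chunk_size=3, overlap=1)
sent_chunks = sentence_chunks("One. Two! Three? Four.", max_chunk_size=2)
```

Overlap repeats the tail of one chunk at the head of the next, so text that straddles a chunk boundary still appears intact in at least one chunk.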

📊 Evaluation Metrics

  • Precision@k: Fraction of the top k retrieved documents that are relevant
  • Recall@k: Fraction of all relevant documents that appear in the top k results
  • Support for custom evaluation datasets
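
As a worked example of the two metrics, the sketch below computes them for one query. The function names and the sample document IDs are hypothetical; precision here divides by k, which is the usual convention, though the project's evaluation script may handle short result lists differently.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical results for one query and its golden relevance set.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

p5 = precision_at_k(retrieved, relevant, k=5)  # 2 relevant hits out of 5 retrieved
r5 = recall_at_k(retrieved, relevant, k=5)     # 2 of the 3 relevant documents found
```

Raising k can only keep recall the same or increase it, while precision usually falls as less relevant results are included.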

🤖 LLM Integration

Uses Google's Gemini 2.0 Flash model for:

  • Context-aware answer generation
  • Document summarization
  • Citation-based responses
  • Query understanding and expansion
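
The augmentation step amounts to placing retrieved documents into the prompt before the question. The sketch below shows one way to assemble such a prompt with numbered sources for citation. It makes no API call, and everything in it is an assumption for illustration: the function name, the prompt wording, and the `title` and `description` field names are not taken from the project.

```python
def build_rag_prompt(question, documents):
    """Assemble a prompt that asks the model to answer from numbered sources.

    `documents` is a list of dicts with hypothetical `title` and
    `description` keys, typically the top results from a search step.
    """
    context_lines = [
        f"[{i}] {doc['title']}: {doc['description']}"
        for i, doc in enumerate(documents, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered documents below. "
        "Cite sources as [n].\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved documents.
documents = [
    {"title": "Movie A", "description": "A robot learns to paint."},
    {"title": "Movie B", "description": "A crew explores a distant moon."},
]
prompt = build_rag_prompt("Which movie is about a robot?", documents)
```

The resulting string would then be sent to the LLM; numbering the sources gives the model stable labels to cite in its answer.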

🧪 Testing

# Test Gemini API connection
python cli/test_gemini.py

# Test individual components
python -c "from cli.lib.semantic_search import *; verify_model()"

📚 Learning Resources

This project follows the Boot.dev "Learn Retrieval Augmented Generation" course, which covers:

  1. Text preprocessing and normalization
  2. TF-IDF and inverted indexes
  3. BM25 keyword search
  4. Semantic search with embeddings
  5. Document chunking strategies
  6. Hybrid search techniques
  7. LLM integration and augmentation
  8. Reranking algorithms
  9. Evaluation metrics
  10. Multimodal search
