
Lakehouse-to-RAG Pipeline

An end-to-end data pipeline that scrapes websites, processes the data through a lakehouse architecture, and serves it for question answering through a RAG (Retrieval-Augmented Generation) API.

🏗️ Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Scraper   │───▶│   MinIO     │───▶│   Spark     │───▶│   Delta     │
│             │    │  (Storage)  │    │  (ETL)      │    │   Lake      │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                            │
┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│   RAG API   │◀───│   Chroma    │◀───│ Embeddings  │◀──────┘
│  (FastAPI)  │    │ (Vector DB) │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
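
The right-hand path of the diagram (Delta → Embeddings → Chroma → RAG API) amounts to writing document text into a Chroma collection and querying it at request time. A minimal sketch using the chromadb client; the collection name, storage path, and metadata fields here are illustrative, not the project's actual values:

import chromadb

# Persist the vector store locally; the path is a placeholder
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("documents")

# Index a scraped document (Chroma embeds the text with its default model)
collection.add(
    ids=["doc-1"],
    documents=["Example page text promoted from the gold layer."],
    metadatas=[{"url": "https://example.com"}],
)

# Retrieve the most relevant documents for a question
results = collection.query(query_texts=["What is the main topic?"], n_results=3)
print(results["documents"])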

🚀 Features

  • Web Scraping: Configurable scraper with single-page and full-site crawl modes
  • Data Lake: MinIO-based storage with bronze/silver/gold layers
  • ETL Pipeline: Spark-powered transformations with Delta Lake
  • Quality Monitoring: Comprehensive data quality checks
  • Vector Search: Chroma-based embeddings and retrieval
  • RAG API: FastAPI service for question answering
  • Orchestration: Airflow DAGs for pipeline automation
  • Monitoring: Detailed logging and statistics

📁 Project Structure

lakehouse-to-rag/
├── README.md                 # This file
├── docker-compose.yaml       # Service orchestration
├── Makefile                  # Development commands
├── architecture.png          # System architecture diagram
│
├── src/                      # Source code
│   ├── scraper/             # Web scraping module
│   │   ├── scraper.py       # Main scraper class
│   │   ├── minio_utils.py   # MinIO integration
│   │   ├── __main__.py      # CLI interface
│   │   ├── requirements.txt # Dependencies
│   │   └── Dockerfile       # Container definition
│   │
│   ├── api/                 # FastAPI service
│   │   ├── main.py          # API endpoints
│   │   ├── requirements.txt # Dependencies
│   │   └── Dockerfile       # Container definition
│   │
│   ├── helpers/             # Shared utilities
│   │   ├── duckdb_queries.py # Query utilities
│   │   ├── delta_queries.py  # Delta table queries
│   │   └── requirements.txt  # Dependencies
│   │
│   └── tests/               # Test suite
│       ├── test_scraper.py  # Scraper tests
│       ├── test_etl.py      # ETL tests
│       └── test_api.py      # API tests
│
├── airflow/                  # Airflow configuration
│   └── dags/                # Pipeline DAGs
│       ├── scrape_etl_dag.py # Main ETL pipeline
│       ├── etl_utils.py      # ETL utilities
│       └── requirements.txt  # DAG dependencies
│
├── config/                   # Configuration files
│   ├── scraper/             # Scraper configurations
│   │   └── example.yaml     # Example scraper config
│   ├── etl/                 # ETL configurations
│   └── api/                 # API configurations
│
├── data/                     # Data storage
│   ├── delta/               # Delta Lake tables
│   │   ├── bronze/          # Raw data
│   │   ├── silver/          # Cleaned data
│   │   └── gold/            # Final data
│   └── lineage/             # Data lineage logs
│
└── tests/                    # Integration tests
    ├── test_pipeline.py     # End-to-end tests
    └── test_integration.py  # Service integration tests

🛠️ Setup

Prerequisites

  • Docker and Docker Compose
  • Python 3.10+
  • 4GB+ RAM available

Quick Start

  1. Clone the repository:

    git clone <repository-url>
    cd lakehouse-to-rag
  2. Start services:

    docker compose up -d
  3. Access services:

    The RAG API is served at http://localhost:8001; see docker-compose.yaml for the other exposed services (MinIO, Airflow).

Development Setup

  1. Create virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # Linux/Mac
    # or
    .venv\Scripts\activate     # Windows
  2. Install dependencies:

    pip install -r src/scraper/requirements.txt
    pip install -r src/api/requirements.txt
    pip install -r src/helpers/requirements.txt
  3. Run tests:

    make test

📖 Usage

Web Scraping

Using the CLI:

# Single page scraping
python -m scraper --url https://example.com \
  --selectors '{"title": "h1", "content": ".main"}'

# Site crawling
python -m scraper --config config/scraper/example.yaml \
  --crawl --max-pages 50 --stats

Using Airflow:

  1. Set the Airflow Variables listed under Configuration below
  2. Trigger the lakehouse_etl_pipeline DAG
  3. Monitor progress in the Airflow UI

Data Pipeline

The ETL pipeline processes data through three stages:

  1. Bronze: Raw scraped data with basic cleaning
  2. Silver: Cleaned and normalized data
  3. Gold: Deduplicated and enriched data
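
A quick way to inspect any stage outside of Spark is the deltalake Python package. A minimal sketch reading the gold layer; the table name documents is a placeholder, and the project's own Delta queries live in src/helpers/delta_queries.py:

from deltalake import DeltaTable

# Load the gold-layer table into pandas for inspection (table name is a placeholder)
gold = DeltaTable("data/delta/gold/documents")
df = gold.to_pandas()

print(len(df), "records in gold")
print(df.head())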

RAG API

Query the knowledge base:

curl -X POST "http://localhost:8001/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic?"}'
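
The same request from Python with requests; the response schema depends on the API, so this just prints the raw JSON:

import requests

# Ask a question against the RAG API (same endpoint as the curl call above)
response = requests.post(
    "http://localhost:8001/ask",
    json={"question": "What is the main topic?"},
    timeout=30,
)
response.raise_for_status()
print(response.json())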

🔧 Configuration

Scraper Configuration

The project includes comprehensive configuration examples in sample_config/:

Basic Example (sample_config/example.yaml):

site_url: "https://example.com"
selectors:
  title: "h1, .title, .page-title"
  content: ".content, .main-content, .article-content"
  author: ".author, .byline"
  date: ".date, .published-date"

advanced:
  rate_limit: 1.0
  timeout: 30
  max_retries: 3
  min_content_length: 100
  respect_robots: true
  max_pages: 50
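
The config is plain YAML, so it is easy to inspect or generate programmatically. A minimal sketch with PyYAML; the scraper has its own loader, so this is for illustration only:

import yaml

# Load the example config and read a few fields (keys match the YAML above)
with open("sample_config/example.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["site_url"])
print(cfg["selectors"]["title"])
print(cfg["advanced"]["max_pages"])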

Comprehensive Examples (sample_config/config.yaml):

  • Blog/News sites
  • Documentation sites
  • E-commerce pages
  • Academic papers
  • API documentation
  • Forum/Community sites
  • Portfolio sites

Usage Examples:

# Use basic example (Project Gutenberg)
python -m scraper --config sample_config/example.yaml --crawl

# List available examples
python -m scraper --config sample_config/config.yaml --list-examples

# Use specific Project Gutenberg example
python -m scraper --config sample_config/config.yaml --example gutenberg_classics --crawl

# Use catalog browsing
python -m scraper --config sample_config/config.yaml --example gutenberg_catalog --crawl --max-pages 10

# Use author pages
python -m scraper --config sample_config/config.yaml --example gutenberg_authors --crawl

# Create custom configuration
cp sample_config/example.yaml my_gutenberg_config.yaml
# Edit my_gutenberg_config.yaml with your preferences
python -m scraper --config my_gutenberg_config.yaml --crawl

Test the Configuration:

# Test the Project Gutenberg configuration
python src/scraper/test_config.py

Airflow Variables

Set these in the Airflow UI under Admin → Variables:

  • scraper_site_url: Target website URL
  • scraper_selectors: YAML string of CSS selectors
  • scraper_max_pages: Maximum pages to crawl
  • scraper_crawl: Enable/disable crawling
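
Inside a DAG or task these can be read with Airflow's Variable API. A minimal sketch; the actual DAG in airflow/dags/scrape_etl_dag.py may read them differently:

import yaml
from airflow.models import Variable

# Read the scraper settings configured under Admin → Variables
site_url = Variable.get("scraper_site_url")
selectors = yaml.safe_load(Variable.get("scraper_selectors"))
max_pages = int(Variable.get("scraper_max_pages", default_var="50"))
crawl = Variable.get("scraper_crawl", default_var="true").lower() == "true"

print(site_url, selectors, max_pages, crawl)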

🧪 Testing

Unit Tests

# Test scraper
python -m pytest src/tests/test_scraper.py

# Test ETL
python -m pytest src/tests/test_etl.py

# Test API
python -m pytest src/tests/test_api.py

Integration Tests

# End-to-end pipeline test
python -m pytest tests/test_pipeline.py

# Service integration test
python -m pytest tests/test_integration.py

📊 Monitoring

Data Quality

The pipeline includes data quality checks covering:

  • Record counts at each stage
  • Content length analysis
  • Missing value detection
  • Duplicate identification
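
The same kinds of checks can also be run ad hoc with pandas. A sketch against the silver layer; the table and column names (pages, title, content, url) are placeholders:

from deltalake import DeltaTable

# Load one stage into pandas for a quick quality report
df = DeltaTable("data/delta/silver/pages").to_pandas()

report = {
    "record_count": len(df),
    "avg_content_length": df["content"].str.len().mean(),
    "missing_titles": int(df["title"].isna().sum()),
    "duplicate_urls": int(df.duplicated(subset=["url"]).sum()),
}
print(report)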

Logging

  • Scraper: Progress and error logs
  • ETL: Transformation statistics
  • API: Request/response logs
  • Airflow: Task execution logs

🚀 Deployment

Production Setup

  1. Environment Variables (see the connection sketch after this list):

    export MINIO_ENDPOINT=your-minio-endpoint
    export MINIO_ACCESS_KEY=your-access-key
    export MINIO_SECRET_KEY=your-secret-key
  2. Docker Compose:

    docker compose -f docker-compose.prod.yaml up -d
  3. Monitoring:

    • Set up Prometheus/Grafana for metrics
    • Configure alerting for pipeline failures
    • Monitor resource usage
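
The environment variables from step 1 are enough to verify connectivity to object storage. A minimal sketch with the minio Python client; the bucket layout is whatever your deployment uses:

import os
from minio import Minio

# Connect using the environment variables exported above
client = Minio(
    os.environ["MINIO_ENDPOINT"],              # host:port, without a scheme
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=True,                               # set to False for plain HTTP
)

# List buckets to confirm the credentials work
for bucket in client.list_buckets():
    print(bucket.name)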

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Development Guidelines

  • Follow PEP 8 style guide
  • Add type hints to all functions
  • Include docstrings for all classes and methods
  • Write tests for new features
  • Update documentation as needed

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Issues: Create an issue on GitHub
  • Documentation: Check the docs/ directory
  • Community: Join our discussions

Built with ❤️ for the data community
