Retrieval-augmented docs ingestion stack: Firecrawl + Crawl4AI + Qdrant vector search with FastAPI and MCP interfaces for AI engineers.

AI Documentation Vector Database Hybrid Scraper

Python 3.11+ · Vector DB: Qdrant · License: MIT

AI-focused documentation ingestion and retrieval stack that combines Firecrawl- and Crawl4AI-powered scraping with a Qdrant vector database. The project exposes both FastAPI and MCP interfaces, offers mode-aware configuration (solo-developer vs. enterprise feature sets), and ships with tooling for embeddings, hybrid search, retrieval-augmented generation (RAG) workflows, and operational monitoring.

Overview

The system ingests documentation sources, generates dense and sparse embeddings, stores them in Qdrant, and serves hybrid search and RAG building blocks. It is built for AI engineers who need reliable documentation ingestion pipelines, reproducible retrieval quality, and integration points for agents or applications.

Highlights

  • Multi-tier crawling orchestration (src/services/browser/unified_manager.py) covering lightweight HTTP, Crawl4AI, browser-use, Playwright, and Firecrawl, plus a resumable bulk embedder CLI (src/crawl4ai_bulk_embedder.py).
  • Hybrid retrieval stack leveraging OpenAI and FastEmbed embeddings, SPLADE sparse vectors, reranking, and HyDE augmentation through the modular Qdrant service (src/services/vector_db/ and src/services/hyde/).
  • Dual interfaces: REST endpoints in FastAPI (src/api/routers/simple/) and a FastMCP server (src/unified_mcp_server.py) that registers search, document management, analytics, and content intelligence tools for Claude Desktop / Code.
  • Observability built in: Prometheus instrumentation, structured logging, health checks, optional Dragonfly cache + ARQ worker, and configuration-driven monitoring (src/services/monitoring/).
  • Developer ergonomics with uv-managed environments, dependency-injector driven service wiring, Ruff + pytest quality gates, and a unified developer CLI (scripts/dev.py).

Table of Contents

  • Architecture
  • Core Components
  • Quick Start
  • Configuration
  • Testing & Quality
  • Documentation & Resources
  • Contributing
  • License

Architecture

```mermaid
flowchart LR
    subgraph clients["Clients"]
        mcp["Claude Desktop / MCP"]
        rest["REST / CLI clients"]
    end

    subgraph api["FastAPI application"]
        router["Mode-aware routers"]
        factory["Service factory"]
    end

    subgraph processing["Processing layer"]
        crawl["Unified crawling manager"]
        embed["Embedding manager"]
        search["Hybrid retrieval"]
        queue["ARQ task queue"]
    end

    subgraph data["Storage & caching"]
        qdrant[("Qdrant vector DB")]
        redis[("Redis / Dragonfly cache")]
        storage["Local docs & artifacts"]
    end

    subgraph observability["Observability"]
        metrics["Prometheus exporter"]
        health["Health & diagnostics"]
    end

    mcp --> api
    rest --> api
    api --> processing
    processing --> crawl
    processing --> embed
    processing --> search
    processing --> queue
    crawl --> firecrawl["Firecrawl API"]
    crawl --> crawl4ai["Crawl4AI"]
    crawl --> browseruse["browser-use / Playwright"]
    embed --> openai["OpenAI"]
    embed --> fastembed["FastEmbed / FlagEmbedding"]
    search --> qdrant
    queue --> redis
    processing --> redis
    api --> metrics
    metrics --> observability
    processing --> health
    health --> observability
```

Core Components

Crawling & Ingestion

  • UnifiedBrowserManager selects the right automation tier and tracks quality metrics.
  • Firecrawl and Crawl4AI adapters plus browser-use / Playwright integrations cover static and dynamic sites.
  • src/crawl4ai_bulk_embedder.py streams bulk ingestion, chunking, and embedding into Qdrant with resumable state and progress reporting.
  • docs/users/web-scraping.md and docs/users/examples-and-recipes.md include tier selection guidance and code samples.
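The resumable state the bulk embedder relies on can be pictured with a small checkpoint file: finished URLs are persisted as they complete, so a restarted run skips them. This is an illustrative stdlib-only sketch, not the CLI's actual implementation; the file name and schema are assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_done(path: Path) -> set[str]:
    # Return the set of URLs already processed, or an empty set on first run.
    return set(json.loads(path.read_text())) if path.exists() else set()


def mark_done(path: Path, done: set[str], url: str) -> None:
    # Persist a finished URL so a restarted run skips it.
    done.add(url)
    path.write_text(json.dumps(sorted(done)))


checkpoint = Path(tempfile.mkdtemp()) / "ingest_checkpoint.json"
done = load_done(checkpoint)
for url in ["https://example.com/a", "https://example.com/b"]:
    if url in done:
        continue  # resume: skip work that is already persisted
    # ... scrape, chunk, embed, and upsert the page here ...
    mark_done(checkpoint, done, url)
print(len(done))  # 2
```

Because the checkpoint is written after each URL, a crash mid-run loses at most the page in flight.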

Vector Search & Retrieval

  • src/services/vector_db/ wraps collection management, hybrid search orchestration, adaptive fusion, and payload indexing.
  • Dense embeddings via OpenAI or FastEmbed, optional sparse vectors via SPLADE, and reranking hooks are configurable through Pydantic models (src/config/models.py).
  • HyDE augmentation and caching live under src/services/hyde/, enabling query expansion for RAG pipelines.
  • Search responses return timing, scoring metadata, and diagnostics suitable for observability dashboards.
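To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge dense and sparse result lists; the project's adaptive fusion may use a different scheme, and the document IDs below are illustrative.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Score each doc by summing 1 / (k + rank) over every ranking it appears in;
    # k dampens the influence of top-ranked outliers from any single ranker.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)


dense = ["doc3", "doc1", "doc2"]   # ranked by dense embedding similarity
sparse = ["doc1", "doc4", "doc3"]  # ranked by sparse (e.g. SPLADE) term match
fused = rrf([dense, sparse])
print(fused[0])  # doc1: strong in both lists beats doc3's single first place
```

RRF needs only ranks, not comparable scores, which is why it is a popular default when the dense and sparse scorers live on different scales.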

Interfaces & Tooling

  • FastAPI routes (/api/v1/search, /api/v1/documents, /api/v1/collections) expose the core ingestion and retrieval capabilities.
  • The FastMCP server (src/unified_mcp_server.py) registers search, document, embedding, scraping, analytics, cache, and content intelligence tool modules (src/mcp_tools/).
  • Developer CLI (scripts/dev.py) manages services, testing profiles, benchmarks, linting, and type checking.
  • Example notebooks and scripts under examples/ demonstrate agentic RAG flows and advanced search orchestration.
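As a sketch of how a client might call the search route, the payload below uses field names that are assumptions modeled on typical hybrid-search APIs, not a verbatim schema from this project; check the interactive OpenAPI docs at /docs for the real request model.

```python
import json

# Hypothetical request body for POST /api/v1/search; every field name here
# is an assumption for illustration only.
payload = {
    "query": "how do I configure hybrid search?",
    "collection": "docs",
    "limit": 5,
    "enable_hyde": False,  # toggle HyDE query expansion, if exposed
}
body = json.dumps(payload)
# A client would then POST `body` with Content-Type: application/json, e.g. via
# urllib.request.Request("http://localhost:8000/api/v1/search", data=body.encode(),
#                        headers={"Content-Type": "application/json"}, method="POST")
print(json.loads(body)["limit"])
```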

Observability & Operations

  • Prometheus metrics and health endpoints instrument both the API and MCP servers; see config/prometheus.yml and docs/operators/monitoring.md.
  • Optional Dragonfly cache, PostgreSQL, ARQ workers, and Grafana dashboards are provisioned via docker-compose.yml profiles.
  • Structured logging and rate limiting middleware are wired through the service factory and CORS/middleware managers (src/services/fastapi/middleware/).

Quick Start

Prerequisites

  • Python 3.11 (or 3.12) and uv for dependency management.
  • A running Qdrant instance (a local Docker container works: docker compose --profile simple up -d qdrant).
  • API keys for the providers you plan to use (e.g., OPENAI_API_KEY, AI_DOCS__FIRECRAWL__API_KEY).

Environment variables

| Variable | Purpose | Example |
| --- | --- | --- |
| `AI_DOCS__MODE` | Selects `simple` or `enterprise` service wiring. | `AI_DOCS__MODE=enterprise` |
| `AI_DOCS__QDRANT__URL` | Points services at your Qdrant instance. | `http://localhost:6333` |
| `OPENAI_API_KEY` | Enables OpenAI embeddings and HyDE prompts. | `sk-...` |
| `AI_DOCS__FIRECRAWL__API_KEY` | Authenticates Firecrawl API usage. | `fc-...` |
| `AI_DOCS__CACHE__REDIS_URL` | Enables Dragonfly/Redis caching layers. | `redis://localhost:6379` |
| `FASTMCP_TRANSPORT` | Chooses MCP transport (`streamable-http` or `stdio`). | `streamable-http` |
| `FASTMCP_HOST` / `FASTMCP_PORT` | Hostname and port for MCP HTTP transport. | `0.0.0.0` / `8001` |
| `FASTMCP_BUFFER_SIZE` | Tunes MCP stream buffer size (bytes). | `8192` |

Store secrets in a .env file or your secrets manager and export them before running the services.

Clone & Install

```bash
git clone https://github.com/BjornMelin/ai-docs-vector-db-hybrid-scraper
cd ai-docs-vector-db-hybrid-scraper
uv sync --dev
```

Run the FastAPI application

```bash
# Ensure Qdrant is reachable at http://localhost:6333
export OPENAI_API_KEY="sk-..."                 # optional if using OpenAI
export AI_DOCS__FIRECRAWL__API_KEY="fc-..."    # optional but recommended
uv run python -m src.api.main
```

Visit http://localhost:8000/docs for interactive OpenAPI docs. Default mode is simple; set AI_DOCS__MODE=enterprise to enable the enterprise service stack.

Run the MCP server

```bash
uv run python src/unified_mcp_server.py
```

The server validates configuration on startup and registers the available MCP tools. Configure Claude Desktop / Code with the generated transport details (see config/claude-mcp-config.example.json).

  1. Copy config/claude-mcp-config.example.json to your Claude settings directory and update the command field if you use a virtual environment wrapper.
  2. If you prefer HTTP transport, export FASTMCP_TRANSPORT=streamable-http and set FASTMCP_HOST/FASTMCP_PORT to match the values referenced in the Claude config.
  3. Restart Claude Desktop / Code so it reloads the MCP manifest and tool list.
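For orientation, a stdio-transport entry in the Claude settings usually follows the common `mcpServers` shape sketched below; treat this as a hedged example and copy the exact fields from config/claude-mcp-config.example.json instead.

```json
{
  "mcpServers": {
    "ai-docs-vector-db": {
      "command": "uv",
      "args": ["run", "python", "src/unified_mcp_server.py"],
      "env": {
        "FASTMCP_TRANSPORT": "stdio"
      }
    }
  }
}
```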

Bulk ingestion CLI

```bash
uv run python src/crawl4ai_bulk_embedder.py --help
```

Use CSV/JSON/TXT URL lists to scrape, chunk, embed, and upsert into Qdrant with resumable checkpoints.
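The chunking step that runs between scraping and embedding can be illustrated with a fixed-size sliding window with overlap; the CLI's actual chunker and its defaults may differ, and the sizes below are illustrative.

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Slide a size-character window over the text, stepping by size - overlap
    # so adjacent chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


doc = "x" * 500
parts = chunk(doc)
print(len(parts))  # 3 windows: [0:200], [150:350], [300:500]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.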

Docker Compose

  • Simple profile (API + Qdrant): docker compose --profile simple up -d
  • Enterprise profile (adds Dragonfly, PostgreSQL, worker, Prometheus, Grafana): docker compose --profile enterprise up -d

Stop with docker compose down when finished.

Configuration

  • Configuration is defined with Pydantic models in src/config/models.py and can be overridden via environment variables (AI_DOCS__*) or YAML files in config/templates/.
  • Mode-aware settings enable or disable services such as advanced caching, A/B testing, and observability.
  • Detailed configuration guidance lives in docs/developers/configuration.md and operator runbooks under docs/operators/.
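The double-underscore delimiter in `AI_DOCS__*` variables maps flat environment keys onto nested settings, in the style of pydantic-settings' `env_nested_delimiter`. The stdlib sketch below mirrors that mapping for illustration; the project's real loader is the Pydantic models in src/config/models.py.

```python
import os


def nested_env(prefix: str = "AI_DOCS__", delim: str = "__") -> dict:
    # Fold AI_DOCS__SECTION__KEY=value variables into a nested dict.
    tree: dict = {}
    for key, value in os.environ.items():
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].lower().split(delim)
        cursor = tree
        for part in parts[:-1]:
            cursor = cursor.setdefault(part, {})
        cursor[parts[-1]] = value
    return tree


os.environ["AI_DOCS__QDRANT__URL"] = "http://localhost:6333"
os.environ["AI_DOCS__MODE"] = "simple"
settings = nested_env()
print(settings["qdrant"]["url"])  # http://localhost:6333
```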

Testing & Quality

```bash
# Quick unit + fast integration tests
python scripts/dev.py test --profile quick

# Full suite with coverage (mirrors CI)
python scripts/dev.py test --profile ci

# Lint, format, type-check, and tests in one pass
python scripts/dev.py quality
```

Performance and benchmark suites are available via python scripts/dev.py benchmark, and chaos/security suites live under tests/ with dedicated markers.

Documentation & Resources

  • User guides: docs/users/ (quick start, search, scraping recipes, troubleshooting).
  • Developer deep dives: docs/developers/ (API reference, integration, architecture).
  • Operator handbook: docs/operators/ (deployment, monitoring, security).
  • Research notes and experiments: docs/research/.

Publishable MkDocs output is generated under site/ when running the documentation pipeline.

Contributing

Contributions are welcome. Read the CONTRIBUTING.md guide for development workflow, coding standards, and review expectations. Please include tests and documentation updates with feature changes. If this stack accelerates your RAG pipelines, consider starring the repository so other developers can discover it.

License

Released under the MIT License.
