AI-focused documentation ingestion and retrieval stack that combines Firecrawl- and Crawl4AI-powered scraping with a Qdrant vector database. The project exposes both FastAPI and MCP interfaces, offers mode-aware configuration (solo developer vs. enterprise feature sets), and ships with tooling for embeddings, hybrid search, retrieval-augmented generation (RAG) workflows, and operational monitoring.
The system ingests documentation sources, generates dense and sparse embeddings, stores them in Qdrant, and serves hybrid search and RAG building blocks. It is built for AI engineers who need reliable documentation ingestion pipelines, reproducible retrieval quality, and integration points for agents or applications.
- Multi-tier crawling orchestration (`src/services/browser/unified_manager.py`) covering lightweight HTTP, Crawl4AI, browser-use, Playwright, and Firecrawl, plus a resumable bulk embedder CLI (`src/crawl4ai_bulk_embedder.py`).
- Hybrid retrieval stack leveraging OpenAI and FastEmbed embeddings, SPLADE sparse vectors, reranking, and HyDE augmentation through the modular Qdrant service (`src/services/vector_db/` and `src/services/hyde/`).
- Dual interfaces: REST endpoints in FastAPI (`src/api/routers/simple/`) and a FastMCP server (`src/unified_mcp_server.py`) that registers search, document management, analytics, and content intelligence tools for Claude Desktop / Code.
- Observability built in: Prometheus instrumentation, structured logging, health checks, optional Dragonfly cache + ARQ worker, and configuration-driven monitoring (`src/services/monitoring/`).
- Developer ergonomics with uv-managed environments, dependency-injector driven service wiring, Ruff + pytest quality gates, and a unified developer CLI (`scripts/dev.py`).
- Overview
- Highlights
- Architecture
- Core Components
- Quick Start
- Configuration
- Testing & Quality
- Documentation & Resources
- Contributing
- License
```mermaid
flowchart LR
subgraph clients["Clients"]
mcp["Claude Desktop / MCP"]
rest["REST / CLI clients"]
end
subgraph api["FastAPI application"]
router["Mode-aware routers"]
factory["Service factory"]
end
subgraph processing["Processing layer"]
crawl["Unified crawling manager"]
embed["Embedding manager"]
search["Hybrid retrieval"]
queue["ARQ task queue"]
end
subgraph data["Storage & caching"]
qdrant[("Qdrant vector DB")]
redis[("Redis / Dragonfly cache")]
storage["Local docs & artifacts"]
end
subgraph observability["Observability"]
metrics["Prometheus exporter"]
health["Health & diagnostics"]
end
mcp --> api
rest --> api
api --> processing
processing --> crawl
processing --> embed
processing --> search
processing --> queue
crawl --> firecrawl["Firecrawl API"]
crawl --> crawl4ai["Crawl4AI"]
crawl --> browseruse["browser-use / Playwright"]
embed --> openai["OpenAI"]
embed --> fastembed["FastEmbed / FlagEmbedding"]
search --> qdrant
queue --> redis
processing --> redis
api --> metrics
metrics --> observability
processing --> health
    health --> observability
```
- `UnifiedBrowserManager` selects the right automation tier and tracks quality metrics (a simplified tier-fallback sketch follows below).
- Firecrawl and Crawl4AI adapters plus browser-use / Playwright integrations cover static and dynamic sites.
- `src/crawl4ai_bulk_embedder.py` streams bulk ingestion, chunking, and embedding into Qdrant with resumable state and progress reporting.
- `docs/users/web-scraping.md` and `docs/users/examples-and-recipes.md` include tier selection guidance and code samples.
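
To make the tier cascade concrete, here is a minimal sketch of escalating fallback crawling. It is not the project's `UnifiedBrowserManager` API; the tier functions are hypothetical stand-ins, with plain HTTP as the cheap first tier and a placeholder for the browser-based tiers.

```python
# Hedged sketch of tiered crawling with fallback -- not the UnifiedBrowserManager API.
# Tier functions are hypothetical stand-ins for the real adapters.
import asyncio

import httpx


async def fetch_lightweight(url: str) -> str:
    """Tier 1: plain HTTP fetch for static pages."""
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.text


async def fetch_with_browser(url: str) -> str:
    """Placeholder for heavier tiers (Crawl4AI, browser-use, Playwright, Firecrawl)."""
    raise NotImplementedError("Delegate to a browser-based crawler here.")


async def crawl(url: str) -> str:
    """Try cheap tiers first and escalate only when they fail."""
    for tier in (fetch_lightweight, fetch_with_browser):
        try:
            return await tier(url)
        except Exception:
            continue  # escalate to the next tier
    raise RuntimeError(f"All crawling tiers failed for {url}")


if __name__ == "__main__":
    print(asyncio.run(crawl("https://example.com"))[:200])
```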
- `src/services/vector_db/` wraps collection management, hybrid search orchestration, adaptive fusion, and payload indexing.
- Dense embeddings via OpenAI or FastEmbed, optional sparse vectors via SPLADE, and reranking hooks are configurable through Pydantic models (`src/config/models.py`).
- HyDE augmentation and caching live under `src/services/hyde/`, enabling query expansion for RAG pipelines.
- Search responses return timing, scoring metadata, and diagnostics suitable for observability dashboards (see the hybrid-query sketch below).
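
For orientation, the sketch below issues a dense + sparse hybrid query with reciprocal rank fusion directly through `qdrant-client`. The collection name, named vectors, and query vectors are illustrative assumptions; the project's own service layer adds fusion tuning, reranking, and HyDE on top of this primitive.

```python
# Hedged sketch: dense + sparse hybrid query with RRF fusion via qdrant-client.
# Collection name, vector names, and query vectors are illustrative assumptions.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

dense_query = [0.12, 0.03, 0.57]        # stand-in for an OpenAI/FastEmbed embedding
sparse_query = models.SparseVector(     # stand-in for a SPLADE encoding
    indices=[17, 842, 9331],
    values=[0.9, 0.4, 0.2],
)

results = client.query_points(
    collection_name="documentation",
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=50),
        models.Prefetch(query=sparse_query, using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion
    limit=10,
    with_payload=True,
)

for point in results.points:
    print(point.score, point.payload)
```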
- FastAPI routes (`/api/v1/search`, `/api/v1/documents`, `/api/v1/collections`) expose the core ingestion and retrieval capabilities (a request sketch follows below).
- The FastMCP server (`src/unified_mcp_server.py`) registers search, document, embedding, scraping, analytics, cache, and content intelligence tool modules (`src/mcp_tools/`).
- Developer CLI (`scripts/dev.py`) manages services, testing profiles, benchmarks, linting, and type checking.
- Example notebooks and scripts under `examples/` demonstrate agentic RAG flows and advanced search orchestration.
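
As a quick illustration, a search request against the REST interface might look like the sketch below; the request body fields are assumptions, so check the interactive OpenAPI docs at `/docs` for the actual schema.

```python
# Hedged sketch: calling the REST search endpoint with httpx.
# The JSON body fields are assumptions; consult /docs for the real request schema.
import httpx

response = httpx.post(
    "http://localhost:8000/api/v1/search",
    json={"query": "How do I configure hybrid search?", "limit": 5},
    timeout=30.0,
)
response.raise_for_status()
print(response.json())
```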
- Prometheus metrics and health endpoints instrument both the API and MCP servers; see `config/prometheus.yml` and `docs/operators/monitoring.md`. An instrumentation sketch follows below.
- Optional Dragonfly cache, PostgreSQL, ARQ workers, and Grafana dashboards are provisioned via `docker-compose.yml` profiles.
- Structured logging and rate limiting middleware are wired through the service factory and CORS/middleware managers (`src/services/fastapi/middleware/`).
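
For context, FastAPI services are commonly instrumented for Prometheus along the lines of the sketch below, which uses `prometheus-fastapi-instrumentator`; the repository's actual wiring lives in `src/services/monitoring/` and may differ, so treat this as an assumption-level illustration.

```python
# Hedged sketch: exposing Prometheus metrics and a health probe from a FastAPI app.
# The project's real monitoring wiring lives in src/services/monitoring/ and may differ.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Collects default request metrics and serves them at /metrics.
Instrumentator().instrument(app).expose(app, endpoint="/metrics")


@app.get("/health")
async def health() -> dict[str, str]:
    """Simple liveness probe suitable for container orchestrators."""
    return {"status": "ok"}
```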
- Python 3.11 (or 3.12) and uv for dependency management.
- A running Qdrant instance (a local Docker container works: `docker compose --profile simple up -d qdrant`).
- API keys for the providers you plan to use (e.g., `OPENAI_API_KEY`, `AI_DOCS__FIRECRAWL__API_KEY`).
| Variable | Purpose | Example |
| --- | --- | --- |
| `AI_DOCS__MODE` | Selects simple or enterprise service wiring. | `AI_DOCS__MODE=enterprise` |
| `AI_DOCS__QDRANT__URL` | Points services at your Qdrant instance. | `http://localhost:6333` |
| `OPENAI_API_KEY` | Enables OpenAI embeddings and HyDE prompts. | `sk-...` |
| `AI_DOCS__FIRECRAWL__API_KEY` | Authenticates Firecrawl API usage. | `fc-...` |
| `AI_DOCS__CACHE__REDIS_URL` | Enables Dragonfly/Redis caching layers. | `redis://localhost:6379` |
| `FASTMCP_TRANSPORT` | Chooses MCP transport (`streamable-http` or `stdio`). | `streamable-http` |
| `FASTMCP_HOST` / `FASTMCP_PORT` | Hostname and port for MCP HTTP transport. | `0.0.0.0` / `8001` |
| `FASTMCP_BUFFER_SIZE` | Tunes MCP stream buffer size (bytes). | `8192` |
Store secrets in a `.env` file or your secrets manager and export them before running the services. A minimal loader sketch follows below.
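
If you keep keys in a local `.env` file, a loader along these lines pulls them into the process environment; `python-dotenv` here is an assumption, since any secrets manager works equally well.

```python
# Hedged sketch: loading secrets from a local .env file with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.environ.get("OPENAI_API_KEY")
firecrawl_key = os.environ.get("AI_DOCS__FIRECRAWL__API_KEY")
```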
```bash
git clone https://github.com/BjornMelin/ai-docs-vector-db-hybrid-scraper
cd ai-docs-vector-db-hybrid-scraper
uv sync --dev

# Ensure Qdrant is reachable at http://localhost:6333
export OPENAI_API_KEY="sk-..."               # optional if using OpenAI
export AI_DOCS__FIRECRAWL__API_KEY="fc-..."  # optional but recommended

uv run python -m src.api.main
```
Visit `http://localhost:8000/docs` for interactive OpenAPI docs. The default mode is `simple`; set `AI_DOCS__MODE=enterprise` to enable the enterprise service stack.
```bash
uv run python src/unified_mcp_server.py
```

The server validates configuration on startup and registers the available MCP tools. Configure Claude Desktop / Code with the generated transport details (see `config/claude-mcp-config.example.json`).
- Copy `config/claude-mcp-config.example.json` to your Claude settings directory and update the `command` field if you use a virtual environment wrapper.
- If you prefer HTTP transport, export `FASTMCP_TRANSPORT=streamable-http` and set `FASTMCP_HOST` / `FASTMCP_PORT` to match the values referenced in the Claude config.
- Restart Claude Desktop / Code so it reloads the MCP manifest and tool list.
```bash
uv run python src/crawl4ai_bulk_embedder.py --help
```

Use CSV/JSON/TXT URL lists to scrape, chunk, embed, and upsert into Qdrant with resumable checkpoints.
- Simple profile (API + Qdrant): `docker compose --profile simple up -d`
- Enterprise profile (adds Dragonfly, PostgreSQL, worker, Prometheus, Grafana): `docker compose --profile enterprise up -d`

Stop with `docker compose down` when finished.
- Configuration is defined with Pydantic models in `src/config/models.py` and can be overridden via environment variables (`AI_DOCS__*`) or YAML files in `config/templates/` (a simplified settings sketch follows below).
- Mode-aware settings enable or disable services such as advanced caching, A/B testing, and observability.
- Detailed configuration guidance lives in `docs/developers/configuration.md` and operator runbooks under `docs/operators/`.
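
The `AI_DOCS__*` override pattern follows standard pydantic-settings nesting, where `__` separates nested fields. The sketch below is a simplified stand-in for the real models in `src/config/models.py`, not the actual classes or field names.

```python
# Hedged sketch: how AI_DOCS__ environment variables map onto nested Pydantic settings.
# Simplified stand-in for the real models in src/config/models.py.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class QdrantConfig(BaseModel):
    url: str = "http://localhost:6333"


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="AI_DOCS__",
        env_nested_delimiter="__",
        env_file=".env",
    )

    mode: str = "simple"                    # overridden by AI_DOCS__MODE
    qdrant: QdrantConfig = QdrantConfig()   # AI_DOCS__QDRANT__URL sets qdrant.url


settings = Settings()
print(settings.mode, settings.qdrant.url)
```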
```bash
# Quick unit + fast integration tests
python scripts/dev.py test --profile quick

# Full suite with coverage (mirrors CI)
python scripts/dev.py test --profile ci

# Lint, format, type-check, and tests in one pass
python scripts/dev.py quality
```

Performance and benchmark suites are available via `python scripts/dev.py benchmark`, and chaos/security suites live under `tests/` with dedicated markers.
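
Marker-gated suites follow standard pytest conventions; the marker names in the sketch below are assumptions for illustration, so check the project's pytest configuration for the real ones.

```python
# Hedged sketch: gating specialised suites behind pytest markers.
# Marker names are assumptions; see the project's pytest configuration for the real ones.
import pytest


@pytest.mark.benchmark
def test_search_latency_budget() -> None:
    measured_ms = 42          # placeholder value standing in for a real measurement
    assert measured_ms < 100  # enforce a latency budget


@pytest.mark.chaos
def test_cache_outage_fallback() -> None:
    # Placeholder: verify the API degrades gracefully when the cache is unavailable.
    assert True
```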
- User guides: `docs/users/` (quick start, search, scraping recipes, troubleshooting).
- Developer deep dives: `docs/developers/` (API reference, integration, architecture).
- Operator handbook: `docs/operators/` (deployment, monitoring, security).
- Research notes and experiments: `docs/research/`.

Publishable MkDocs output is generated under `site/` when running the documentation pipeline.
Contributions are welcome. Read the CONTRIBUTING.md guide for development workflow, coding standards, and review expectations. Please include tests and documentation updates with feature changes. If this stack accelerates your RAG pipelines, consider starring the repository so other developers can discover it.
Released under the MIT License.