
HistoryHounder

Chat with your browser history using AI.

Python Version Management

This project uses Python 3.12 and uv for dependency management. Follow these steps:

Initial Setup

  1. Install Python 3.12 with uv:

    uv python install 3.12
  2. Pin the project to Python 3.12:

    uv python pin 3.12
  3. Install dependencies:

    uv sync

Running the Project

Always use uv run for all Python operations:

# Run tests
uv run pytest

# Start the server
uv run python -m historyhounder.server

# Run any Python script
uv run python your_script.py

Features

  • Advanced Q&A System: Uses Instructor + Ollama for type-safe structured responses with prompt engineering for different question types (statistical, temporal, semantic, comparative, factual)
  • Intelligent Context Optimization: Dynamically adjusts context size and filtering based on question type
  • Source Relevance Filtering: Post-processes results to ensure only highly relevant sources are displayed
  • Model Caching: Optimized to prevent redundant model downloads and loading with two-level caching system
  • Database Isolation: Complete test isolation with configurable database directories via environment variables (HISTORYHOUNDER_VECTOR_STORE_DIR, HISTORYHOUNDER_HISTORY_DB_DIR)
  • Type Safety: Instructor-based structured output ensures reliable, validated responses from LLM
  • Privacy-First: All processing happens locally - no data leaves your machine
  • Web Interface: Easy-to-use browser extension for querying your history
  • REST API: FastAPI backend following OpenAPI specification with proper error handling
  • Comprehensive Testing: Full test suite with integration tests (no mocking approach) and complete workspace isolation
  • MCP Server: Model Context Protocol server for AI model integration with cross-platform browser support
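The model-caching idea above can be sketched as a small in-process cache in front of a loader. This is a sketch, not HistoryHounder's actual API: `get_model` and `loader` are illustrative names, and the second caching level is the loader's own on-disk weight cache (as sentence-transformers provides).

```python
_model_cache: dict[str, object] = {}  # level 1: in-process cache


def get_model(name: str, loader):
    """Return a cached model instance, loading it at most once per process.

    `loader` is a stand-in for the real model constructor; its own
    download cache on disk acts as the second caching level.
    """
    if name not in _model_cache:
        _model_cache[name] = loader(name)
    return _model_cache[name]
```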

MCP Server for AI Model Integration

HistoryHounder includes a Model Context Protocol (MCP) server that allows AI models to directly access browser history data through a standardized interface.

MCP Server Features

  • Cross-Platform Browser Support: Chrome, Firefox, Safari, Edge, Brave
  • Real-Time Data Access: Direct access to browser history databases
  • Standardized Protocol: MCP 1.12.3 compliant with FastMCP 2.11.0
  • Filtering Capabilities: Date ranges, domains, visit counts, browser-specific filtering
  • AI Model Integration: Tools for AI models to access browser context

Available MCP Tools

  1. get_browser_history_tool: Retrieve browser history with filtering

    • Filter by browser (chrome, firefox, safari, edge, brave)
    • Filter by date range (start_date, end_date)
    • Filter by domain
    • Limit results (default: 100 items)
  2. get_history_statistics_tool: Get analytics and statistics

    • Total history items across all browsers
    • Browser distribution
    • Date ranges
    • Query performance metrics
  3. list_supported_browsers_tool: Check available browsers

    • Cross-platform browser detection
    • Accessibility status
    • Path information

Starting the MCP Server

# Start MCP server on default port (8081)
uv run python -m historyhounder.cli mcp-server

# Start on custom port
uv run python -m historyhounder.cli mcp-server --port 8087

# Start on specific host and port
uv run python -m historyhounder.cli mcp-server --host 0.0.0.0 --port 8087

AI Model Integration Examples

Claude with Browser Context

# Pseudocode: `mcp_client` stands in for whatever MCP client handle
# your AI integration provides (e.g. a FastMCP client session)
history_data = mcp_client.call_tool("get_browser_history_tool", {
    "browser": "firefox",
    "limit": 10,
    "start_date": "2025-08-01T00:00:00"
})

GPT with Browsing Analytics

# AI model can analyze browsing patterns
stats = mcp_client.call_tool("get_history_statistics_tool", {})
# AI can then provide insights about browsing habits

Use Cases

  • AI Assistants: Provide context-aware responses based on user's browsing history
  • Productivity Tools: Analyze browsing patterns for time management insights
  • Research Assistants: Help users find previously visited resources
  • Automation: AI-powered workflows that need browser history context

Privacy & Security

  • Local Processing: All data processing happens locally
  • No Data Transmission: Browser history never leaves your machine
  • Secure Access: Direct database access with proper validation
  • Cross-Platform: Works on Windows, macOS, and Linux

Architecture

Core Components

  1. History Extraction (extract_chrome_history.py): Extracts browsing history from Chrome and other browsers
  2. Content Fetching (content_fetcher.py): Retrieves and processes web page content with security validation
  3. Vector Store (vector_store.py): ChromaDB-based storage for embeddings with isolation support
  4. Search Engine (search.py): Advanced Q&A with context optimization and source filtering
  5. LLM Integration (llm/ollama_qa.py): Instructor-based structured output with Ollama for type-safe responses
  6. Web Server (server.py): FastAPI backend with configurable database directories
  7. Browser Extension (extension/): Chrome extension for user interface
  8. Database Isolation (conftest.py, environment variables): Complete test isolation system
  9. MCP Server (mcp/): Model Context Protocol server for AI model integration with cross-platform browser support
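One concrete detail of the history-extraction component: Chrome's SQLite `History` database records visit times as microseconds since the WebKit epoch (1601-01-01 UTC), not the Unix epoch. A conversion helper might look like this (a sketch; the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(chrome_us: int) -> datetime:
    """Convert a Chrome visit timestamp (microseconds since
    1601-01-01 UTC) to a timezone-aware datetime."""
    return WEBKIT_EPOCH + timedelta(microseconds=chrome_us)
```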

Data Flow

  1. Browser history β†’ Content fetching β†’ Security validation β†’ Text processing
  2. Text β†’ Embeddings β†’ Vector storage (ChromaDB with isolation)
  3. User query β†’ Question type classification β†’ Context optimization β†’ Instructor/Ollama processing β†’ Type-safe structured response
  4. Response β†’ Source relevance filtering β†’ User interface
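Step 3's question-type classification can be sketched as a keyword heuristic. This is an illustrative approximation only; the real system's rules and keyword sets may differ.

```python
def classify_question(question: str) -> str:
    """Heuristic sketch of question-type classification over the five
    documented categories; not HistoryHounder's actual classifier."""
    q = question.lower()
    if any(k in q for k in ("how many", "count", "most visited", "total")):
        return "statistical"
    if any(k in q for k in ("yesterday", "last week", "last month", "when did")):
        return "temporal"
    if any(k in q for k in ("compare", " vs ", "difference between")):
        return "comparative"
    if q.startswith(("what is", "who is", "define")):
        return "factual"
    return "semantic"  # default: fall back to semantic search
```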

Development Philosophy

This project follows a comprehensive testing and integration approach:

  • Integration Tests: Tests run without mocking when possible for real-world validation
  • Quality Assurance: All tests must pass (123/123) before committing
  • Performance Focus: Model caching, context optimization, and efficient vector operations
  • Security First: Input validation, path sanitization, and comprehensive error handling

Technology Choices

Core Technologies

Python 3.12

  • Rationale: Latest stable Python with modern features, excellent ecosystem for ML/AI, strong typing support
  • Benefits: Rich library ecosystem, performance improvements, enhanced error messages, cross-platform compatibility
  • Requirement: Python 3.12+ is required for compatibility with all dependencies

ChromaDB (Vector Database)

  • Rationale: Local-first, Python-native, excellent for embeddings and metadata
  • Benefits:
    • No external dependencies (runs locally)
    • Excellent metadata support
    • Persistent client for test isolation
    • Active development and community

Sentence Transformers (Embeddings)

  • Rationale: High-quality, fast, local embedding models
  • Benefits:
    • No API calls required
    • Excellent semantic search performance
    • Multiple model options (all-MiniLM-L6-v2, etc.)
    • Easy to swap models

Instructor (Structured LLM Output)

  • Rationale: Modern framework for structured output from LLMs, replaces LangChain for better type safety
  • Benefits:
    • Type-safe structured responses with Pydantic models
    • Simplified LLM interaction patterns
    • Better error handling and validation
    • Cleaner, more maintainable code
    • Direct integration with Ollama for local processing
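With Instructor, the response schema is expressed as a Pydantic model that LLM output is validated against. A hedged sketch of what such a schema could look like (field names and constraints here are illustrative, not HistoryHounder's actual models; requires `pydantic`):

```python
from pydantic import BaseModel, Field


class HistoryAnswer(BaseModel):
    """Hypothetical structured-response schema for illustration."""
    answer: str
    question_type: str = Field(
        description="statistical, temporal, semantic, comparative, or factual"
    )
    confidence: float = Field(ge=0.0, le=1.0)  # validated range
    sources: list[str] = Field(default_factory=list)
```

Instructor then asks the LLM for output conforming to this model and re-validates (or retries) until the response parses, which is what makes the answers type-safe.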

Ollama (Local LLM)

  • Rationale: Easy-to-use local LLM runner, privacy-first
  • Benefits:
    • No data leaves your machine
    • Multiple model support (llama3.2:latest default)
    • Simple setup and management
    • Active development
    • Direct integration with Instructor for structured output

FastMCP (Model Context Protocol)

  • Rationale: Production-ready MCP framework for AI model integration
  • Benefits:
    • Standardized protocol for AI model communication
    • Built-in authentication and deployment features
    • 90% less boilerplate compared to custom implementations
    • Official MCP ecosystem integration
    • Cross-platform browser history access for AI models

Content Extraction Technologies

readability-lxml

  • Rationale: Excellent article content extraction, used by major platforms
  • Benefits: Robust HTML parsing, handles complex layouts, fallback mechanisms

yt-dlp

  • Rationale: Most comprehensive video metadata extraction
  • Benefits: Supports multiple platforms, rich metadata, active maintenance

BeautifulSoup

  • Rationale: Fallback HTML parsing when readability fails
  • Benefits: Robust, handles malformed HTML, extensive documentation

Development and Testing

pytest

  • Rationale: Modern, feature-rich testing framework
  • Benefits: Fixtures, parametrization, excellent reporting, plugin ecosystem

Database Isolation System

  • Rationale: Complete separation of test and production data for safety and reliability
  • Benefits:
    • Environment variable configuration (HISTORYHOUNDER_VECTOR_STORE_DIR, HISTORYHOUNDER_HISTORY_DB_DIR)
    • Isolated test fixtures with automatic cleanup
    • Zero workspace contamination during testing
    • Safe parallel test execution
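The isolation scheme can be sketched with a stdlib-only context manager that points the two documented environment variables at throwaway directories for the duration of a test, then restores the previous environment. This is a sketch, not the project's actual conftest fixtures:

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_dbs():
    """Redirect HistoryHounder's database directories to a temporary
    location, restoring the prior environment on exit."""
    keys = ("HISTORYHOUNDER_VECTOR_STORE_DIR", "HISTORYHOUNDER_HISTORY_DB_DIR")
    saved = {k: os.environ.get(k) for k in keys}
    with tempfile.TemporaryDirectory() as tmp:
        for k in keys:
            os.environ[k] = os.path.join(tmp, k.lower())
        try:
            yield tmp
        finally:
            for k, v in saved.items():
                if v is None:
                    os.environ.pop(k, None)
                else:
                    os.environ[k] = v
```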

uv (Package Management)

  • Rationale: Fast, modern Python package manager
  • Benefits: Faster than pip, better dependency resolution, virtual environment management

requests

  • Rationale: Simple, reliable HTTP library
  • Benefits: Excellent documentation, wide adoption, good error handling

Security Technologies

Input Validation

  • Custom validation functions: URL and file path validation
  • shlex: Shell command escaping
  • urllib.parse: URL parsing and validation
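A minimal sketch of how these pieces fit together, assuming hypothetical helper names (the project's actual validators are likely stricter):

```python
import shlex
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Reject anything that is not a plain http(s) URL before it is
    fetched or handed to a subprocess."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Unsupported or malformed URL: {url!r}")
    return url


def shell_safe(arg: str) -> str:
    # shlex.quote escapes shell metacharacters, preventing command
    # injection if the value ever reaches a shell
    return shlex.quote(arg)
```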

File Operations

  • tempfile: Secure temporary file creation
  • contextlib: Context managers for resource cleanup
  • os.path: Safe path operations

Alternative Technologies Considered

Vector Databases

  • Pinecone: Rejected due to cloud dependency and privacy concerns
  • Weaviate: Rejected due to complexity and external dependencies
  • Qdrant: Considered, but ChromaDB is better suited to the local-first approach

Embedding Models

  • OpenAI Embeddings: Rejected due to API dependency and privacy concerns
  • Cohere: Rejected due to API dependency
  • Hugging Face Inference API: Rejected due to API dependency

LLM Frameworks

  • LangChain: Migrated away from due to complexity and overhead for simple structured output needs
  • OpenAI GPT: Rejected due to privacy concerns and API dependency
  • Anthropic Claude: Rejected due to privacy concerns and API dependency
  • Local models via Hugging Face: Considered but Ollama + Instructor provides better UX and type safety

Content Extraction

  • newspaper3k: Considered, but readability-lxml proved more robust
  • trafilatura: Considered, but readability-lxml is more widely adopted
  • youtube-dl: Rejected in favor of yt-dlp (more active development)

Performance Considerations

Embedding Performance

  • Batch processing: Efficient handling of large document sets
  • Model selection: all-MiniLM-L6-v2 provides good speed/quality balance
  • Caching: ChromaDB provides efficient similarity search
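Batch processing here amounts to slicing the document set into fixed-size chunks and embedding one chunk at a time, so vectors for the whole corpus never sit in memory at once. A minimal sketch (the batch size is illustrative):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int = 64) -> Iterator[Sequence[T]]:
    """Yield fixed-size slices of `items` for batch-by-batch embedding."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```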

Content Extraction Performance

  • Parallel processing: Future enhancement for multiple URLs
  • Caching: requests-cache for network requests
  • Timeout handling: Prevents hanging on slow responses

Memory Management

  • Streaming: Large datasets processed in batches
  • Cleanup: Proper resource cleanup in all components
  • Garbage collection: Explicit cleanup in vector store operations

Setup

1. Clone the repo and enter the directory

git clone <repo-url>
cd HistoryHounder

2. Install uv (if not already installed)

pip install uv

3. Create a virtual environment and install dependencies

uv venv
uv pip install -r requirements.txt

4. (Optional) Install and run Ollama for local LLM Q&A

  • Download and install Ollama from https://ollama.com/
  • Start Ollama: ollama serve
  • Pull a model (e.g., ollama pull llama3)

Usage

Extract and Embed Your Browser History

Extract, fetch content, and embed your history from the last 7 days:

uv run python -m historyhounder.cli extract --days 7 --with-content --embed

Limit Number of URLs Processed

You can limit the number of distinct URLs processed using the --url-limit option. This is useful for testing or when you want to process only a subset of your history:

# Process only the first 10 URLs from your history
uv run python -m historyhounder.cli extract --url-limit 10 --with-content

# Process only the first 5 URLs and embed them
uv run python -m historyhounder.cli extract --url-limit 5 --with-content --embed

# Process only the first 3 URLs from the last 30 days
uv run python -m historyhounder.cli extract --days 30 --url-limit 3 --with-content

Domain and URL Pattern Filtering

You can ignore specific domains or URL patterns during extraction, embedding, and search using the --ignore-domain and --ignore-pattern options. Multiple values can be specified in a single argument, separated by commas:

  • --ignore-domain: Ignore all URLs from specific domains (comma-separated)
  • --ignore-pattern: Ignore all URLs matching substrings or regex patterns (comma-separated)

Examples:

Ignore all YouTube and Facebook URLs when extracting and embedding:

uv run python -m historyhounder.cli extract --with-content --embed --ignore-domain "youtube.com,facebook.com"

Ignore all URLs containing /ads/ or matching a regex pattern:

uv run python -m historyhounder.cli extract --with-content --embed --ignore-pattern "/ads/,.*tracking.*"

You can combine these options as needed. The same options are available for the search command.
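The filtering semantics described above can be sketched as follows. This is an approximation for illustration: domains match the hostname (including subdomains) and patterns are applied with `re.search`, but the CLI's exact matching rules may differ.

```python
import re
from urllib.parse import urlparse


def filter_urls(urls, ignore_domains="", ignore_patterns=""):
    """Drop URLs matching comma-separated domains or regex/substring
    patterns, mirroring --ignore-domain / --ignore-pattern."""
    domains = {d.strip().lower() for d in ignore_domains.split(",") if d.strip()}
    patterns = [p.strip() for p in ignore_patterns.split(",") if p.strip()]
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            continue  # domain (or subdomain) is ignored
        if any(re.search(p, url) for p in patterns):
            continue  # URL matches an ignore pattern
        kept.append(url)
    return kept
```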

Semantic Search

Search your embedded history with a natural language query:

uv run python -m historyhounder.cli search --query "Shopify AI tools" --top-k 5
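Under the hood, semantic search ranks embedded pages by vector similarity between the query embedding and each document embedding; ChromaDB computes this internally, but the underlying measure is conceptually cosine similarity:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 for identical direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```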

LLM Q&A with Ollama

Ask a question and get an answer from your history using a local LLM:

uv run python -m historyhounder.cli search --query "What was that article I read last week about Shopify and AI tools?" --llm ollama --llm-model llama3

  • You can change --llm-model to any model available in your Ollama installation (e.g., mistral, llama2, etc.)

Backend Server for Browser Extension

Start the backend server to enable browser extension integration:

# Start server on default port (8080)
uv run python -m historyhounder.cli server

# Start server on custom port
uv run python -m historyhounder.cli server --port 9000

# Start server on all interfaces (for remote access)
uv run python -m historyhounder.cli server --host 0.0.0.0 --port 8080

# Alternative: Run server directly
uv run python -m historyhounder.server --port 8080

# Configure Ollama model (default: llama3.2:latest)
HISTORYHOUNDER_OLLAMA_MODEL=llama3.2:latest uv run python -m historyhounder.server --port 8080

The server provides RESTful API endpoints for:

  • Health Check: GET /api/health
  • Semantic Search: GET /api/search?q=query
  • AI Q&A: POST /api/qa
  • History Processing: POST /api/process-history
  • Statistics: GET /api/stats

API Documentation:

  • Swagger UI: http://localhost:8080/docs
  • ReDoc: http://localhost:8080/redoc
  • OpenAPI JSON: http://localhost:8080/openapi.json

Environment Variable Configuration

The server supports several environment variables for configuration:

Ollama Model Configuration

# Use default model (llama3.2:latest)
uv run python -m historyhounder.server --port 8080

# Use a specific model
HISTORYHOUNDER_OLLAMA_MODEL=llama3.2:latest uv run python -m historyhounder.server --port 8080

# Use a different model
HISTORYHOUNDER_OLLAMA_MODEL=llama3.1:latest uv run python -m historyhounder.server --port 8080

# Use a custom model
HISTORYHOUNDER_OLLAMA_MODEL=my-custom-model uv run python -m historyhounder.server --port 8080

Database Directory Configuration

# Use custom vector store directory (for testing or isolation)
HISTORYHOUNDER_VECTOR_STORE_DIR=/custom/path/chroma_db uv run python -m historyhounder.server

# Use custom history database directory (for testing or isolation)
HISTORYHOUNDER_HISTORY_DB_DIR=/custom/path/history_db uv run python -m historyhounder.server

# Combine multiple environment variables
HISTORYHOUNDER_OLLAMA_MODEL=llama3.1:latest \
HISTORYHOUNDER_VECTOR_STORE_DIR=/isolated/chroma \
HISTORYHOUNDER_HISTORY_DB_DIR=/isolated/history \
uv run python -m historyhounder.server --port 8080
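The resolution logic for these variables can be sketched with `os.environ.get` and per-variable fallbacks. The default model name is the documented one; the default directory names here are illustrative assumptions:

```python
import os

DEFAULTS = {
    "HISTORYHOUNDER_OLLAMA_MODEL": "llama3.2:latest",   # documented default
    "HISTORYHOUNDER_VECTOR_STORE_DIR": "chroma_db",     # illustrative
    "HISTORYHOUNDER_HISTORY_DB_DIR": "history_db",      # illustrative
}


def resolve_config() -> dict[str, str]:
    """Return each setting from the environment, falling back to its
    default when the variable is unset."""
    return {k: os.environ.get(k, v) for k, v in DEFAULTS.items()}
```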

Available Models:

  • llama3.2:latest (default) - Latest Llama 3.2 model
  • llama3.1:latest - Llama 3.1 model
  • llama3:latest - Llama 3 model
  • Any other model available in your Ollama installation

Check Current Model:

curl http://localhost:8080/api/health

Get Model Information:

curl http://localhost:8080/api/ollama/model

Browser Extension Integration

The backend server enables the HistoryHounder browser extension to:

  • Perform semantic search on your browser history

  • Ask AI questions about your browsing patterns

  • Process and sync history data

  • Access enhanced features through the extension UI

Note: when syncing history, the extension sends only the required fields (id, url, title, lastVisitTime, visitCount) to the backend; this prevents 422 Unprocessable Entity errors from the FastAPI backend.

Troubleshooting

  • 422 Unprocessable Entity Error on /api/process-history: This error means the request body did not match the expected schema. Make sure the extension is up to date and only sends the required fields with the correct types. See the extension's README for more details.

Running Tests

To run all tests:

uv run pytest -v

To run security tests specifically:

uv run pytest tests/test_security.py -v

The integration tests use real, public URLs (externalized in tests/real_world_urls.txt) to ensure the pipeline works as a real user would experience it. No mocking is used in integration tests. YouTube extraction is powered by yt-dlp and the tests are robust to metadata field variations.

Test Coverage

The test suite provides comprehensive coverage including:

  • Unit Tests: Individual component testing with proper mocking
  • Integration Tests: End-to-end pipeline testing with real data
  • Security Tests: Input validation, path traversal, command injection, error handling
  • Edge Case Tests: Error handling, malformed input, empty results
  • CLI Tests: Argument parsing, error messages, malformed input
  • Error Handling Tests: Database failures, network timeouts, corrupted data
  • Real-world Scenarios: Using actual public URLs and content

Integration (End-to-End) Tests

  • The test suite includes robust integration tests that simulate the full pipeline: extracting sample browser history, fetching content, embedding, storing in ChromaDB, and performing semantic search.
  • Integration tests use pytest fixtures to ensure each test uses a unique ChromaDB collection, providing full isolation and preventing cross-test contamination.
  • Tests now use real datetime objects and edge-case metadata to ensure all metadata is properly converted and stored in ChromaDB. This catches issues with datetime, None, lists, dicts, and other non-primitive types.

Project Structure

historyhounder/
  cli.py                # Main CLI entry point
  history_extractor.py  # Extracts browser history
  content_fetcher.py    # Fetches and extracts content from URLs
  vector_store.py       # Chroma vector DB integration
  embedder/             # Pluggable embedders (sentence-transformers, etc.)
  llm/
    ollama_qa.py        # Instructor-based Q&A with Ollama
  pipeline.py           # Main orchestration logic
  utils.py              # Utility functions
  ...
tests/                  # All tests (pytest, including integration)
  test_security.py      # Security-focused tests
  real_world_urls.txt   # Real URLs for integration testing
SECURITY.md             # Security guidelines and best practices
requirements.txt        # All dependencies
pytest.ini              # Pytest config (warning filters)
README.md               # This file

Security

HistoryHounder implements comprehensive security measures:

  • Input Validation: All user inputs are validated and sanitized
  • Path Security: File operations restricted to safe directories
  • Subprocess Security: URL validation prevents command injection
  • Error Handling: Sensitive information removed from error messages
  • Security Testing: Automated tests for all security measures

See SECURITY.md for detailed security guidelines and best practices.

License

MIT (or your chosen license)


ChromaDB Best Practices: PersistentClient for Test Isolation and Multi-Instance Support

  • ChromaDB uses a singleton/shared client by default, which does not allow multiple clients with different settings (such as different persist directories) in the same process.
  • To support integration tests, test isolation, and multiple independent vector stores in the same process, HistoryHounder uses chromadb.PersistentClient instead of chromadb.Client.
  • This allows each test or pipeline to specify its own persist_directory (via the path argument), ensuring that data is isolated and there are no conflicts between tests or runs.
  • If you need to use multiple ChromaDB databases or collections in the same process (e.g., for testing, multi-user, or multi-tenant scenarios), always use PersistentClient.
  • Example usage:
import chromadb
client = chromadb.PersistentClient(path="path/to/chroma_db")
collection = client.get_or_create_collection("history")

TODO - Future Enhancements

πŸš€ Planned Features

User Experience Improvements

  • Keyboard Shortcuts: Add keyboard shortcuts for common actions (Ctrl+S for search, etc.)
  • Search Suggestions: Auto-complete and search suggestions based on history
  • Search History: Remember and display recent searches
  • Voice Input: Speech-to-text functionality for AI chat
  • Custom Themes: Light/dark mode toggle with theme customization
  • Export Features: Export search results, statistics, or chat history
  • Bookmark Integration: Connect with browser bookmarks for enhanced search

Advanced Functionality

  • Advanced Filters: Date range picker, file type filters, visit count filters
  • Smart Categories: Auto-categorize websites (work, personal, shopping, etc.)
  • Search Analytics: Track search patterns and popular queries
  • Offline Mode: Cache data for offline access and search
  • Batch Operations: Select multiple history items for bulk actions
  • Search Templates: Save and reuse complex search queries

AI & Analytics Enhancements

  • Personalized Insights: AI-generated insights about browsing patterns
  • Predictive Search: Suggest searches based on time of day and patterns
  • Content Analysis: Analyze page content for better categorization
  • Usage Analytics: Detailed analytics dashboard with charts and graphs
  • Export Reports: Generate PDF/CSV reports of browsing statistics

Integration & Connectivity

  • Cross-Browser Sync: Sync settings and data across browsers
  • Cloud Backup: Optional cloud backup of settings and preferences
  • API Integration: Connect with external services (Notion, Obsidian, etc.)
  • Webhook Support: Send notifications to external services
  • Mobile Companion: Mobile app for viewing statistics and insights

Performance & Technical

  • Lazy Loading: Implement lazy loading for large history datasets
  • Search Indexing: Optimize search performance with better indexing
  • Memory Management: Improve memory usage for large history files
  • Background Sync: Automatic background synchronization
  • Progressive Web App: PWA capabilities for standalone use

Accessibility & Internationalization

  • Screen Reader Support: Enhanced accessibility for visually impaired users
  • High Contrast Mode: Dedicated high contrast theme
  • Internationalization: Multi-language support (i18n)
  • Keyboard Navigation: Full keyboard navigation support
  • Voice Commands: Voice control for hands-free operation

Security & Privacy

  • End-to-End Encryption: Encrypt sensitive data
  • Privacy Controls: Granular privacy settings and data controls
  • Data Anonymization: Option to anonymize data for analytics
  • Audit Log: Track data access and usage
  • GDPR Compliance: Full GDPR compliance features

🎯 Priority Levels

High Priority (Next Release)

  • Keyboard shortcuts
  • Search suggestions
  • Custom themes (light/dark mode)
  • Export features
  • Enhanced accessibility

Medium Priority (Future Releases)

  • Advanced filters
  • Smart categories
  • Offline mode
  • Cross-browser sync
  • Performance optimizations

Low Priority (Long-term)

  • Voice input
  • Mobile companion
  • API integrations
  • Cloud backup
  • Internationalization

🀝 Contributing to TODO Items

We welcome contributions! If you'd like to work on any of these features:

  1. Check the Issues: Look for existing issues related to the feature
  2. Create a Proposal: Open an issue describing your implementation plan
  3. Follow Guidelines: Ensure your code follows our coding standards
  4. Test Thoroughly: Include tests for new functionality
  5. Document Changes: Update documentation for new features
