MCP (Model Context Protocol) server for converting HTML webpages to clean Markdown format. Reduces HTML size by ~90-95% while preserving tables, images, and important content - perfect for AI context.
- Converts HTML from URLs to clean Markdown
- Preserves tables, images, and links
- Removes unnecessary elements (scripts, styles, navigation, footers, headers)
- Significant size reduction (typically 90-95% compression)
- Configurable options for images, tables, and links
- Built with trafilatura and BeautifulSoup4 for robust extraction
- Stream processing for efficient handling of large pages
- Size limits to prevent downloading excessively large content (1MB-50MB)
- Optional caching to speed up repeated conversions of the same URLs
- 🌐 Browser mode with Playwright: handles JavaScript-heavy sites and authenticated pages
  - Execute JavaScript (perfect for SPAs: React, Vue, Angular)
  - Use your browser profile with cookies (access authenticated pages!)
  - Support for Chrome, Firefox, WebKit
  - Configurable wait strategies for dynamic content
- Python 3.10 or higher
- uv package manager (recommended) or pip
# Clone the repository
git clone <your-repo-url>
cd html2md
# Install dependencies
uv pip install -e .
# Install Playwright browsers (required for browser mode)
playwright install chromium
Alternatively, using pip and a virtual environment:
# Clone the repository
git clone <your-repo-url>
cd html2md
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e .
# Install Playwright browsers (required for browser mode)
playwright install chromium
The easiest way to use html2md is with Docker:
# Build the image
docker build -t html2md .
# Or use pre-built image (when published)
docker pull your-registry/html2md:latest
For Claude Desktop, configure with Docker:
{
  "mcpServers": {
    "html2md": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "html2md"
      ]
    }
  }
}
Docker Image Features:
- Pre-installed Playwright with Chromium
- Optimized for minimal size (~1GB)
- Non-root user for security
- Ready to use - no additional setup required
Add the server to your Claude Desktop configuration file:
On macOS, edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}
On Windows, edit %APPDATA%/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\absolute\\path\\to\\html2md",
        "run",
        "html2md"
      ]
    }
  }
}
On Linux, edit ~/.config/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "html2md": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/html2md",
        "run",
        "html2md"
      ]
    }
  }
}
Once configured, the MCP server will be available in Claude Desktop. You can use the html_to_markdown tool:
Convert this webpage to markdown: https://example.com/article
Use the html_to_markdown tool with:
- url: https://example.com/docs
- include_images: false
- include_tables: true
Use the html_to_markdown tool with:
- url: https://spa-application.com
- fetch_method: playwright
- wait_for: networkidle
Use the html_to_markdown tool with:
- url: https://private-site.com/dashboard
- fetch_method: playwright
- use_user_profile: true
- browser_type: chromium
Note: For use_user_profile=true, make sure Chrome is closed before running.
Basic Parameters:
- url (required): URL of the webpage to convert
- include_images (optional, default: true): Include images in Markdown
- include_tables (optional, default: true): Include tables in Markdown
- include_links (optional, default: true): Include links in Markdown
- timeout (optional, default: 30): Request timeout in seconds (5-120)
Performance Parameters:
- max_size (optional, default: 10MB): Maximum size of content to download in bytes (1MB-50MB)
- use_cache (optional, default: false): Enable caching for faster repeated conversions
- cache_ttl (optional, default: 3600): Cache time-to-live in seconds (60-86400)
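As a rough illustration of how a max_size cap can be enforced while streaming, the sketch below reads a response body in chunks and aborts as soon as the limit is exceeded. The names read_limited and ContentTooLargeError, and the 64 KB chunk size, are hypothetical and not taken from the project; any file-like object with a read() method stands in for the HTTP response.

```python
import io
from typing import BinaryIO


class ContentTooLargeError(Exception):
    """Raised when the body exceeds the configured limit (hypothetical name)."""


def read_limited(stream: BinaryIO, max_size: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read a stream chunk by chunk, stopping early once max_size is exceeded."""
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        buffer.extend(chunk)
        if len(buffer) > max_size:
            # Abort without consuming the rest of the stream.
            raise ContentTooLargeError(
                f"content exceeds limit of {max_size} bytes"
            )
    return bytes(buffer)


if __name__ == "__main__":
    body = io.BytesIO(b"<html>" + b"x" * 100 + b"</html>")
    print(len(read_limited(body, max_size=1024)))
```

Checking the size after each chunk means an oversized page costs at most one extra chunk of download, rather than the whole body.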
Browser Mode Parameters:
- fetch_method (optional, default: "fetch"): Fetch method - "fetch" (fast) or "playwright" (handles JS, auth)
- browser_type (optional, default: "chromium"): Browser to use - "chromium", "firefox", or "webkit"
- headless (optional, default: true): Run browser in headless mode
- wait_for (optional, default: "networkidle"): Wait strategy - "load", "domcontentloaded", or "networkidle"
- use_user_profile (optional, default: false): Use your browser profile with cookies (requires Chrome closed)
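The defaults and ranges listed above can be summarized in code. This is only a sketch of how such validation might look: ConversionOptions and its helper functions are hypothetical names, and treating 1MB as 1024 * 1024 bytes is an assumption, since the documentation does not say which definition the server uses.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


def _check_range(name: str, value: int, low: int, high: int) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def _check_choice(name: str, value: str, choices: tuple) -> None:
    if value not in choices:
        raise ValueError(f"{name} must be one of {choices}, got {value!r}")


@dataclass(frozen=True)
class ConversionOptions:
    """Hypothetical container mirroring the documented tool parameters."""

    url: str
    include_images: bool = True
    include_tables: bool = True
    include_links: bool = True
    timeout: int = 30            # seconds, 5-120
    max_size: int = 10 * MB      # bytes, 1MB-50MB
    use_cache: bool = False
    cache_ttl: int = 3600        # seconds, 60-86400
    fetch_method: str = "fetch"
    browser_type: str = "chromium"
    headless: bool = True
    wait_for: str = "networkidle"
    use_user_profile: bool = False

    def __post_init__(self) -> None:
        if not self.url:
            raise ValueError("url is required")
        _check_range("timeout", self.timeout, 5, 120)
        _check_range("max_size", self.max_size, 1 * MB, 50 * MB)
        _check_range("cache_ttl", self.cache_ttl, 60, 86400)
        _check_choice("fetch_method", self.fetch_method, ("fetch", "playwright"))
        _check_choice("browser_type", self.browser_type, ("chromium", "firefox", "webkit"))
        _check_choice("wait_for", self.wait_for, ("load", "domcontentloaded", "networkidle"))


if __name__ == "__main__":
    print(ConversionOptions(url="https://example.com/docs", include_images=False))
```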
# Install development dependencies
uv pip install -e ".[dev]"
# Run tests
pytest
# Format with black
black src/ tests/
# Lint with ruff
ruff check src/ tests/
# Type-check with mypy
mypy src/
The project consists of the following modules:
Core HTML to Markdown conversion functionality:
- fetch_html(): Downloads HTML from URL
- clean_html(): Removes unnecessary elements with BeautifulSoup
- convert_to_markdown(): Converts cleaned HTML to Markdown with trafilatura
- html_to_markdown(): Main workflow combining all steps
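To give a feel for the cleaning step, here is a minimal sketch that drops the contents of scripts, styles, navigation, footers, and headers. It is not the project's clean_html(): it uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without dependencies, it emits plain text rather than Markdown, and the names SKIP_TAGS, BoilerplateStripper, and strip_boilerplate are hypothetical.

```python
from html.parser import HTMLParser

# Elements whose entire content is discarded (from the feature list).
SKIP_TAGS = {"script", "style", "nav", "footer", "header"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # how many skipped elements we are inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<nav>Home | About</nav><p>Main article text.</p>"
        "<footer>Copyright</footer><script>track()</script></body></html>"
    )
    print(strip_boilerplate(page))
```

Note that this simple version also drops a header element nested inside an article, which a real extractor would likely want to keep.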
MCP server implementation:
- Registers the html_to_markdown tool
- Handles tool calls and error responses
- Runs async MCP server with stdio transport
Utility functions:
- Hash calculation for caching
- Text formatting and truncation
- Domain extraction
- Filename sanitization
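The four kinds of helper listed above might look something like the following. These are illustrative sketches only: the function names, the choice of SHA-256, the "..." suffix, and the set of characters allowed in filenames are all assumptions, not the project's actual utilities.

```python
import hashlib
import re
from urllib.parse import urlparse


def cache_key(url: str, **params) -> str:
    """Stable hash of a URL plus its conversion parameters."""
    # Sort parameters so that argument order does not change the key.
    raw = url + "|" + "|".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def truncate(text: str, max_len: int = 80, suffix: str = "...") -> str:
    """Shorten text to at most max_len characters, marking the cut."""
    if len(text) <= max_len:
        return text
    return text[: max_len - len(suffix)] + suffix


def extract_domain(url: str) -> str:
    """Return the lowercased host part of a URL."""
    return urlparse(url).netloc.lower()


def sanitize_filename(name: str, replacement: str = "_") -> str:
    """Replace runs of characters outside [A-Za-z0-9._-] with a separator."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, name)
    cleaned = cleaned.strip(replacement + ".")
    return cleaned or "untitled"


if __name__ == "__main__":
    print(extract_domain("https://Example.com/docs"))
    print(sanitize_filename("my page: v1/2.md"))
    print(truncate("abcdefghij", max_len=8))
```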
In-memory caching system:
- SimpleCache class with TTL support
- Global cache instance management
- Automatic expiration of old entries
- Hash-based cache keys for URL + parameters
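An in-memory cache with TTL expiry can be sketched in a few lines. The class below is a hypothetical TTLCache, not the project's SimpleCache, whose interface is not shown here; the injectable clock is an assumption added to make expiry testable without sleeping. Keys are plain strings, for example a hash of the URL and parameters.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._store: dict = {}  # key -> (expires_at, value)

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Lazily drop the stale entry on access.
            del self._store[key]
            return None
        return value

    def purge_expired(self) -> int:
        """Remove all expired entries and return how many were dropped."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._store.items() if now >= exp]
        for k in expired:
            del self._store[k]
        return len(expired)


if __name__ == "__main__":
    cache = TTLCache(ttl=60)
    cache.set("page", "# Markdown")
    print(cache.get("page"))
```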
Playwright browser automation:
- fetch_html_playwright() - Async browser-based HTML fetching
- Support for Chromium, Firefox, WebKit
- User profile integration for authenticated access
- Configurable wait strategies for dynamic content
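For orientation, browser-based fetching with Playwright's async API could be sketched as follows. This is not the project's fetch_html_playwright(): the function name fetch_rendered_html and its signature are hypothetical, and user-profile support is left out. Playwright is imported lazily so that argument validation works even where it is not installed.

```python
import asyncio

BROWSERS = ("chromium", "firefox", "webkit")
WAIT_STRATEGIES = ("load", "domcontentloaded", "networkidle")


async def fetch_rendered_html(
    url: str,
    browser_type: str = "chromium",
    wait_for: str = "networkidle",
    headless: bool = True,
    timeout: int = 30,
) -> str:
    """Load a page in a real browser and return the rendered HTML."""
    if browser_type not in BROWSERS:
        raise ValueError(f"browser_type must be one of {BROWSERS}")
    if wait_for not in WAIT_STRATEGIES:
        raise ValueError(f"wait_for must be one of {WAIT_STRATEGIES}")

    # Lazy import: requires `pip install playwright` and `playwright install`.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await getattr(p, browser_type).launch(headless=headless)
        try:
            page = await browser.new_page()
            # Playwright expects the timeout in milliseconds.
            await page.goto(url, wait_until=wait_for, timeout=timeout * 1000)
            return await page.content()
        finally:
            await browser.close()


if __name__ == "__main__":
    html = asyncio.run(fetch_rendered_html("https://example.com"))
    print(len(html))
```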
- Check that the path in claude_desktop_config.json is absolute and correct
- Restart Claude Desktop completely
- Check Claude Desktop logs for errors
# Verify Python version
python --version # Should be 3.10+
# Try reinstalling dependencies
uv pip install --force-reinstall -e .
- Timeout errors: Increase the timeout parameter
- Empty content: Some websites may block automated requests or use JavaScript rendering
  - Solution: Use fetch_method: playwright to execute JavaScript
- Parse errors: The webpage structure may be unusual or malformed
- Content too large: Increase the max_size parameter (up to 50MB); pages larger than that cannot be converted
- Cache issues: Disable caching with use_cache: false if you need fresh content
- Playwright not installed: Run playwright install chromium
- Browser launch fails: Check that you have sufficient permissions and disk space
- User profile error: Make sure Chrome is completely closed before using use_user_profile: true
- Page doesn't load fully: Try different wait_for strategies:
  - "domcontentloaded" - fastest, waits only for the DOM to be ready
  - "load" - waits for the page load event, including images and stylesheets
  - "networkidle" - slowest but most reliable, waits for network to be idle
- Authentication not working: Ensure you're using browser_type: chromium and use_user_profile: true
Typical conversion results:
- Original HTML: ~500KB - 2MB
- Markdown output: ~25KB - 100KB
- Compression: 90-95%
- Processing time: 2-10 seconds (depending on page size and network)
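The compression figure is simply the share of bytes removed. A small sketch of the arithmetic, using a hypothetical helper name and the low-end sizes from the list above:

```python
def compression_percent(original_bytes: int, markdown_bytes: int) -> float:
    """Percentage of the original size that was removed."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100.0 * (1 - markdown_bytes / original_bytes)


if __name__ == "__main__":
    # 500KB of HTML reduced to 25KB of Markdown
    print(round(compression_percent(500 * 1024, 25 * 1024), 1))
```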
MIT
Contributions are welcome! Please feel free to submit issues or pull requests.
Built with:
- MCP SDK - Model Context Protocol
- trafilatura - Web content extraction
- BeautifulSoup4 - HTML parsing