mre commented Sep 7, 2025

This PR implements a comprehensive per-host rate limiting system with adaptive backoff, statistics tracking, and configurable concurrency controls for lychee.

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once

Key Features

  • Per-host rate limiting: Uses token bucket algorithm with governor crate for precise rate control
  • Adaptive backoff: Automatically adjusts request timing based on server responses (429, 5xx errors)
  • Host-specific controls: Configurable concurrency limits and request intervals per host
  • Comprehensive statistics: Tracks requests, success rates, response times, and cache performance
  • Multiple output formats: Compact, detailed, markdown, and JSON formats for host statistics
  • Cache integration: Per-host caching with TTL and hit/miss tracking
  • Configuration flexibility: Both CLI options and config file support for per-host settings
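
For readers unfamiliar with the token bucket idea behind the per-host limiter, here is a minimal std-only sketch. It is not governor's API or internals (governor implements GCRA, a related algorithm), and `TokenBucket` with its methods is a hypothetical name used only for illustration. The caller passes in the current time so the behavior is easy to reason about.

```rust
use std::time::Instant;

/// A minimal token bucket: holds up to `capacity` tokens and refills at
/// `refill_per_sec` tokens per second. Hypothetical type, for illustration only.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: f64::from(capacity),
            // Start full so that an initial burst of `capacity` requests is allowed.
            tokens: f64::from(capacity),
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Adds the tokens earned since the last refill, capped at `capacity`.
    fn refill(&mut self, now: Instant) {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
    }

    /// Takes one token if available and reports whether the request may proceed.
    fn try_acquire(&mut self, now: Instant) -> bool {
        self.refill(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In a per-host setup, one such bucket would be kept per host name, so a burst against one host does not consume another host's budget.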

CLI Changes

  • New --host-stats flag to display per-host statistics at the end of runs
  • New --default-host-concurrency option to set per-host concurrency limits
  • New --default-request-interval option to set minimum intervals between requests
  • Host-specific configuration support via config file

Configuration Options

CLI Usage

# Basic usage with host statistics
lychee --host-stats README.md

# Configure global defaults
lychee --default-host-concurrency 5 --default-request-interval 200ms --host-stats docs/

# Different output formats
lychee --host-stats --format json docs/
lychee --host-stats --format markdown docs/

Config File Support (lychee.toml)

# Global defaults (can be overridden by CLI)
default_host_concurrency = 10
default_request_interval = "100ms"
host_stats = true

# Per-host overrides

[hosts."api.example.com"]
concurrency = 1                # One request at a time
request_interval = "1s"        # Very conservative for sensitive APIs

[hosts."github.com"]
concurrency = 20               # Higher concurrency for GitHub API
request_interval = "100ms"     # Optimistic interval
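
To make the precedence between global defaults and per-host overrides concrete, here is a rough sketch of how the lookup could work. The struct and method names (`RateLimitConfig`, `HostOverride`, `settings_for`) are hypothetical and not taken from the PR's code; the sketch only assumes that each override field is optional and falls back to the matching global default.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Optional per-host overrides, mirroring a `[hosts."..."]` table (hypothetical type).
#[derive(Debug, Clone, Default)]
struct HostOverride {
    concurrency: Option<usize>,
    request_interval: Option<Duration>,
}

/// Effective settings for one host after merging defaults and overrides.
#[derive(Debug, Clone, PartialEq)]
struct HostSettings {
    concurrency: usize,
    request_interval: Duration,
}

/// Global defaults plus the per-host override table (hypothetical type).
struct RateLimitConfig {
    default_host_concurrency: usize,
    default_request_interval: Duration,
    hosts: HashMap<String, HostOverride>,
}

impl RateLimitConfig {
    /// Looks up `host` and falls back to the global defaults field by field.
    fn settings_for(&self, host: &str) -> HostSettings {
        let host_override = self.hosts.get(host);
        HostSettings {
            concurrency: host_override
                .and_then(|o| o.concurrency)
                .unwrap_or(self.default_host_concurrency),
            request_interval: host_override
                .and_then(|o| o.request_interval)
                .unwrap_or(self.default_request_interval),
        }
    }
}
```

With the config above, `settings_for("api.example.com")` would yield concurrency 1 and a 1s interval, while a host with no entry would get the defaults (10 and 100ms).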

Breaking Changes

  • Replaced global cache with per-host caching architecture

Other Changes

  • Updated Client API to include host statistics methods
  • Added new dependencies: governor, humantime-serde

Performance Impact

  • Positive: More efficient request distribution and reduced server load
  • Minimal overhead: Rate limiting adds <1ms latency per request
  • Memory efficient: Per-host data structures with automatic cleanup

Examples

Host Statistics Output

📊 Per-host Statistics
────────────────────────────────────────────────────────────
github.com        │     45 reqs │   95.6% success │   320ms median │   12.5% cache
docs.rs           │     23 reqs │  100.0% success │   180ms median │    0.0% cache
api.rust-lang.org │     12 reqs │   91.7% success │   450ms median │   25.0% cache

JSON Output Format

{
  "host_statistics": {
    "github.com": {
      "total_requests": 45,
      "successful_requests": 43,
      "success_rate": 95.6,
      "rate_limited": 1,
      "client_errors": 1,
      "server_errors": 0,
      "median_request_time_ms": 320,
      "cache_hits": 5,
      "cache_misses": 35,
      "cache_hit_rate": 12.5
    }
  }
}
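
The two percentage fields can be derived from the raw counters. The sketch below assumes `success_rate` is successful over total requests and `cache_hit_rate` is hits over hits plus misses, which is consistent with the sample numbers above (43/45 and 5/40) but is not confirmed by the PR's code. `HostCounters` and `percentage` are hypothetical names.

```rust
/// Raw per-host counters, named after the JSON fields above (hypothetical type).
struct HostCounters {
    total_requests: u64,
    successful_requests: u64,
    cache_hits: u64,
    cache_misses: u64,
}

/// Percentage of `part` in `whole`, rounded to one decimal place.
/// Returns 0.0 when `whole` is zero to avoid dividing by zero.
fn percentage(part: u64, whole: u64) -> f64 {
    if whole == 0 {
        return 0.0;
    }
    let pct = part as f64 / whole as f64 * 100.0;
    (pct * 10.0).round() / 10.0
}

impl HostCounters {
    /// Share of requests that succeeded.
    fn success_rate(&self) -> f64 {
        percentage(self.successful_requests, self.total_requests)
    }

    /// Share of cache lookups that were hits.
    fn cache_hit_rate(&self) -> f64 {
        percentage(self.cache_hits, self.cache_hits + self.cache_misses)
    }
}
```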

mre commented Sep 7, 2025

Looks like crates.io has made some changes. I'll rerun it tomorrow to see if the issue remains.

mre commented Sep 7, 2025

Fixed: File URLs (file://) were incorrectly being processed through the rate limiting system, causing debug errors like:

   [DEBUG] Failed to record cache miss for file:///private/tmp/input.md#development-environment: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#development-environment: URL contains no host component
   [DEBUG] Failed to record cache miss for file:///private/tmp/input.md#react-native-awesome-components: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#react-native-awesome-components: URL contains no host component

Solution: File URLs are now properly excluded from rate limiting tracking since they're local filesystem operations, not network requests.
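
A rough sketch of that exclusion: only http and https URLs yield a host to rate-limit on, and everything else bypasses the limiter. This is a simplified std-only stand-in (the real code presumably works on parsed URI types), `rate_limit_key` is a hypothetical name, and the naive host extraction does not handle IPv6 literals.

```rust
/// Returns the host to rate-limit on, or `None` when the URL should bypass
/// rate limiting (file URLs, mailto links, anything without an http(s) host).
fn rate_limit_key(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let is_network = scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https");
    if !is_network {
        return None;
    }
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host_and_port = authority.rsplit('@').next()?;
    let host = host_and_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}
```

A caller would skip both the limiter and the statistics update whenever this returns `None`, which avoids the "URL contains no host component" error shown above.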

thomas-zahner left a comment

First review batch (10/27 files viewed) 😉

mre commented Sep 11, 2025

Good points, thanks! I'll go through the review once I find the time.

mre added 11 commits September 18, 2025 16:09
Fixes fragment checking for JavaDoc-generated HTML which uses
<a name="anchor"> instead of id attributes for anchors.

This resolves a regression where lychee v0.20.1 was failing to find
fragments that worked in v0.18.1, particularly for JavaDoc URLs like:
- https://example.com/javadoc/Class.html#method--
- https://example.com/javadoc/Class.html#skip.navbar.top

The fix maintains backward compatibility by checking both 'id' and
'name' attributes when extracting fragments from HTML documents.

Resolves #1838

Add comprehensive per-host rate limiting system with adaptive backoff,
statistics tracking, and configurable concurrency controls.

Features:
- Per-host rate limiting using token bucket algorithm with governor crate
- Adaptive backoff based on server responses (429, 5xx errors)
- Host-specific request concurrency and interval controls
- Comprehensive statistics tracking (requests, success rates, response times)
- Cache hit/miss tracking per host with configurable TTL
- Multiple output formats for host statistics (compact, detailed, markdown, json)
- CLI flag --host-stats to display per-host statistics
- Configuration options for default host concurrency and request intervals

Implementation:
- Clean module structure: ratelimit/host/{host.rs, stats.rs, key.rs}
- Window data structure for rolling request time averages
- DashMap for thread-safe per-host caching with expiration
- Integration with existing cache system for persistent storage
- Formatter system matching existing lychee output styles
- Comprehensive error handling and logging

Breaking changes:
- Removed global cache in favor of clean per-host caching architecture
- Updated Client API to include host statistics methods
- Added new dependencies: governor, humantime-serde

All linting and formatting requirements satisfied.

File URLs don't have host components and should not be tracked in the
per-host rate limiting system. Only network URIs (http/https) need
rate limiting and statistics tracking.

Fixes debug errors like:
  Failed to record cache miss for file:///path#fragment:
  Rate limiting error: URL contains no host component

- Add debug messages when hosts hit rate limits (429 responses)
- Add debug messages when applying backoff delays
- Show exponential backoff progression in debug logs
- Change 'cache' to 'cached' in host statistics output for clarity

Debug output example:
  Host httpbin.org hit rate limit (429), increasing backoff from 0ms to 500ms
  Host httpbin.org applying backoff delay of 500ms due to previous rate limiting or errors

Statistics output now shows '0.0% cached' instead of '0.0% cache'

The per-host implementation was creating separate cookie jars for each host,
which broke the global --cookie-jar functionality. Now all hosts share the
same cookie jar when one is provided, while still maintaining separate
rate limiting and statistics per host.

Fixes test_cookie_jar test.

Per-host HTTP clients were missing the User-Agent header and other global
headers, causing some sites like crates.io to return 403 Forbidden errors.
Now all per-host clients inherit the global headers (User-Agent, custom
headers) while still allowing host-specific header overrides.

Fixes test_crates_io_quirk test.

Per-host clients were not respecting the max_redirects configuration,
causing redirect tests to fail. Each host now creates its own reqwest
client with proper redirect policy, timeout, and security settings
matching the main client configuration.

Fixes test_prevent_too_many_redirects test.
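
The commit messages above mention an exponential backoff that starts at 500ms after the first 429. A minimal sketch of such a progression follows. Only the 0ms to 500ms step comes from the debug output; the doubling factor, the 30s cap, and the reset on a successful response are assumptions, and `next_backoff` is a hypothetical name.

```rust
use std::time::Duration;

/// First delay after a rate-limit or server error (taken from the debug log above).
const INITIAL_BACKOFF: Duration = Duration::from_millis(500);
/// Upper bound on the delay (assumed value, not from the PR).
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Computes the next per-host backoff from the current one and the latest status code.
fn next_backoff(current: Duration, status: u16) -> Duration {
    let should_back_off = status == 429 || (500..600).contains(&status);
    if !should_back_off {
        // Assumed policy: any other response clears the backoff.
        return Duration::ZERO;
    }
    if current.is_zero() {
        INITIAL_BACKOFF
    } else {
        // Double the delay, but never exceed the cap.
        (current * 2).min(MAX_BACKOFF)
    }
}
```

Under these assumptions, repeated 429s from one host would produce delays of 500ms, 1s, 2s, 4s and so on, up to the cap.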
@mre mre force-pushed the feat/per-host-rate-limiting branch from eaab15f to f6673ba Compare September 18, 2025 14:23
mre commented Sep 24, 2025

Just so that I don't forget, I will need to fix the behavior when setting the request interval to 0. I'm getting some parsing errors.

github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Oct 1, 2025
Until lycheeverse/lychee#1844 is shipped, we
silence 429s as valid status codes whenever links are being checked by
lychee.
thomas-zahner commented

@mre I'm seeing a considerable performance regression with this PR. It might be related to the concurrency changes, and maybe also to my comment here?

I haven't done any thorough analysis, but I noticed it just by running lychee on the README and wanted to let you know. I switched between master and this PR several times, and the performance degradation is very noticeable.

master branch

➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 1s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.36s user 0.06s system 22% cpu 1.836 total
➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 0s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.35s user 0.05s system 58% cpu 0.671 total
➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 1s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.39s user 0.06s system 28% cpu 1.594 total

this PR

Sometimes I even felt like the link check process was stuck in the end.

➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 7s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.72s user 0.12s system 23% cpu 7.689 total
➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 6s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.72s user 0.10s system 26% cpu 6.825 total
➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links  🔍 111 Total (in 7s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.73s user 0.12s system 25% cpu 7.358 total

mre commented Oct 9, 2025

I will test this on my machine when I get the time.
We discussed offline that it's probably related to our new conservative defaults (10 concurrent requests per host, 100ms delay between requests). We could tweak those settings.
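
As a back-of-the-envelope check on how a fixed delay could add up, here is a small sketch. It assumes the interval is enforced between consecutive request starts to the same host, which is an assumption about the implementation, and `min_time_to_last_start` is a hypothetical helper. The per-host link counts used in the note below are made up, since the README's actual distribution isn't stated in this thread.

```rust
use std::time::Duration;

/// Earliest point at which the last of `requests` requests to a single host
/// can start, if consecutive request starts must be `interval` apart.
fn min_time_to_last_start(requests: u32, interval: Duration) -> Duration {
    // The first request starts immediately; each later one waits one interval.
    interval * requests.saturating_sub(1)
}
```

For example, if 50 links pointed at one host, a 100ms interval alone would put a floor of 4.9s on the run, regardless of how many requests are allowed in flight at once.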

mre commented Oct 9, 2025

Todo:
