feat: implement per-host rate limiting and statistics #1844

mre · 2025-09-07T18:49:49Z

This PR implements a comprehensive per-host rate limiting system with adaptive backoff, statistics tracking, and configurable concurrency controls for lychee.

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once

Key Features

Per-host rate limiting: Uses token bucket algorithm with governor crate for precise rate control
Adaptive backoff: Automatically adjusts request timing based on server responses (429, 5xx errors)
Host-specific controls: Configurable concurrency limits and request intervals per host
Comprehensive statistics: Tracks requests, success rates, response times, and cache performance
Multiple output formats: Compact, detailed, markdown, and JSON formats for host statistics
Cache integration: Per-host caching with TTL and hit/miss tracking
Configuration flexibility: Both CLI options and config file support for per-host settings
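
To make the token bucket idea above concrete, here is a minimal std-only sketch. It is not the `governor` API that this PR uses; `TokenBucket` and `PerHostLimiter` are hypothetical names, and the caller passes the current time explicitly so the behaviour is easy to reason about.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hand-rolled token bucket for illustration only (the PR relies on `governor`).
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts full, so a burst of `capacity` requests is allowed immediately.
    fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: f64::from(capacity),
            tokens: f64::from(capacity),
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Refills based on elapsed time, then tries to take one token.
    /// Returns false if the caller has to wait.
    fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// One independent bucket per host name (hypothetical wrapper).
struct PerHostLimiter {
    capacity: u32,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl PerHostLimiter {
    fn new(capacity: u32, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Creates the host's bucket on first use, then tries to take a token from it.
    fn try_acquire(&mut self, host: &str, now: Instant) -> bool {
        let (capacity, rate) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(host.to_string())
            .or_insert_with(|| TokenBucket::new(capacity, rate, now))
            .try_acquire(now)
    }
}
```

With a capacity of 2, two immediate requests to one host pass and the third is refused until tokens refill, while a different host is unaffected because it has its own bucket.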

CLI Changes

New --host-stats flag to display per-host statistics at the end of runs
New --default-host-concurrency option to set per-host concurrency limits
New --default-request-interval option to set minimum intervals between requests
Host-specific configuration support via config file

Configuration Options

CLI Usage

# Basic usage with host statistics
lychee --host-stats README.md

# Configure global defaults
lychee --default-host-concurrency 5 --default-request-interval 200ms --host-stats docs/

# Different output formats
lychee --host-stats --format json docs/
lychee --host-stats --format markdown docs/

Config File Support (`lychee.toml`)

# Global defaults (can be overridden by CLI)
default_host_concurrency = 10
default_request_interval = "100ms"
host_stats = true

# Per-host overrides

[hosts."api.example.com"]
concurrency = 1             # One request at a time
request_interval = "1s"     # Very conservative for sensitive APIs

[hosts."github.com"]
concurrency = 20            # Higher concurrency for GitHub API
request_interval = "100ms"  # Optimistic interval
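
A rough sketch of how the per-host overrides above could fall back to the global defaults. The type and function names (`HostSettings`, `HostOverride`, `RateLimitConfig`, `settings_for`, `example_config`) are hypothetical and not taken from the PR; the values mirror the example `lychee.toml`.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Fully resolved settings for one host.
#[derive(Debug, Clone, Copy, PartialEq)]
struct HostSettings {
    concurrency: usize,
    request_interval: Duration,
}

/// A `[hosts."..."]` table; every field is optional.
#[derive(Debug, Clone, Copy, Default)]
struct HostOverride {
    concurrency: Option<usize>,
    request_interval: Option<Duration>,
}

struct RateLimitConfig {
    defaults: HostSettings,
    hosts: HashMap<String, HostOverride>,
}

impl RateLimitConfig {
    /// Per-host values win; anything not overridden falls back to the defaults.
    fn settings_for(&self, host: &str) -> HostSettings {
        let o = self.hosts.get(host).copied().unwrap_or_default();
        HostSettings {
            concurrency: o.concurrency.unwrap_or(self.defaults.concurrency),
            request_interval: o.request_interval.unwrap_or(self.defaults.request_interval),
        }
    }
}

/// Builds the configuration from the example file by hand (no TOML parsing here).
fn example_config() -> RateLimitConfig {
    let mut hosts = HashMap::new();
    hosts.insert(
        "api.example.com".to_string(),
        HostOverride {
            concurrency: Some(1),
            request_interval: Some(Duration::from_secs(1)),
        },
    );
    hosts.insert(
        "github.com".to_string(),
        HostOverride {
            concurrency: Some(20),
            request_interval: Some(Duration::from_millis(100)),
        },
    );
    RateLimitConfig {
        defaults: HostSettings {
            concurrency: 10,
            request_interval: Duration::from_millis(100),
        },
        hosts,
    }
}
```

A host without its own table, such as `docs.rs`, resolves to the global defaults (10 concurrent requests, 100ms interval).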

Breaking Changes

Replaced global cache with per-host caching architecture

Other Changes

Updated Client API to include host statistics methods
Added new dependencies: governor, humantime-serde

Performance Impact

Positive: More efficient request distribution and reduced server load
Minimal overhead: Rate limiting adds <1ms latency per request
Memory efficient: Per-host data structures with automatic cleanup

Examples

Host Statistics Output

📊 Per-host Statistics
────────────────────────────────────────────────────────────
github.com        │     45 reqs │   95.6% success │   320ms median │   12.5% cache
docs.rs           │     23 reqs │  100.0% success │   180ms median │    0.0% cache
api.rust-lang.org │     12 reqs │   91.7% success │   450ms median │   25.0% cache

JSON Output Format

{
  "host_statistics": {
    "github.com": {
      "total_requests": 45,
      "successful_requests": 43,
      "success_rate": 95.6,
      "rate_limited": 1,
      "client_errors": 1,
      "server_errors": 0,
      "median_request_time_ms": 320,
      "cache_hits": 5,
      "cache_misses": 35,
      "cache_hit_rate": 12.5
    }
  }
}
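
For reference, the derived fields in the JSON above follow from the raw counters. This is a small sketch, assuming `success_rate` is successful over total requests and `cache_hit_rate` is hits over hits plus misses, both rounded to one decimal place. `HostCounters` and `percentage` are hypothetical names.

```rust
/// Percentage of `part` in `total`, rounded to one decimal place.
/// Returns 0.0 when `total` is zero to avoid dividing by zero.
fn percentage(part: u64, total: u64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    let raw = part as f64 / total as f64 * 100.0;
    (raw * 10.0).round() / 10.0
}

/// Raw per-host counters, using the field names from the JSON example.
struct HostCounters {
    total_requests: u64,
    successful_requests: u64,
    cache_hits: u64,
    cache_misses: u64,
}

impl HostCounters {
    fn success_rate(&self) -> f64 {
        percentage(self.successful_requests, self.total_requests)
    }

    /// Assumption: the hit rate is computed over cache lookups, not over all requests.
    fn cache_hit_rate(&self) -> f64 {
        percentage(self.cache_hits, self.cache_hits + self.cache_misses)
    }
}
```

Plugging in the `github.com` numbers (43 of 45 successful, 5 hits and 35 misses) reproduces the 95.6 and 12.5 shown in the example.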

mre · 2025-09-07T19:35:30Z

Looks like crates.io has made some changes. I'll rerun it tomorrow to see if the issue remains.

mre · 2025-09-07T20:00:59Z

Fixed: File URLs (`file://`) were incorrectly being processed through the rate limiting system, causing debug errors like:

   [DEBUG] Failed to record cache miss for file:///private/tmp/input.md#development-environment: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#development-environment: URL contains no host component
   [DEBUG] Failed to record cache miss for file:///private/tmp/input.md#react-native-awesome-components: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#react-native-awesome-components: URL contains no host component

Solution: File URLs are now properly excluded from rate limiting tracking since they're local filesystem operations, not network requests.
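
The check boils down to something like the following sketch. `needs_rate_limiting` is a hypothetical helper working on plain strings; the actual code in the PR may inspect a parsed URI type instead.

```rust
/// Returns true only for network URLs (http/https).
/// Anything else, such as `file://` or `mailto:`, is skipped by the rate limiter.
fn needs_rate_limiting(url: &str) -> bool {
    match url.split_once("://") {
        Some((scheme, _rest)) => {
            scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https")
        }
        // No "scheme://" prefix at all, so there is no host to track.
        None => false,
    }
}
```

With this, the `file:///private/tmp/input.md#development-environment` URL from the log above is never handed to the per-host tracking.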

thomas-zahner

First review batch (10/27 files viewed) 😉

README.md

lychee-bin/src/client.rs

lychee-bin/src/formatters/host_stats/compact.rs

README.md

lychee-bin/src/options.rs

lychee-lib/src/checker/website.rs

mre · 2025-09-11T22:34:46Z

Good points, thanks! I'll go through the review once I find the time.

Fixes fragment checking for JavaDoc-generated HTML which uses <a name="anchor"> instead of id attributes for anchors. This resolves a regression where lychee v0.20.1 was failing to find fragments that worked in v0.18.1, particularly for JavaDoc URLs like:

- https://example.com/javadoc/Class.html#method--
- https://example.com/javadoc/Class.html#skip.navbar.top

The fix maintains backward compatibility by checking both 'id' and 'name' attributes when extracting fragments from HTML documents.

Resolves #1838

Add comprehensive per-host rate limiting system with adaptive backoff, statistics tracking, and configurable concurrency controls.

Features:

- Per-host rate limiting using token bucket algorithm with governor crate
- Adaptive backoff based on server responses (429, 5xx errors)
- Host-specific request concurrency and interval controls
- Comprehensive statistics tracking (requests, success rates, response times)
- Cache hit/miss tracking per host with configurable TTL
- Multiple output formats for host statistics (compact, detailed, markdown, json)
- CLI flag --host-stats to display per-host statistics
- Configuration options for default host concurrency and request intervals

Implementation:

- Clean module structure: ratelimit/host/{host.rs, stats.rs, key.rs}
- Window data structure for rolling request time averages
- DashMap for thread-safe per-host caching with expiration
- Integration with existing cache system for persistent storage
- Formatter system matching existing lychee output styles
- Comprehensive error handling and logging

Breaking changes:

- Removed global cache in favor of clean per-host caching architecture
- Updated Client API to include host statistics methods
- Added new dependencies: governor, humantime-serde

All linting and formatting requirements satisfied.

File URLs don't have host components and should not be tracked in the per-host rate limiting system. Only network URIs (http/https) need rate limiting and statistics tracking. Fixes debug errors like: Failed to record cache miss for file:///path#fragment: Rate limiting error: URL contains no host component

- Add debug messages when hosts hit rate limits (429 responses)
- Add debug messages when applying backoff delays
- Show exponential backoff progression in debug logs
- Change 'cache' to 'cached' in host statistics output for clarity

Debug output example:

Host httpbin.org hit rate limit (429), increasing backoff from 0ms to 500ms
Host httpbin.org applying backoff delay of 500ms due to previous rate limiting or errors

Statistics output now shows '0.0% cached' instead of '0.0% cache'
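
The backoff progression described here can be sketched roughly as follows. Only the first step (0ms to 500ms) is taken from the debug output; doubling on each further 429/5xx, the 30s cap, and resetting on any other status are assumptions for illustration. `Backoff` is a hypothetical type.

```rust
use std::time::Duration;

/// First delay after a rate-limited response (matches the debug log).
const INITIAL_BACKOFF: Duration = Duration::from_millis(500);
/// Upper bound for the delay (assumed value, not from the PR).
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Per-host backoff state.
struct Backoff {
    current: Duration,
}

impl Backoff {
    fn new() -> Self {
        Self { current: Duration::ZERO }
    }

    /// Updates the delay from an HTTP status code and returns the delay
    /// to apply before the next request to this host.
    fn on_response(&mut self, status: u16) -> Duration {
        if status == 429 || (500..600).contains(&status) {
            self.current = if self.current.is_zero() {
                INITIAL_BACKOFF
            } else {
                // Exponential growth, capped at MAX_BACKOFF.
                (self.current * 2).min(MAX_BACKOFF)
            };
        } else {
            // Assumption: any other response clears the backoff entirely.
            self.current = Duration::ZERO;
        }
        self.current
    }
}
```

Under these assumptions, repeated 429s for one host yield delays of 500ms, 1s, 2s and so on until the cap, and a single successful response brings the delay back to zero.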

The per-host implementation was creating separate cookie jars for each host, which broke the global --cookie-jar functionality. Now all hosts share the same cookie jar when one is provided, while still maintaining separate rate limiting and statistics per host. Fixes test_cookie_jar test.

Per-host HTTP clients were missing the User-Agent header and other global headers, causing some sites like crates.io to return 403 Forbidden errors. Now all per-host clients inherit the global headers (User-Agent, custom headers) while still allowing host-specific header overrides. Fixes test_crates_io_quirk test.

Per-host clients were not respecting the max_redirects configuration, causing redirect tests to fail. Each host now creates its own reqwest client with proper redirect policy, timeout, and security settings matching the main client configuration. Fixes test_prevent_too_many_redirects test.

…e example formatting in README

--default-host-concurrency -> --host-concurrency
--default-request-interval -> --request-interval

mre · 2025-09-24T11:47:43Z

Just so that I don't forget, I will need to fix the behavior when setting the request interval to 0. I'm getting some parsing errors.

Until lycheeverse/lychee#1844 is shipped, we silence 429s as valid status codes whenever links are being checked by lychee.

thomas-zahner · 2025-10-03T09:06:43Z

@mre I seem to have a considerable performance regression with this PR. This might be related to the concurrency changes, and maybe also to my comment here?

I haven't done any thorough analysis, but I noticed it just by running lychee on the README and wanted to let you know. I switched multiple times between master and this PR and the performance degradation is very noticeable.

master branch

➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                         🔍 111 Total (in 1s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.36s user 0.06s system 22% cpu 1.836 total
➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                         🔍 111 Total (in 0s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.35s user 0.05s system 58% cpu 0.671 total
➜  lychee git:(master) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                         🔍 111 Total (in 1s) ✅ 104 OK 🚫 0 Errors 🔀 7 Redirects
./target/debug/lychee README.md  0.39s user 0.06s system 28% cpu 1.594 total

this PR

Sometimes I even felt like the link check process was stuck in the end.

➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                                                                                                                                🔍 111 Total (in 7s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.72s user 0.12s system 23% cpu 7.689 total
➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                                                                                                                                🔍 111 Total (in 6s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.72s user 0.10s system 26% cpu 6.825 total
➜  lychee git:(feat/per-host-rate-limiting) time ./target/debug/lychee README.md
   [WARN ] Error creating request: InvalidPathToUri("/CONTRIBUTING.md")
  111/111 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                                                                                                                                                🔍 111 Total (in 7s) ✅ 111 OK 🚫 0 Errors
./target/debug/lychee README.md  1.73s user 0.12s system 25% cpu 7.358 total

mre · 2025-10-09T08:30:37Z

I will test this on my machine when I get the time.
We discussed offline that it's probably related to our new conservative defaults (10 concurrent requests per host, 100ms delay between requests). We can tweak those settings maybe.
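
As a back-of-the-envelope check on how much the interval alone can cost, here is a small sketch. It assumes the interval is enforced strictly between consecutive request starts to a single host, and it ignores bursts, concurrency and network latency. `min_wall_time` is a hypothetical helper, and the request count of 50 is made up for illustration.

```rust
use std::time::Duration;

/// Lower bound on the time needed to start `requests` requests to one host
/// when consecutive requests must be at least `interval` apart.
fn min_wall_time(requests: u32, interval: Duration) -> Duration {
    // The first request starts immediately; each later one waits one interval.
    interval * requests.saturating_sub(1)
}

/// Example with the 100ms default interval mentioned above.
fn example_lower_bound() -> Duration {
    min_wall_time(50, Duration::from_millis(100))
}
```

Under these assumptions, 50 links to the same host already cost at least 4.9s with a 100ms interval, so a README that mostly links to one host could plausibly account for a slowdown of several seconds.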

mre · 2025-10-09T08:35:24Z

Todo:

rebase
Address https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264
Perhaps we can use an Arc instead
Don't return the client from check()

mre force-pushed the feat/per-host-rate-limiting branch 3 times, most recently from bd02016 to eaab15f Compare September 8, 2025 23:53

This was referenced Sep 9, 2025

Duplicate links still checked #1815

Open

Add Per-Host Rate Limiting and Caching #1605

Open

Improve or add more information when a "Network error" error is reported #1723

Closed

thomas-zahner reviewed Sep 11, 2025

mre added 11 commits September 18, 2025 16:09

Fix lints

619ca3e

Bring back global headers (e.g. for user-agent)

7f2c50f

Pass missing args: max_redirects, timeout, allow_insecure

cb6eac6

Refactor host stats formatters to remove unused parameters and improv…

f6673ba

…e example formatting in README

mre force-pushed the feat/per-host-rate-limiting branch from eaab15f to f6673ba Compare September 18, 2025 14:23

mre added 7 commits September 18, 2025 16:28

remove confusing comment

02f3755

Create display_per_host_statistics in separate file

4760b17

Remove redundant check for self.hosts

f5aeef4

Import std::collections::HashMap

0f4aab5

Use closures instead of if

31c8592

Rename flags:

a41ea00

--default-host-concurrency -> --host-concurrency --default-request-interval -> --request-interval

Fix help formatting

5e415ad

thomaseizinger mentioned this pull request Oct 1, 2025

ci: silence 429s errors in link checker firezone/firezone#10495

Merged

github-merge-queue bot pushed a commit to firezone/firezone that referenced this pull request Oct 1, 2025

ci: silence 429s errors in link checker (#10495)

b4fae70

Until lycheeverse/lychee#1844 is shipped, we silence 429s as valid status codes whenever links are being checked by lychee.

Reduce code duplication

18393a7

thomas-zahner mentioned this pull request Oct 17, 2025

Need help with 429 errors for GitHub links #1873

Open

thomas-zahner force-pushed the master branch 2 times, most recently from fcdf77c to e0912ab Compare October 21, 2025 12:53

ofek mentioned this pull request Oct 26, 2025

Retry policy of network errors #1884

Open
