@thomas-zahner thomas-zahner commented Nov 21, 2025

Supersedes #1844. Rebased onto master to resolve conflicts. Since there were many conflicts and I spent some time resolving them, I didn't force-push to the existing branch in case I messed something up (it seems I didn't).

Things to be done:

  • Don't return Client from check & update Arcs (https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264)
  • Do thorough manual testing and potentially reconsider new default settings
  • Fix the behavior when setting the request interval to 0
  • Think about adding additional tests (CLI test for host-specific headers, verify expected timing with different host concurrencies & request intervals)

Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Fixes #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once

@thomas-zahner thomas-zahner changed the title Per host rate limiting feat: implement per-host rate limiting and statistics Nov 27, 2025
mre and others added 20 commits November 27, 2025 13:53
Fixes fragment checking for JavaDoc-generated HTML which uses
<a name="anchor"> instead of id attributes for anchors.

This resolves a regression where lychee v0.20.1 was failing to find
fragments that worked in v0.18.1, particularly for JavaDoc URLs like:
- https://example.com/javadoc/Class.html#method--
- https://example.com/javadoc/Class.html#skip.navbar.top

The fix maintains backward compatibility by checking both 'id' and
'name' attributes when extracting fragments from HTML documents.

Resolves #1838
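
To illustrate the idea of accepting both `id` and `name` attributes as fragment targets, here is a minimal sketch. It is not lychee's actual extractor: `extract_fragments` is a hypothetical helper, and it uses a naive string scan (double-quoted values only) instead of a real HTML parser.

```rust
use std::collections::HashSet;

/// Collect fragment targets from `id="..."` and `name="..."` attributes.
/// Hypothetical helper for illustration; a naive scan, not a real HTML parser.
fn extract_fragments(html: &str) -> HashSet<String> {
    let mut fragments = HashSet::new();
    for attr in ["id=\"", "name=\""] {
        let mut rest = html;
        while let Some(start) = rest.find(attr) {
            // Require whitespace before the attribute so that e.g.
            // `data-id="..."` is not mistaken for `id="..."`.
            let standalone = rest[..start]
                .chars()
                .next_back()
                .map_or(false, char::is_whitespace);
            let after = &rest[start + attr.len()..];
            match after.find('"') {
                Some(end) => {
                    if standalone {
                        fragments.insert(after[..end].to_string());
                    }
                    rest = &after[end + 1..];
                }
                // Unterminated attribute value: stop scanning for this attribute.
                None => break,
            }
        }
    }
    fragments
}
```

With this sketch, a JavaDoc-style anchor such as `<a name="method--">` is found alongside regular `id` targets.
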
Add comprehensive per-host rate limiting system with adaptive backoff,
statistics tracking, and configurable concurrency controls.

Features:
- Per-host rate limiting using token bucket algorithm with governor crate
- Adaptive backoff based on server responses (429, 5xx errors)
- Host-specific request concurrency and interval controls
- Comprehensive statistics tracking (requests, success rates, response times)
- Cache hit/miss tracking per host with configurable TTL
- Multiple output formats for host statistics (compact, detailed, markdown, json)
- CLI flag --host-stats to display per-host statistics
- Configuration options for default host concurrency and request intervals
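
The per-host token bucket idea listed above can be sketched with the standard library alone. This is a simplified illustration, not the PR's code, which uses the governor crate: `TokenBucket`, `HostLimiter`, the `burst` parameter, and the choice to treat a zero interval as "unlimited" are all assumptions made for this example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Token bucket for one host: holds up to `capacity` tokens, refilled continuously.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// `interval` must be non-zero; the caller below guarantees this.
    fn new(capacity: u32, interval: Duration, now: Instant) -> Self {
        Self {
            capacity: f64::from(capacity),
            tokens: f64::from(capacity),
            refill_per_sec: 1.0 / interval.as_secs_f64(),
            last_refill: now,
        }
    }

    /// Refill according to elapsed time, then try to take one token.
    fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// One bucket per host, created lazily on first use.
struct HostLimiter {
    burst: u32,
    interval: Duration,
    buckets: HashMap<String, TokenBucket>,
}

impl HostLimiter {
    fn new(burst: u32, interval: Duration) -> Self {
        Self { burst, interval, buckets: HashMap::new() }
    }

    /// Returns true if a request to `host` may be sent at time `now`.
    fn try_acquire(&mut self, host: &str, now: Instant) -> bool {
        // Assumption for this sketch: a zero interval disables rate limiting.
        if self.interval.is_zero() {
            return true;
        }
        let (burst, interval) = (self.burst, self.interval);
        self.buckets
            .entry(host.to_string())
            .or_insert_with(|| TokenBucket::new(burst, interval, now))
            .try_acquire(now)
    }
}
```

Passing `now` explicitly keeps the sketch deterministic; each host drains and refills its own bucket, so a busy host does not slow down requests to other hosts.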

Implementation:
- Clean module structure: ratelimit/host/{host.rs, stats.rs, key.rs}
- Window data structure for rolling request time averages
- DashMap for thread-safe per-host caching with expiration
- Integration with existing cache system for persistent storage
- Formatter system matching existing lychee output styles
- Comprehensive error handling and logging
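
A window for rolling request-time averages, as mentioned in the list above, might look roughly like the following. The type and method names here are assumptions for illustration and may not match the `Window` type in the PR.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Fixed-size window that keeps only the most recent `capacity` samples.
struct Window {
    capacity: usize,
    samples: VecDeque<Duration>,
}

impl Window {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window capacity must be positive");
        Self { capacity, samples: VecDeque::with_capacity(capacity) }
    }

    /// Record a sample, evicting the oldest one when the window is full.
    fn push(&mut self, sample: Duration) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(sample);
    }

    /// Average over the samples currently in the window, or None if empty.
    fn average(&self) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let total: Duration = self.samples.iter().sum();
        Some(total / self.samples.len() as u32)
    }
}
```

Because old samples fall out of the window, the average tracks recent response times instead of the whole run.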

Breaking changes:
- Removed global cache in favor of clean per-host caching architecture
- Updated Client API to include host statistics methods
- Added new dependencies: governor, humantime-serde

All linting and formatting requirements satisfied.

Co-authored-by: Thomas Zahner <[email protected]>
File URLs don't have host components and should not be tracked in the
per-host rate limiting system. Only network URIs (http/https) need
rate limiting and statistics tracking.

Fixes debug errors like:
  Failed to record cache miss for file:///path#fragment:
  Rate limiting error: URL contains no host component
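
The rule described above (only http/https URIs get a host key, file URLs are skipped) can be sketched as follows. `host_key` is a hypothetical helper using plain string handling; real code would more likely rely on a URL parser, and this version does not handle IPv6 literals.

```rust
/// Return a lowercase host key for http/https URIs, or None for anything else
/// (for example `file://` URIs, which have no host to rate limit).
/// Hypothetical helper for illustration; IPv6 literals are not handled.
fn host_key(uri: &str) -> Option<String> {
    let (scheme, rest) = uri.split_once("://")?;
    if !scheme.eq_ignore_ascii_case("http") && !scheme.eq_ignore_ascii_case("https") {
        return None;
    }
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}
```

Callers can then skip statistics and rate limiting whenever the result is `None`, instead of logging an error.
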
- Add debug messages when hosts hit rate limits (429 responses)
- Add debug messages when applying backoff delays
- Show exponential backoff progression in debug logs
- Change 'cache' to 'cached' in host statistics output for clarity

Debug output example:
  Host httpbin.org hit rate limit (429), increasing backoff from 0ms to 500ms
  Host httpbin.org applying backoff delay of 500ms due to previous rate limiting or errors

Statistics output now shows '0.0% cached' instead of '0.0% cache'
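
The backoff progression in the debug output above could be produced by a function along these lines. Only the first step (0ms to 500ms on a 429) comes from the log; the doubling factor, the 30-second cap, and the reset on any other status are assumptions made for this sketch.

```rust
use std::time::Duration;

/// First backoff step, taken from the debug output above.
const INITIAL_BACKOFF: Duration = Duration::from_millis(500);
/// Upper bound on the backoff; an assumption for this sketch.
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Compute the next backoff for a host from its current backoff and the
/// latest response status.
fn next_backoff(current: Duration, status: u16) -> Duration {
    let penalized = status == 429 || (500..600).contains(&status);
    if !penalized {
        // Assumption: any other response clears the backoff.
        return Duration::ZERO;
    }
    if current.is_zero() {
        INITIAL_BACKOFF
    } else {
        // Exponential growth, capped at MAX_BACKOFF.
        (current * 2).min(MAX_BACKOFF)
    }
}
```

Repeated 429 responses would then yield 500ms, 1s, 2s, and so on until the cap is reached.
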
The per-host implementation was creating separate cookie jars for each host,
which broke the global --cookie-jar functionality. Now all hosts share the
same cookie jar when one is provided, while still maintaining separate
rate limiting and statistics per host.

Fixes test_cookie_jar test.
Per-host HTTP clients were missing the User-Agent header and other global
headers, causing some sites like crates.io to return 403 Forbidden errors.
Now all per-host clients inherit the global headers (User-Agent, custom
headers) while still allowing host-specific header overrides.

Fixes test_crates_io_quirk test.
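
The inheritance rule described above (global headers first, host-specific headers override) can be sketched with plain maps. `merge_headers` is a hypothetical helper; the actual implementation presumably works with reqwest's header types instead of `HashMap<String, String>`.

```rust
use std::collections::HashMap;

/// Merge global headers with host-specific ones; host-specific values win.
/// Names are lowercased because HTTP header names are case-insensitive.
fn merge_headers(
    global: &HashMap<String, String>,
    host_specific: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    // Host-specific entries are inserted last, so they replace global ones.
    for (name, value) in global.iter().chain(host_specific.iter()) {
        merged.insert(name.to_ascii_lowercase(), value.clone());
    }
    merged
}
```

A per-host client built from the merged map still sends the global User-Agent unless that host explicitly overrides it.
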
Per-host clients were not respecting the max_redirects configuration,
causing redirect tests to fail. Each host now creates its own reqwest
client with proper redirect policy, timeout, and security settings
matching the main client configuration.

Fixes test_prevent_too_many_redirects test.
--default-host-concurrency -> --host-concurrency
--default-request-interval -> --request-interval