feat: implement per-host rate limiting and statistics #1929
Draft
thomas-zahner wants to merge 25 commits into master from per-host-rate-limiting
+3,047 −488
Fixes fragment checking for JavaDoc-generated HTML which uses <a name="anchor"> instead of id attributes for anchors.
This resolves a regression where lychee v0.20.1 was failing to find fragments that worked in v0.18.1, particularly for JavaDoc URLs like:
- https://example.com/javadoc/Class.html#method--
- https://example.com/javadoc/Class.html#skip.navbar.top
The fix maintains backward compatibility by checking both 'id' and 'name' attributes when extracting fragments from HTML documents.
Resolves #1838
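To illustrate the idea of accepting both attribute kinds, here is a minimal sketch. It is not lychee's actual extractor: `collect_fragments` and `has_fragment` are hypothetical names, and the naive string scan only handles double-quoted attribute values and does not restrict which elements may carry `name`.

```rust
use std::collections::HashSet;

/// Collects candidate fragment targets from `id="..."` and `name="..."`.
/// Naive scan for illustration only; a real extractor would use an HTML parser.
fn collect_fragments(html: &str) -> HashSet<String> {
    let mut fragments = HashSet::new();
    for attr in [" id=\"", " name=\""] {
        let mut rest = html;
        while let Some(start) = rest.find(attr) {
            // The value runs from just after the opening quote to the next quote.
            let tail = &rest[start + attr.len()..];
            match tail.find('"') {
                Some(end) => {
                    fragments.insert(tail[..end].to_string());
                    rest = &tail[end + 1..];
                }
                None => break, // unterminated attribute value: stop scanning
            }
        }
    }
    fragments
}

/// Returns true if `fragment` matches any collected `id` or `name` value.
fn has_fragment(html: &str, fragment: &str) -> bool {
    collect_fragments(html).contains(fragment)
}
```

With this approach a JavaDoc-style `<a name="method--">` anchor and a modern `<h2 id="summary">` heading are both found by the same lookup.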
Add comprehensive per-host rate limiting system with adaptive backoff,
statistics tracking, and configurable concurrency controls.
Features:
- Per-host rate limiting using token bucket algorithm with governor crate
- Adaptive backoff based on server responses (429, 5xx errors)
- Host-specific request concurrency and interval controls
- Comprehensive statistics tracking (requests, success rates, response times)
- Cache hit/miss tracking per host with configurable TTL
- Multiple output formats for host statistics (compact, detailed, markdown, json)
- CLI flag --host-stats to display per-host statistics
- Configuration options for default host concurrency and request intervals
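The token bucket idea behind the per-host limiting can be sketched with the standard library alone. The commit uses the governor crate; the hand-rolled `TokenBucket` and `HostLimiter` types below are hypothetical and only show the principle, with time passed in explicitly so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// A hand-rolled token bucket for one host (illustrative, not governor's API).
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Duration,
}

impl TokenBucket {
    fn new(capacity: u32, refill_per_sec: f64) -> Self {
        TokenBucket {
            capacity: f64::from(capacity),
            tokens: f64::from(capacity), // start with a full bucket
            refill_per_sec,
            last_refill: Duration::ZERO,
        }
    }

    /// `now` is the time elapsed since an arbitrary fixed starting point.
    fn try_acquire(&mut self, now: Duration) -> bool {
        let elapsed = now.saturating_sub(self.last_refill).as_secs_f64();
        // Refill proportionally to elapsed time, never above capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Keeps one independent bucket per host name (hypothetical type).
struct HostLimiter {
    capacity: u32,
    refill_per_sec: f64,
    buckets: HashMap<String, TokenBucket>,
}

impl HostLimiter {
    fn new(capacity: u32, refill_per_sec: f64) -> Self {
        HostLimiter { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if a request to `host` may be sent at time `now`.
    fn try_acquire(&mut self, host: &str, now: Duration) -> bool {
        let (capacity, rate) = (self.capacity, self.refill_per_sec);
        self.buckets
            .entry(host.to_string())
            .or_insert_with(|| TokenBucket::new(capacity, rate))
            .try_acquire(now)
    }
}
```

Because each host has its own bucket, exhausting the budget for one host leaves requests to other hosts unaffected.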
Implementation:
- Clean module structure: ratelimit/host/{host.rs, stats.rs, key.rs}
- Window data structure for rolling request time averages
- DashMap for thread-safe per-host caching with expiration
- Integration with existing cache system for persistent storage
- Formatter system matching existing lychee output styles
- Comprehensive error handling and logging
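A window for rolling request time averages could look roughly like the following. This is a sketch under assumptions: the `Window` type shown here, its fixed capacity, and its method names are illustrative and may differ from the structure in the PR.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Keeps the most recent `capacity` request durations (hypothetical layout).
struct Window {
    capacity: usize,
    samples: VecDeque<Duration>,
}

impl Window {
    fn new(capacity: usize) -> Self {
        Window { capacity, samples: VecDeque::with_capacity(capacity) }
    }

    /// Records a sample, evicting the oldest one once the window is full.
    fn push(&mut self, sample: Duration) {
        if self.capacity == 0 {
            return; // a zero-sized window stores nothing
        }
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(sample);
    }

    /// Mean of the samples currently in the window, or None if it is empty.
    fn average(&self) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let total: Duration = self.samples.iter().sum();
        Some(total / self.samples.len() as u32)
    }
}
```

Bounding the window keeps memory use constant per host and lets the average follow recent server behaviour rather than the whole run.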
Breaking changes:
- Removed global cache in favor of clean per-host caching architecture
- Updated Client API to include host statistics methods
- Added new dependencies: governor, humantime-serde
All linting and formatting requirements satisfied.
Co-authored-by: Thomas Zahner
File URLs don't have host components and should not be tracked in the per-host rate limiting system. Only network URIs (http/https) need rate limiting and statistics tracking.
Fixes debug errors like:
Failed to record cache miss for file:///path#fragment: Rate limiting error: URL contains no host component
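The guard described here can be sketched as a function that yields a host only for http/https URIs. `rate_limit_host` is a hypothetical helper; it uses naive string handling (lowercase schemes only, no IPv6 literals), whereas the real code presumably works on a parsed URL type.

```rust
/// Returns the host to track for rate limiting, or None for URIs that
/// should be skipped (file URLs, other schemes, or a missing host).
/// Naive parsing for illustration; IPv6 literals are not handled.
fn rate_limit_host(uri: &str) -> Option<&str> {
    let rest = uri
        .strip_prefix("http://")
        .or_else(|| uri.strip_prefix("https://"))?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    // Drop optional userinfo ("user@") and port (":8080").
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

A caller would record statistics only when this returns `Some(host)`, so a `file:///path#fragment` URI never reaches the per-host bookkeeping.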
- Add debug messages when hosts hit rate limits (429 responses)
- Add debug messages when applying backoff delays
- Show exponential backoff progression in debug logs
- Change 'cache' to 'cached' in host statistics output for clarity
Debug output example:
Host httpbin.org hit rate limit (429), increasing backoff from 0ms to 500ms
Host httpbin.org applying backoff delay of 500ms due to previous rate limiting or errors
Statistics output now shows '0.0% cached' instead of '0.0% cache'
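The backoff progression in the debug output can be sketched as follows. Only the 0ms to 500ms first step comes from the log above; the doubling factor, the 30 second cap, and the name `next_backoff` are assumptions made for illustration, as is leaving the delay unchanged on other status codes.

```rust
use std::time::Duration;

/// First delay after a rate-limit response, as seen in the debug output.
const INITIAL_BACKOFF: Duration = Duration::from_millis(500);
/// Upper bound on the delay (assumed value, not taken from the PR).
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Computes the next backoff delay for a host from the latest status code.
/// 429 and 5xx responses increase the delay; other codes leave it unchanged.
fn next_backoff(current: Duration, status: u16) -> Duration {
    let should_back_off = status == 429 || (500..600).contains(&status);
    if !should_back_off {
        return current;
    }
    if current.is_zero() {
        INITIAL_BACKOFF
    } else {
        // Assumed exponential step: double the delay, up to the cap.
        (current * 2).min(MAX_BACKOFF)
    }
}
```

Under these assumptions, repeated 429 responses from one host would yield delays of 500ms, 1s, 2s, and so on until the cap is reached.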
The per-host implementation was creating separate cookie jars for each host, which broke the global --cookie-jar functionality. Now all hosts share the same cookie jar when one is provided, while still maintaining separate rate limiting and statistics per host. Fixes test_cookie_jar test.
Per-host HTTP clients were missing the User-Agent header and other global headers, causing some sites like crates.io to return 403 Forbidden errors. Now all per-host clients inherit the global headers (User-Agent, custom headers) while still allowing host-specific header overrides. Fixes test_crates_io_quirk test.
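The inheritance rule (global headers first, host-specific values winning on conflict) can be sketched with plain maps. `headers_for_host` is a hypothetical helper and the string maps stand in for whatever header type the client actually uses; names are lowercased because HTTP header names are case-insensitive.

```rust
use std::collections::HashMap;

/// Builds the header set for one host: global headers are inherited,
/// and host-specific overrides replace entries with the same name.
fn headers_for_host(
    global: &HashMap<String, String>,
    overrides: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    // Overrides are chained last, so they overwrite inherited values.
    for (name, value) in global.iter().chain(overrides.iter()) {
        merged.insert(name.to_ascii_lowercase(), value.clone());
    }
    merged
}
```

With this ordering a host without overrides still sends the global User-Agent, which is the behaviour the crates.io fix depends on.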
Per-host clients were not respecting the max_redirects configuration, causing redirect tests to fail. Each host now creates its own reqwest client with proper redirect policy, timeout, and security settings matching the main client configuration. Fixes test_prevent_too_many_redirects test.
…e example formatting in README
--default-host-concurrency -> --host-concurrency
--default-request-interval -> --request-interval
82f8e9b to 8c5ccbc
Includes removing boilerplate code and removing instances of Arc.
Supersedes #1844. Rebased onto master to resolve conflicts. Because there were many conflicts and resolving them took some time, I didn't force-push to the existing branch in case I messed something up (it seems I didn't).
Things to be done:
- Client from check & update Arcs (https://github.com/lycheeverse/lychee/pull/1844/files#r2360356264)
Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Fixes #1298 - restrict custom HTTP request headers to specific URL patterns
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once