-
-
Notifications
You must be signed in to change notification settings - Fork 183
feat: implement per-host rate limiting and statistics #1844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Looks like crates.io has made some changes. I'll rerun it tomorrow to see if the issue remains. |
|
Fixed: File URLs () were incorrectly being processed through the rate limiting system, causing debug errors like: [DEBUG] Failed to record cache miss for file:///private/tmp/input.md#development-environment: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#development-environment: URL contains no host component
[DEBUG] Failed to record cache miss for file:///private/tmp/input.md#react-native-awesome-components: Rate limiting error: Failed to parse rate limit headers from file:///private/tmp/input.md#react-native-awesome-components: URL contains no host componentSolution: File URLs are now properly excluded from rate limiting tracking since they're local filesystem operations, not network requests. |
bd02016 to
eaab15f
Compare
thomas-zahner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First review batch (10/27 files viewed) 😉
|
Good points, thanks! I'll go through the review once I find the time. |
Fixes fragment checking for JavaDoc-generated HTML which uses <a name="anchor"> instead of id attributes for anchors. This resolves a regression where lychee v0.20.1 was failing to find fragments that worked in v0.18.1, particularly for JavaDoc URLs like: - https://example.com/javadoc/Class.html#method-- - https://example.com/javadoc/Class.html#skip.navbar.top The fix maintains backward compatibility by checking both 'id' and 'name' attributes when extracting fragments from HTML documents. Resolves #1838
Add comprehensive per-host rate limiting system with adaptive backoff,
statistics tracking, and configurable concurrency controls.
Features:
- Per-host rate limiting using token bucket algorithm with governor crate
- Adaptive backoff based on server responses (429, 5xx errors)
- Host-specific request concurrency and interval controls
- Comprehensive statistics tracking (requests, success rates, response times)
- Cache hit/miss tracking per host with configurable TTL
- Multiple output formats for host statistics (compact, detailed, markdown, json)
- CLI flag --host-stats to display per-host statistics
- Configuration options for default host concurrency and request intervals
Implementation:
- Clean module structure: ratelimit/host/{host.rs, stats.rs, key.rs}
- Window data structure for rolling request time averages
- DashMap for thread-safe per-host caching with expiration
- Integration with existing cache system for persistent storage
- Formatter system matching existing lychee output styles
- Comprehensive error handling and logging
Breaking changes:
- Removed global cache in favor of clean per-host caching architecture
- Updated Client API to include host statistics methods
- Added new dependencies: governor, humantime-serde
All linting and formatting requirements satisfied.
File URLs don't have host components and should not be tracked in the per-host rate limiting system. Only network URIs (http/https) need rate limiting and statistics tracking. Fixes debug errors like: Failed to record cache miss for file:///path#fragment: Rate limiting error: URL contains no host component
- Add debug messages when hosts hit rate limits (429 responses) - Add debug messages when applying backoff delays - Show exponential backoff progression in debug logs - Change 'cache' to 'cached' in host statistics output for clarity Debug output example: Host httpbin.org hit rate limit (429), increasing backoff from 0ms to 500ms Host httpbin.org applying backoff delay of 500ms due to previous rate limiting or errors Statistics output now shows '0.0% cached' instead of '0.0% cache'
The per-host implementation was creating separate cookie jars for each host, which broke the global --cookie-jar functionality. Now all hosts share the same cookie jar when one is provided, while still maintaining separate rate limiting and statistics per host. Fixes test_cookie_jar test.
Per-host HTTP clients were missing the User-Agent header and other global headers, causing some sites like crates.io to return 403 Forbidden errors. Now all per-host clients inherit the global headers (User-Agent, custom headers) while still allowing host-specific header overrides. Fixes test_crates_io_quirk test.
Per-host clients were not respecting the max_redirects configuration, causing redirect tests to fail. Each host now creates its own reqwest client with proper redirect policy, timeout, and security settings matching the main client configuration. Fixes test_prevent_too_many_redirects test.
…e example formatting in README
eaab15f to
f6673ba
Compare
--default-host-concurrency -> --host-concurrency --default-request-interval -> --request-interval
|
Just so that I don't forget, I will need to fix the behavior when setting the request interval to 0. I'm getting some parsing errors. |
Until lycheeverse/lychee#1844 is shipped, we silence 429s as valid status codes whenever links are being checked by lychee.
|
@mre I seem to have a considerable performance regression with this PR. This might be related to the concurrency changes and maybe also with my comment here? I haven't done any through analysis or anything but I noticed it just by running lychee on the README and wanted to let you know. I switched multiple times between master and this PR and the performance degradation is very noticeable. master branchthis PRSometimes I even felt like the link check process was stuck in the end. |
|
I will test this on my machine when I get the time. |
|
Todo:
|
fcdf77c to
e0912ab
Compare
This PR implements a comprehensive per-host rate limiting system with adaptive backoff, statistics tracking, and configurable concurrency controls for lychee.
Fixes #1605 - Add Per-Host Rate Limiting and Caching
Fixes #989 - Add custom delay between requests (prevent ban)
Addresses #367 - 429 Too Many Requests
Addresses #1593 - The cache is ineffective with the default concurrency
Addresses #1815 - All duplicates should get removed, and no link ever gets checked more than once
Key Features
CLI Changes
--host-statsflag to display per-host statistics at the end of runs--default-host-concurrencyoption to set per-host concurrency limits--default-request-intervaloption to set minimum intervals between requestsConfiguration Options
CLI Usage
Config File Support (
lychee.toml)Breaking Changes
Other Changes
governor,humantime-serdePerformance Impact
Examples
Host Statistics Output
JSON Output Format
{ "host_statistics": { "github.com": { "total_requests": 45, "successful_requests": 43, "success_rate": 95.6, "rate_limited": 1, "client_errors": 1, "server_errors": 0, "median_request_time_ms": 320, "cache_hits": 5, "cache_misses": 35, "cache_hit_rate": 12.5 } } }