Description
Currently, lychee faces challenges with rate limiting and cache effectiveness when checking links, particularly when dealing with multiple requests to the same hosts. This leads to several issues that need to be addressed:
Current Problems
- Multiple concurrent requests to the same host trigger rate limits (429 errors); see #989 ("Add custom delay inbetween requests (prevent ban)")
- The cache is ineffective with high concurrency due to race conditions; see #1593 (comment) ("The cache is ineffective with the default concurrency, for links in a website's theme")
- Global concurrency settings are too coarse-grained
- Different hosts have different rate limit requirements
- Headers are applied to all hosts, causing potential security issues; see #1298 ("Security: restrict custom HTTP request headers to specific URL patterns") and #1441 (comment) ("custom Header not sent")
Proposed Solution
We should implement a smart per-host rate limiting and caching system that would:
- Track rate limits per host using a concurrent HashMap:

```rust
use std::collections::HashMap;
use std::time::Duration;
use time::OffsetDateTime;

// Per-host settings, kept in a map keyed by host name,
// e.g. HashMap<String, HostConfig> behind a concurrency-safe wrapper.
struct HostConfig {
    rate_limit_reset: Option<OffsetDateTime>,
    request_delay: Option<Duration>,
    max_concurrent_requests: Option<u32>,
}
```

- Implement smarter caching:
- Maintain separate cache states per host
- Stretch goal: Add configuration options per host:

```sh
lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms
```

- Stretch goal II: Add support for per-host headers
The idea would be to maintain a per-host `HeaderMap`.
See #1297 for details.
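To make the proposal above concrete, here is a minimal sketch of per-host state tracking behind a shared map. The names (`HostState`, `HostLimiter`, `set_delay`, `delay_for`) are hypothetical, not part of lychee; a real implementation might prefer a concurrent map such as `DashMap` over `Mutex<HashMap<..>>` to reduce lock contention.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Duration;

// Hypothetical per-host state; fields mirror the HostConfig sketch above
// (rate_limit_reset omitted to keep this std-only).
#[derive(Default)]
struct HostState {
    request_delay: Option<Duration>,
    max_concurrent_requests: Option<u32>,
}

// Shared map keyed by host name.
struct HostLimiter {
    hosts: Mutex<HashMap<String, HostState>>,
}

impl HostLimiter {
    fn new() -> Self {
        Self { hosts: Mutex::new(HashMap::new()) }
    }

    // Record a delay learned for a host (e.g. after a 429 response).
    fn set_delay(&self, host: &str, delay: Duration) {
        let mut hosts = self.hosts.lock().unwrap();
        hosts.entry(host.to_string()).or_default().request_delay = Some(delay);
    }

    // Look up the delay to apply before the next request to this host.
    fn delay_for(&self, host: &str) -> Option<Duration> {
        self.hosts.lock().unwrap().get(host).and_then(|s| s.request_delay)
    }
}

fn main() {
    let limiter = HostLimiter::new();
    limiter.set_delay("github.com", Duration::from_millis(100));
    assert_eq!(limiter.delay_for("github.com"), Some(Duration::from_millis(100)));
    assert_eq!(limiter.delay_for("example.com"), None);
}
```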
Implementation Notes
- Use the existing `rate-limits` crate, which is mostly useful for APIs
- Handle 429 responses with proper backoff using response headers when available
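As a sketch of the backoff note above: RFC 9110 allows a 429 response to carry a `Retry-After` header with a delay in seconds (an HTTP-date form also exists, omitted here for brevity). The function name and the fallback parameters are illustrative assumptions, not lychee's actual behavior.

```rust
use std::time::Duration;

// Choose a backoff for a 429 response. Honor Retry-After (seconds form)
// when present; otherwise fall back to capped exponential backoff.
fn backoff_for_429(retry_after: Option<&str>, attempt: u32) -> Duration {
    match retry_after.and_then(|v| v.parse::<u64>().ok()) {
        // The server told us how long to wait: respect it.
        Some(secs) => Duration::from_secs(secs),
        // No usable header: 100ms * 2^attempt, capped at 2^6.
        None => Duration::from_millis(100 * 2u64.pow(attempt.min(6))),
    }
}

fn main() {
    assert_eq!(backoff_for_429(Some("30"), 0), Duration::from_secs(30));
    assert_eq!(backoff_for_429(None, 2), Duration::from_millis(400));
}
```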
Benefits
- Prevents IP bans from aggressive checking
- More efficient resource usage
- Better compliance with API rate limits
- Improved cache effectiveness: with a separate cache per host, there would be no cross-host synchronization issues
- Faster overall execution by avoiding unnecessary retries
Examples
```toml
[hosts."github.com"]
max_concurrent_requests = 10
request_delay = "100ms"
headers = { Authorization = "token ghp_xxxx", "User-Agent" = "my-bot" }

[hosts."api.example.com"]
max_concurrent_requests = 1
request_delay = "1s"
headers = { "X-API-Key" = "secret", Accept = "application/json" }
```

CLI usage example:
```sh
lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms
```

And when adding headers:
```sh
lychee \
  --max-concurrency-per-host github.com=10 \
  --delay-per-host github.com=100ms \
  --headers-per-host 'github.com=Authorization:token ghp_xxxx,User-Agent:my-bot'
```

This is just a proposal. I'm not 100% certain about the naming yet.
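Whatever the final option names, each of the proposed flags takes a `host=value` pair, so they could share one small parser. This is a hypothetical sketch, not lychee's actual argument handling:

```rust
// Parse a "host=value" CLI pair such as "github.com=10".
// Returns None for malformed input (missing '=' or empty parts).
fn parse_host_pair(arg: &str) -> Option<(&str, &str)> {
    let (host, value) = arg.split_once('=')?;
    if host.is_empty() || value.is_empty() {
        return None;
    }
    Some((host, value))
}

fn main() {
    assert_eq!(parse_host_pair("github.com=10"), Some(("github.com", "10")));
    assert_eq!(parse_host_pair("github.com=100ms"), Some(("github.com", "100ms")));
    assert_eq!(parse_host_pair("no-equals"), None);
}
```

The header flag's value (`Authorization:token ghp_xxxx,User-Agent:my-bot`) would then be split further on `,` and `:` per pair.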