Reprovide Sweep #1095
Merged
Commits
* SweepingProvider interface
* updated interface to address reviews
* LastProvideAt
* ProvideStatus
* Update provider/provider.go (Co-authored-by: Marcin Rataj <[email protected]>)
* renamed `ts` to `lastProvide`
* add TODO for missing implementation
* provider: adding provide and reprovide queue
* provider: network operations
* add some tests
* schedule prefix len computations
* provider schedule
* provider: handleProvide
* use go-test/random
* satisfy linter
* log errors during initial prefix len measurement
* close avgPrefixLenReady to signal initialPrefixLen measurement is done
* simplify unscheduleSubsumedPrefixesNoClock
* moved maxPrefixSize const to top
* refactor and test groupAndScheduleKeysByPrefix
* provider: explore swarm
* decrease minimal region size from replicationFactor+1 to replicationFactor
* provider: batch provide
* provider: batch reprovide
* fix panic when adding key to trie if superstring already exists
* fix test to match region size (now: replicationFactor, before: replicationFactor+1)
* provider: catchup pending work
* dequeue outside of go routine
* provider: options
* don't allocate capacity to avgPrefixLenReady
* provide: handle reprovide
* provider: daemon
* cancel context of external functions + tests
* close connectivity
* optimise expensive calls to trie.Size()
* provider: integration tests
* fix tests
* fix waitgroup
* provider: refresh schedule
* dual: provider
* fix: flaky TestStartProvidingUnstableNetwork
* provider: minor fixes
* adjusting for kubo PR
* update ConnectivityChecker
* ai tests
* update connectivity state machine
* remove useless connectivity funcs
* connectivity: get rid of internal state
* docs and tests
* fix(dual/provider): don't prevent providing if a DHT returns an error
* keystore: revamp
* fix: KeyStore.Close dependencies
* keystore: implement KeyStore4
* provider: ResettableKeyStore
* keystore: remove mutex
* use datastore namespace
* don't sync to write to altDs
* simplify put
* deduplicate operation execution code
* buffered provider
* tests
* removing redundant code
* docs
* wait on empty queue
* fix flaky test
* addressing gammazero review
* address review
lidel approved these changes Sep 18, 2025
Supersedes #1082
Reprovide Sweep
Problem
Reproviding many keys to the DHT one by one is inefficient, because it requires a `GetClosestPeers` (or `GCP`) request for every key.

Current state
Currently, reprovides are managed in `boxo/provider`. Every `ReprovideInterval` (22h in Amino DHT), all keys matching the reprovide strategy are reprovided at once. The process is slightly different depending on whether the accelerated DHT client is enabled.

Default DHT client
All the keys are reprovided sequentially, using the `go-libp2p-kad-dht` `Provide()` method. This operation consists of finding the `k` closest peers to the given key, and then requesting them all to store the associated provider record.

The process is expensive because it requires a `GCP` for each key (opening approx. 20-30 connections). Timeouts due to unreachable peers make this process very long, resulting in a mean provide time of ~10s (source: probelab.io 2025-06-13). At 10 seconds per provide, a single-threaded node can reprovide fewer than 8,000 keys over the 22h reprovide interval (22 × 3600 s / 10 s ≈ 7,920 keys).
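For illustration, a minimal sketch of this per-key pattern (`Provide()` is the real `go-libp2p-kad-dht` method; the surrounding loop is a hypothetical caller, not code from this PR):

```go
package example

import (
	"context"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// reprovideSequentially illustrates the current per-key pattern: each
// Provide call performs its own GetClosestPeers walk (one GCP per key),
// which is what makes sequential reproviding slow (~10s per key on average).
func reprovideSequentially(ctx context.Context, d *dht.IpfsDHT, keys []cid.Cid) error {
	for _, c := range keys {
		// Provide finds the k closest peers to c and asks each of them
		// to store the provider record.
		if err := d.Provide(ctx, c, true); err != nil {
			return err
		}
	}
	return nil
}
```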
Accelerated DHT client (`fullrt`)

The accelerated DHT client periodically (every 1h) crawls the DHT swarm to cache the addresses of all discovered peers. This allows it to skip the `GCP` during the provide request, since it already knows the `k` closest peers and their multiaddrs.

Hence, the accelerated DHT client is able to provide many more keys during the reprovide interval than the default DHT client. However, crawling the DHT swarm is an expensive operation (networking, memory), and since all the keys are reprovided at once, the node experiences a burst period until all keys are reprovided.
Ideally, nodes wouldn't have to crawl the swarm to reprovide content, and the reprovide operation could be smoothed over time to avoid a burst during which the libp2p node is incapable of performing other actions.
Pooling Reprovides
If there are more keys to reprovide than the number of nodes in the DHT swarm divided by the replication factor (`k`), then at least two keys will be provided to the exact same set of peers. This means that the number of `GCP`s is less than the number of keys to reprovide.

For the Amino DHT, containing ~10k DHT servers with a replication factor of 20, pooling reprovides becomes efficient starting from 10,000 / 20 = 500 keys.
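The break-even point from the paragraph above, as a small sketch (the constants are the Amino DHT figures quoted in this section):

```go
package example

import "fmt"

// PoolingThreshold computes the key count above which pooling reprovides
// pays off: past networkSize/k keys, at least two keys share the exact same
// set of k closest peers, so one region exploration can serve several keys.
func PoolingThreshold(networkSize, replicationFactor int) int {
	return networkSize / replicationFactor
}

func Example() {
	// Amino DHT: ~10k DHT servers, k = 20.
	fmt.Println(PoolingThreshold(10_000, 20)) // prints 500
}
```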
Reprovide Sweep
The current process of reproviding all keys at once is bad because it creates a burst. To smooth the reprovide process, we can sweep the keyspace from left to right, covering all peers over time. This consists of exploring keyspace regions, each corresponding to a set of peers that are close to each other under the Kademlia XOR distance metric.
A keyspace region is explored using a few requests (typically 2-4 `GCP`s) to discover all the peers it contains. A keyspace region can be identified by a Kademlia identifier prefix: the Kademlia identifiers of all peers within the region start with the region's prefix.
Once a region is fully explored, all the keys matching the region's prefix can be allocated to this set of peers; no additional `GCP` is needed. The sketch below illustrates the idea.
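The names below (`Key`, `exploreRegion`, `sendProviderRecord`) are hypothetical placeholders, not the API introduced by this PR; this is only a sketch of the sweep idea under those assumptions:

```go
package example

import (
	"context"
	"strings"
)

type Key struct {
	KadID string // binary Kademlia identifier, e.g. "0101..."
	CID   string
}

// exploreRegion stands in for the 2-4 GetClosestPeers queries that discover
// all peers whose Kademlia identifier starts with prefix.
func exploreRegion(ctx context.Context, prefix string) ([]string, error) {
	return nil, nil // stub
}

// sendProviderRecord stands in for the ADD_PROVIDER RPC to a single peer.
func sendProviderRecord(ctx context.Context, peer, cid string) error {
	return nil // stub
}

// sweepRegion provides all keys falling in one keyspace region using a
// single exploration instead of one GCP per key.
func sweepRegion(ctx context.Context, prefix string, keys []Key) error {
	peers, err := exploreRegion(ctx, prefix) // a few GCPs, once per region
	if err != nil {
		return err
	}
	for _, k := range keys {
		if !strings.HasPrefix(k.KadID, prefix) {
			continue // key belongs to another region of the sweep
		}
		// The k closest peers to this key are already known: no per-key GCP.
		for _, p := range peers {
			if err := sendProviderRecord(ctx, p, k.CID); err != nil {
				return err
			}
		}
	}
	return nil
}
```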
Pull Requests
Done
helpers, so not renamed in all PRs) -> provider: helpers package rename #1111

Ready for review (PRs must be merged in this order)
Ready for review (nice-to-have)
Issues
Optional improvements
* `ipfs provide status <cid>`
* `ProviderQueue` and `ReproviderQueue` size
* `provider` package readme

Admin
Depends on:
Needs a new release of:
Closes #824
Part of ipshipyard/roadmaps#6, ipshipyard/roadmaps#7, ipshipyard/roadmaps#8