add in metrics for detecting redundant pulls #139
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #139 +/- ##
=========================================
- Coverage 81.8% 81.8% -0.1%
=========================================
Files 838 838
Lines 225913 225926 +13
=========================================
+ Hits 184935 184937 +2
- Misses 40978 40989 +11
    // set num_push_dups to u8::MAX (2^8 - 1) to indicate PullResponse
    // unlikely a node receives the exact same message via push 2^8 - 1 times
    GossipRoute::PullResponse => u8::MAX,
These kinds of sentinel values are pretty risky and can introduce subtle bugs if someone later makes a change unaware of this u8::MAX special casing.
I would rather we just change num_push_dups to num_push_recv (or something like that), and simply count how many times this value was received through the push path.
so when ReceivedCache uses num_push_recv instead of num_push_dups, should we then look at num_push_recv - 1 in ReceivedCacheEntry::record():
agave/gossip/src/received_cache.rs
Line 81 in 3863bb1
    if num_dups < Self::NUM_DUPS_THRESHOLD {
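To make the num_push_recv idea concrete, here is a minimal sketch of the bookkeeping. It assumes a value first received via PullResponse starts with a push count of 0, so the first later push marks a redundant pull, and the duplicate count is derived as num_push_recv - 1. `Entry`, `Route`, `record_push` and `num_push_dups()` are hypothetical names for illustration only, not the agave types.

```rust
/// Hypothetical route enum; stands in for the real gossip route type.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Route {
    PullResponse,
    Push,
}

/// Hypothetical per-value bookkeeping; not the agave struct.
#[allow(dead_code)]
#[derive(Debug)]
struct Entry {
    /// Number of times this value arrived via push (saturating at u8::MAX).
    num_push_recv: u8,
}

#[allow(dead_code)]
impl Entry {
    /// Create the entry when the value is first inserted.
    fn new(route: Route) -> Self {
        let num_push_recv = match route {
            Route::Push => 1,
            Route::PullResponse => 0,
        };
        Self { num_push_recv }
    }

    /// Record a later push of the same value.
    /// Returns true only for the first push of a value that was first
    /// received via pull, i.e. a redundant pull under the assumption above.
    fn record_push(&mut self) -> bool {
        let first_push_after_pull = self.num_push_recv == 0;
        self.num_push_recv = self.num_push_recv.saturating_add(1);
        first_push_after_pull
    }

    /// Duplicate pushes are the pushes beyond the first one.
    fn num_push_dups(&self) -> u8 {
        self.num_push_recv.saturating_sub(1)
    }
}
```

With this shape no sentinel is needed: a caller would compare `num_push_dups()` against the threshold, and the saturating arithmetic keeps the counter from wrapping.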
    datapoint_info!(
        "gossip_crds_redundant_pull",
        (
            "origin",
            value.value.pubkey().to_string().get(..8),
            Option<String>
        ),
        (
            "signature",
            value.value.signature.to_string().get(..8),
            Option<String>
        )
This probably would be too many metrics.
Can we just increment a counter in self.stats and periodically submit an aggregate metric like the rest of self.stats?
ya spun up a 100 node test with this code and this was reporting a ton of data.
And ya probably. let me think how we can measure the overall push coverage with an aggregate redundant pull metric. That would also avoid the bad compression trent mentioned: #139 (comment)
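A rough sketch of what the aggregate approach could look like: bump an atomic counter on each redundant pull and drain it when stats are periodically submitted. `StatsSketch` and its methods are hypothetical stand-ins, not the actual gossip stats struct, and the reporting call itself is left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the gossip stats struct.
#[derive(Default)]
struct StatsSketch {
    /// Redundant pulls observed since the last report.
    redundant_pull_count: AtomicU64,
}

#[allow(dead_code)]
impl StatsSketch {
    /// Called on the hot path whenever a redundant pull is detected.
    fn record_redundant_pull(&self) {
        self.redundant_pull_count.fetch_add(1, Ordering::Relaxed);
    }

    /// Called by the periodic reporter: returns the count accumulated
    /// since the previous call and resets the counter to zero.
    fn take_redundant_pull_count(&self) -> u64 {
        self.redundant_pull_count.swap(0, Ordering::Relaxed)
    }
}
```

This yields one numeric field per reporting interval instead of one datapoint per message, so no per-message origin or signature strings reach the metrics db.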
        "gossip_crds_redundant_pull",
        (
            "origin",
            value.value.pubkey().to_string().get(..8),
            Option<String>
        ),
        (
            "signature",
            value.value.signature.to_string().get(..8),
            Option<String>
        )
these effectively random strings do not compress well in the metrics db and are difficult to query. i'd highly recommend rethinking the accounting to get the information you need without them
did not think about this. good to know. Will try to do something like what behzad mentioned: #139 (comment)
closing in favor of: #199
Problem
We previously added a metric for tracking gossip push messages through the network in PR #32725. However, that metric does not account for redundant pulls.
Redundant Pull: A node receives a message via PullResponse and then receives the same message via Push.
Redundant Pulls prevent us from accurately calculating how well messages are propagating via Push.

Summary of Changes
- Add a metric to report when we receive a Push for a message we already (and first) received via PullResponse.
- Modify VersionedCrdsValue:
  - num_push_dups serves the same basic purpose.
  - If num_push_dups is set to u8::MAX, that means this VersionedCrdsValue was first received via a PullResponse.

Identifying redundant Pulls:
1. When a PullResponse successfully updates crds.table, set the num_push_dups of this message to u8::MAX.
2. When the same message later arrives via Push, it will fail to insert. Since the already existing entry has num_push_dups == u8::MAX, we know this is a Redundant Pull.
3. Report gossip_crds_redundant_pull to metrics.
4. Reset num_push_dups to 0.
5. Increment num_push_dups (we will do this last step for every duplicate push).

Possible Issues