Skip to content

Conversation

@gregcusack
Copy link

Problem

We had previously added in a metric for tracking gossip push messages through the network in PR: #32725. However, this metric does not account for redundant pull requests.
Redundant Pull: A node receives a message via PullResponse and then receives the same message via Push.
Redundant Pulls prevent us from accurately calculating how well messages are propagating via Push.

Summary of Changes

Add in a metric to report when we receive a Push for a message we already (and first) received via PullResponse
Modify VersionedCrdsValue.

  • num_push_dups serves the same basic purpose.
  • However, now if num_push_dups is set to u8::MAX that means that this VersionedCrdsValue was first received via a PullResponse

Identifying redundant Pulls:

  1. Receive a message via PullRequest that successfully updates crds.table, set the num_push_dups of this message to u8::MAX
  2. Receive the same message again but via Push, it will fail to insert. Since the already existing entry has num_push_dups == u8::MAX, we know this is a Redundant Pull.
  3. Push gossip_crds_redundant_pull to metrics.
  4. Reset num_push_dups to 0
  5. Increment num_push_dups (we will do this last step for every duplicate push)

Possible Issues

  1. Adding in more metrics to an already heavily used metrics server. Could possibly remove these metrics once we get data we need.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 81.8%. Comparing base (adefcbb) to head (215a465).
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #139     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         838      838             
  Lines      225913   225926     +13     
=========================================
+ Hits       184935   184937      +2     
- Misses      40978    40989     +11     

Comment on lines +153 to +155
// set num_push_dups to 2^8 to indicate PullResponse
// unlikely a node receives the exact same message via push 2^8 times
GossipRoute::PullResponse => u8::MAX,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kind of sentinel values are pretty risky and can introduce subtle bugs if someone later introduces a change unaware of this u8::MAX special casing.
I am more inclined if we just change num_push_dups to num_push_recv (or something like that), and simply count how many times this value was received through the push path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so when ReceivedCache uses num_push_recv instead of num_push_dups, should we then look at num_push_recv - 1 in ReceivedCacheEntry::record():

if num_dups < Self::NUM_DUPS_THRESHOLD {
?

Comment on lines +313 to +324
datapoint_info!(
"gossip_crds_redundant_pull",
(
"origin",
value.value.pubkey().to_string().get(..8),
Option<String>
),
(
"signature",
value.value.signature.to_string().get(..8),
Option<String>
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably would be too many metrics.
Can we just increment a counter in self.stats and periodically submit an aggregate metric like the rest of self.stats?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya spun up a 100 node test with this code and this was reporting a ton of data.
And ya probably. let me think how we can measure the overall push coverage with an aggregate redundant pull metric. That would also avoid the bad compression trent mentioned: #139 (comment)

Comment on lines +314 to +324
"gossip_crds_redundant_pull",
(
"origin",
value.value.pubkey().to_string().get(..8),
Option<String>
),
(
"signature",
value.value.signature.to_string().get(..8),
Option<String>
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these effectively random strings do not compress well in the metrics db and are difficult to query. i'd highly recommend rethinking the accounting to get the information you need without them

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not think about this. good to know. Will try to do something like what behzad mentioned: #139 (comment)

@gregcusack
Copy link
Author

closing in favor of: #199

@gregcusack gregcusack closed this Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants