add in metrics for detecting redundant pulls #139

gregcusack · 2024-03-07T21:43:00Z

Problem

We had previously added in a metric for tracking gossip push messages through the network in PR: #32725. However, this metric does not account for redundant pull requests.
Redundant Pull: A node receives a message via PullResponse and then receives the same message via Push.
Redundant Pulls prevent us from accurately calculating how well messages are propagating via Push.

Summary of Changes

Add in a metric to report when we receive a Push for a message we already (and first) received via PullResponse
Modify VersionedCrdsValue.

num_push_dups serves the same basic purpose.
However, now if num_push_dups is set to u8::MAX that means that this VersionedCrdsValue was first received via a PullResponse

Identifying redundant Pulls:

Receive a message via PullRequest that successfully updates crds.table, set the num_push_dups of this message to u8::MAX
Receive the same message again but via Push, it will fail to insert. Since the already existing entry has num_push_dups == u8::MAX, we know this is a Redundant Pull.
Push gossip_crds_redundant_pull to metrics.
Reset num_push_dups to 0
Increment num_push_dups (we will do this last step for every duplicate push)

Possible Issues

Adding in more metrics to an already heavily used metrics server. Could possibly remove these metrics once we get data we need.

codecov-commenter · 2024-03-07T23:07:57Z

Codecov Report

Attention: Patch coverage is 88.23529% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 81.8%. Comparing base (adefcbb) to head (215a465).
Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##           master     #139     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         838      838             
  Lines      225913   225926     +13     
=========================================
+ Hits       184935   184937      +2     
- Misses      40978    40989     +11

behzadnouri · 2024-03-08T16:45:43Z

gossip/src/crds.rs

+            // set num_push_dups to 2^8 to indicate PullResponse
+            // unlikely a node receives the exact same message via push 2^8 times
+            GossipRoute::PullResponse => u8::MAX,


These kind of sentinel values are pretty risky and can introduce subtle bugs if someone later introduces a change unaware of this u8::MAX special casing.
I am more inclined if we just change num_push_dups to num_push_recv (or something like that), and simply count how many times this value was received through the push path.

so when ReceivedCache uses num_push_recv instead of num_push_dups, should we then look at num_push_recv - 1 in ReceivedCacheEntry::record():

agave/gossip/src/received_cache.rs

Line 81 in 3863bb1

if num_dups < Self::NUM_DUPS_THRESHOLD {

?

behzadnouri · 2024-03-08T16:49:45Z

gossip/src/crds.rs

+                        datapoint_info!(
+                            "gossip_crds_redundant_pull",
+                            (
+                                "origin",
+                                value.value.pubkey().to_string().get(..8),
+                                Option<String>
+                            ),
+                            (
+                                "signature",
+                                value.value.signature.to_string().get(..8),
+                                Option<String>
+                            )


This probably would be too many metrics.
Can we just increment a counter in self.stats and periodically submit an aggregate metric like the rest of self.stats?

ya spun up a 100 node test with this code and this was reporting a ton of data.
And ya probably. let me think how we can measure the overall push coverage with an aggregate redundant pull metric. That would also avoid the bad compression trent mentioned: #139 (comment)

t-nelson · 2024-03-08T19:20:59Z

gossip/src/crds.rs

+                            "gossip_crds_redundant_pull",
+                            (
+                                "origin",
+                                value.value.pubkey().to_string().get(..8),
+                                Option<String>
+                            ),
+                            (
+                                "signature",
+                                value.value.signature.to_string().get(..8),
+                                Option<String>
+                            )


these effectively random strings do not compress well in the metrics db and are difficult to query. i'd highly recommend rethinking the accounting to get the information you need without them

did not think about this. good to know. Will try to do something like what behzad mentioned: #139 (comment)

gregcusack · 2024-03-12T03:06:41Z

closing in favor of: #199

add in metric for tracking redundant pull messages

215a465

gregcusack mentioned this pull request Mar 7, 2024

add in metrics for detecting redundant pulls #113

Closed

behzadnouri reviewed Mar 8, 2024

View reviewed changes

t-nelson reviewed Mar 8, 2024

View reviewed changes

gregcusack closed this Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add in metrics for detecting redundant pulls #139

add in metrics for detecting redundant pulls #139

Uh oh!

gregcusack commented Mar 7, 2024

Uh oh!

codecov-commenter commented Mar 7, 2024

Uh oh!

behzadnouri Mar 8, 2024

Uh oh!

gregcusack Mar 11, 2024

Uh oh!

behzadnouri Mar 8, 2024

Uh oh!

gregcusack Mar 11, 2024

Uh oh!

t-nelson Mar 8, 2024

Uh oh!

gregcusack Mar 11, 2024

Uh oh!

gregcusack commented Mar 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

add in metrics for detecting redundant pulls #139

add in metrics for detecting redundant pulls #139

Uh oh!

Conversation

gregcusack commented Mar 7, 2024

Problem

Summary of Changes

Identifying redundant Pulls:

Possible Issues

Uh oh!

codecov-commenter commented Mar 7, 2024

Codecov Report

Uh oh!

behzadnouri Mar 8, 2024

Choose a reason for hiding this comment

Uh oh!

gregcusack Mar 11, 2024

Choose a reason for hiding this comment

Uh oh!

behzadnouri Mar 8, 2024

Choose a reason for hiding this comment

Uh oh!

gregcusack Mar 11, 2024

Choose a reason for hiding this comment

Uh oh!

t-nelson Mar 8, 2024

Choose a reason for hiding this comment

Uh oh!

gregcusack Mar 11, 2024

Choose a reason for hiding this comment

Uh oh!

gregcusack commented Mar 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants