@gregcusack

Problem

We previously added a metric for tracking gossip push messages through the network in PR #32725. However, that metric does not account for Redundant Pulls.
Redundant Pull: A node receives a message via PullResponse and then receives the same message via Push.
Redundant Pulls prevent us from accurately calculating how well messages are propagating via Push.

Summary of Changes

Add a metric to report when a node receives a NEW message via PullResponse (gossip_crds_sample_pull).
Add a metric to report when a node receives a message via Push but fails to insert it (gossip_crds_sample_fail).

Identifying redundant Pulls:

  1. Get message signatures reported in gossip_crds_sample_pull
  2. Get message signatures reported in gossip_crds_sample_fail
  3. Take the intersection of signatures from (1) and (2)
  4. The resulting set contains every message that was first received via Pull and then received again via Push (i.e., a Redundant Pull)
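
The steps above can be sketched as a plain set intersection. This is only an illustration of the offline processing, not code from the PR: the function name `redundant_pulls` is hypothetical, signatures are modeled as strings, and both input sets are assumed to have been collected from the same node.

```rust
use std::collections::HashSet;

/// Hypothetical helper: given the signatures reported by
/// `gossip_crds_sample_pull` and `gossip_crds_sample_fail` for one node,
/// return the signatures that appear in both sets.
///
/// Assumption: signatures are exported as strings; the real metrics
/// pipeline may store them differently.
pub fn redundant_pulls(
    pull_sample: &HashSet<String>,
    push_fail_sample: &HashSet<String>,
) -> HashSet<String> {
    // Steps (3) and (4): a signature present in both sets was first inserted
    // from a PullResponse and later failed to insert when it arrived via Push.
    pull_sample
        .intersection(push_fail_sample)
        .cloned()
        .collect()
}
```

A signature that appears only in the fail set is a Push that failed to insert for some other reason (for example, a duplicate Push), so it is not counted.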

Simulation Results

In a 100-node simulation, I saw Redundant Pulls occur somewhat frequently. This suggests that Redundant Pulls may explain the discrepancy between the simulated Push coverage and the measured Push coverage.

Possible Issues

  1. This adds more metrics to an already heavily used metrics server. We could remove these metrics once we have the data we need.

@gregcusack gregcusack requested a review from behzadnouri March 6, 2024 16:59
@gregcusack gregcusack force-pushed the redundant-pull-metrics branch from 298964f to 0f4dcb9 Compare March 6, 2024 18:15
@behzadnouri

  1. Get message signatures reported in gossip_crds_sample_pull
  2. Get message signatures reported in gossip_crds_sample_fail
  3. Take the intersection of signatures from (1) and (2)
  4. The intersection of these sets results in all messages that were first received via Pull and then received via Push (aka a Redundant Pull)

So this requires a lot of offline processing.
I was thinking of something simpler and not restricted to sampled messages.

Basically, we can change num_push_dups here:
https://github.com/anza-xyz/agave/blob/adefcbbb4/gossip/src/crds.rs#L130-L131
to be just num_push. So

  • a newly inserted value from GossipRoute::PushMessage will initialize this value to 1.
  • a newly inserted value from GossipRoute::PullResponse will initialize this value to 0.
  • (have to think what to do about other GossipRoute::* cases).

(also have to accordingly update a lot of other places which use num_push_dups).

Then if you receive the same value again from GossipRoute::PushMessage:
https://github.com/anza-xyz/agave/blob/adefcbbb4/gossip/src/crds.rs#L304-L308
while num_push is zero, that indicates a redundant pull. We can then either add a metric to CrdsStats to record it, or return a new error to the caller of crds.insert so it can be handled at the call site.

I would suggest we hold this PR for now, implement the simpler approach above in a separate PR first, and see how that looks.
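
A minimal sketch of the num_push idea, assuming a toy table keyed by string. `ToyCrds`, `InsertOutcome`, and `num_redundant_pull` are hypothetical names for illustration; the real `Crds::insert` compares values and versions and handles other `GossipRoute` variants, none of which is modeled here.

```rust
use std::collections::HashMap;

/// Only the two routes discussed above are modeled (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GossipRoute {
    PushMessage,
    PullResponse,
}

/// Hypothetical result type; the real insert returns a Result with CrdsError.
#[derive(Debug, PartialEq, Eq)]
pub enum InsertOutcome {
    Inserted,
    DuplicatePush,
    DuplicatePull,
    RedundantPull,
}

/// Toy stand-in for the CRDS table: maps a value key to its num_push counter.
#[derive(Default)]
pub struct ToyCrds {
    num_push: HashMap<String, u8>,
    pub num_redundant_pull: u64,
}

impl ToyCrds {
    pub fn insert(&mut self, key: &str, route: GossipRoute) -> InsertOutcome {
        match self.num_push.get_mut(key) {
            None => {
                // New value: num_push starts at 1 for Push and 0 for Pull.
                let initial = match route {
                    GossipRoute::PushMessage => 1,
                    GossipRoute::PullResponse => 0,
                };
                self.num_push.insert(key.to_string(), initial);
                InsertOutcome::Inserted
            }
            Some(count) => {
                if route != GossipRoute::PushMessage {
                    return InsertOutcome::DuplicatePull;
                }
                // Same value pushed again: a zero counter means the value
                // originally arrived via PullResponse, i.e. a redundant pull.
                let outcome = if *count == 0 {
                    self.num_redundant_pull += 1;
                    InsertOutcome::RedundantPull
                } else {
                    InsertOutcome::DuplicatePush
                };
                *count = count.saturating_add(1);
                outcome
            }
        }
    }
}
```

In this sketch a value is counted as a redundant pull at most once, because the first Push after the Pull raises the counter above zero. Whether that is the desired accounting is a design choice left open here.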

@gregcusack
Author

I was thinking of something simpler and not restricted to sampled messages.

My only concern is that a simple test showed a lot of redundant pulls, so reporting all of them could be very heavy on the metrics server. But I can give it a shot, run some tests with your suggested changes, and see what we get. If it's too much, we can sample.

@gregcusack
Author

Closing in favor of PR: #139

@gregcusack gregcusack closed this Mar 7, 2024
OliverNChalk pushed a commit to OliverNChalk/agave that referenced this pull request Nov 11, 2025
Originally written by Andrew Fitzgerald <[email protected]> on Wed Aug
9 14:57:55 2023 -0700.

Previous version:

    commit 86a2b8f8aa19e606bd6396dfdbc6f35950b23ee9
    Author: Andrew Fitzgerald <[email protected]>
    Date:   Wed Aug 9 14:57:55 2023 -0700

        Spawn adversarial and normal banking stages (anza-xyz#113)

Rewritten to match the upstream scheduler code as of anza-xyz#5467 by Illia
Bobyr <[email protected]>.

This change includes all of the following changes:

---

Author: Illia Bobyr <[email protected]>
Date:   Mon Oct 2 14:03:46 2023 -0700

    adversary: test_scheduler => attack_scheduler (anza-xyz#175)

    We mostly talk about attacks, when we discuss the functionality this
    code supports.  Considering that we have a lot of other kinds of tests,
    it seems a bit clearer to use "attack" in this part of the code.

Author: Illia Bobyr <[email protected]>
Date:   Tue Oct 3 15:21:33 2023 -0700

    adversary: test_generators => transaction_generators (anza-xyz#178)

    We mostly talk about "attacks" rather than "tests" in this part of the
    code.  And even the main type in the `test_generators` module is called
    `TransactionGenerator`.

Author: kirill lykov <[email protected]>
Date:   Thu Feb 8 10:52:48 2024 +0100

    replay: atomicbool instead of singleton for dropping packets (anza-xyz#224)

    * use atomicbool instead of singleton to drop packets

    * add use for Ordering

    Co-authored-by: Illia Bobyr <[email protected]>
    Signed-off-by: kirill lykov <[email protected]>

    * rename drop_packets

    ---------

    Signed-off-by: kirill lykov <[email protected]>
    Co-authored-by: Illia Bobyr <[email protected]>

Author: Brennan <[email protected]>
Date:   Fri Mar 22 06:45:29 2024 -0700

    remove dead code (anza-xyz#298)

Author: Andrew Fitzgerald <[email protected]>
Date:   Tue Jul 16 14:49:59 2024 -0500

    AdversarialBankingStage: Remove warning (anza-xyz#370)

    Remove warning. Adjust names