@apfitzge

Problem

  • DecisionMaker currently takes a Pubkey, which should be the ID of the node
  • Since it only takes the pubkey at creation, this will not actually be the leader/staked pubkey for many operating nodes
    • this means that even if we were leader, the pubkey comparison in DecisionMaker would likely fail
  • DecisionMaker only looks a very short distance into the future: 2 slots. We should never be in a situation where there is no EpochSchedule in the cache for that slot
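
To illustrate the stale-identity problem described above, here is a minimal sketch. It is not the actual DecisionMaker code: `Pubkey` is a stand-in type, and `SnapshotIdentity` and `recognizes_leader_after_identity_switch` are hypothetical names invented for this example. It assumes a failover-style startup in which the node begins with a dummy key and switches to its staked key later.

```rust
/// Stand-in for a real public key type; hypothetical, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pubkey(pub [u8; 32]);

/// Hypothetical simplification of a component that copies the node identity
/// once, at construction time, and never refreshes it.
pub struct SnapshotIdentity {
    my_pubkey: Pubkey,
}

impl SnapshotIdentity {
    pub fn new(my_pubkey: Pubkey) -> Self {
        Self { my_pubkey }
    }

    /// Compares the scheduled leader against the identity captured at creation.
    pub fn thinks_it_is_leader(&self, scheduled_leader: Pubkey) -> bool {
        scheduled_leader == self.my_pubkey
    }
}

/// Simulates a failover-style startup: the component is built while the node
/// runs with a dummy key, then the operator switches to the staked key.
/// Returns whether the component still recognizes the node as leader.
pub fn recognizes_leader_after_identity_switch() -> bool {
    let dummy_key = Pubkey([0; 32]);
    let staked_key = Pubkey([1; 32]);

    // Created at startup, while the node still uses the dummy key.
    let component = SnapshotIdentity::new(dummy_key);

    // Later, the node's identity changes to the staked key.
    let current_identity = staked_key;

    // The leader schedule names the node's current (staked) identity,
    // but the component compares against the stale dummy key.
    component.thinks_it_is_leader(current_identity)
}
```

In this sketch the comparison fails even though the node is the scheduled leader, which is the failure mode the bullets above describe.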

Summary of Changes

  • Remove pubkey from DecisionMaker
    • always take the path that was previously taken on a pubkey mismatch
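
A rough sketch of how the decision logic might look once the pubkey is removed. This is a simplification under stated assumptions, not the merged code: `Decision` stands in for BufferedPacketsDecision, `has_bank` replaces the real bank lookup, `decide` is a hypothetical function name, and the assumption that the pubkey-mismatch path results in `Forward` is a reading of the discussion in this PR.

```rust
/// Simplified stand-in for BufferedPacketsDecision (hypothetical; the real
/// Consume variant carries bank information).
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Consume,
    Hold,
    ForwardAndHold,
    Forward,
}

/// Sketch of the decision without any pubkey comparison.
/// The closures are only evaluated when an earlier condition did not match.
pub fn decide(
    has_bank: bool,
    would_be_leader_shortly_fn: impl FnOnce() -> bool,
    would_be_leader_fn: impl FnOnce() -> bool,
) -> Decision {
    if has_bank {
        // A bank is available, so this node is currently leader.
        Decision::Consume
    } else if would_be_leader_shortly_fn() {
        // Leader very soon (or leader without a bank yet): hold.
        Decision::Hold
    } else if would_be_leader_fn() {
        // Leader within a longer window: forward, but keep a copy.
        Decision::ForwardAndHold
    } else {
        // Assumed: the path previously taken on a pubkey mismatch.
        Decision::Forward
    }
}
```

For example, `decide(false, || false, || false)` yields `Decision::Forward` without consulting any identity key.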

Fixes #

@apfitzge apfitzge force-pushed the decision_maker_remove_pubkey branch from 0b103d3 to 3e77501 Compare July 22, 2025 12:44
@codecov-commenter

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.2%. Comparing base (5222e94) to head (3e77501).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7077   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         854      854           
  Lines      374686   374613   -73     
=======================================
- Hits       312083   312025   -58     
+ Misses      62603    62588   -15     

@apfitzge apfitzge marked this pull request as ready for review July 22, 2025 14:01
@apfitzge apfitzge requested review from bw-solana and tao-stones July 22, 2025 14:01

@bw-solana bw-solana left a comment

LGTM, but I'm too stupid to understand why we ever needed the old code, so I want to make sure I'm not missing something

} else if would_be_leader_fn() {
    // Node will be leader within ~20 slots, hold the transactions in
    // case it is the only node which produces an accepted slot.
    BufferedPacketsDecision::ForwardAndHold

I'm not understanding the case where we would even hit the Hold case for the old logic.

  1. We don't know the leader (because we don't have the leader schedule). But in reality we generate the schedule so far in advance relative to how far ahead we look that this seems impossible outside of weird, contrived test cases.
  2. We somehow don't think we will be leader within 20 slots based on ticks, but also think we will be leader in 2 slots based on the leader schedule. Is there even a corner case where this could happen?

Author

To be clear, we hit Hold very often. But there are 3 different cases in which the old logic could have resulted in Hold:

  1. would_be_leader_shortly_fn returns true:
    • This is the one we hit a lot.
    • We are within 2 slots of being leader
    • This also includes when we are currently leader, but don't yet have a Bank to operate on.
  2. leader_pubkey_fn returns Some(x) where x IS the node's ID pubkey (at startup)
    • We search 2 slots into the future in the cached leader schedule. If it IS us, then we return Hold. BUT this can only happen if our POH has a view of who is leader in slot N+2 that is inconsistent with the leader-schedule cache.
    • In practice the if check doesn't even work. Many operators have failover set up so that when they are starting up they are in a non-voting mode with a dummy key. Since we grab the key at startup, we're comparing against the wrong key in this case anyway.
  3. leader_pubkey_fn returns None
    • this shouldn't ever happen because we generate leader schedules ahead of time and only look a small offset into the future (2 slots)
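
The three Hold cases listed above can be sketched as follows. This is a simplified reconstruction from the description in this comment, not a copy of the old code: `Pubkey` and `Decision` are stand-in types, `has_bank` replaces the real bank lookup, `old_decide` is a hypothetical name, and the exact branch order is an assumption.

```rust
/// Stand-in for a real public key type; hypothetical, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pubkey(pub [u8; 32]);

/// Simplified stand-in for BufferedPacketsDecision.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Consume,
    Hold,
    ForwardAndHold,
    Forward,
}

/// Sketch of the old logic, including the pubkey comparison.
pub fn old_decide(
    my_pubkey: &Pubkey,
    has_bank: bool,
    would_be_leader_shortly_fn: impl FnOnce() -> bool,
    would_be_leader_fn: impl FnOnce() -> bool,
    leader_pubkey_fn: impl FnOnce() -> Option<Pubkey>,
) -> Decision {
    if has_bank {
        Decision::Consume
    } else if would_be_leader_shortly_fn() {
        // Case 1: leader very soon, or leader without a bank yet.
        Decision::Hold
    } else if would_be_leader_fn() {
        Decision::ForwardAndHold
    } else if let Some(leader) = leader_pubkey_fn() {
        if leader == *my_pubkey {
            // Case 2: schedule says it is us, compared against the
            // key captured at startup.
            Decision::Hold
        } else {
            Decision::Forward
        }
    } else {
        // Case 3: leader unknown.
        Decision::Hold
    }
}
```

Only case 1 depends on the node's view of its own upcoming leader slots; cases 2 and 3 both sit behind the leader_pubkey_fn lookup.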


@bw-solana bw-solana Jul 22, 2025

yeah, sorry, 1 makes perfect sense. I was referring to the cases we're removing (2 and 3)

Author

Right, and to summarize my comment - 2 and 3 should NOT happen unless there are other bugs, to the best of my knowledge.


@tao-stones tao-stones left a comment

looks good, thanks for the explanations.

@apfitzge
Author

This is non-urgent, gonna wait for @jstarry's review since he is good at finding edge-cases in my understanding of poh.

@apfitzge apfitzge requested a review from jstarry July 22, 2025 16:49

@jstarry jstarry left a comment

lgtm

@apfitzge apfitzge merged commit eba01dd into anza-xyz:master Jul 25, 2025
41 checks passed