Skip to content

Conversation

@apfitzge
Copy link

@apfitzge apfitzge commented Aug 7, 2025

Problem

Summary of Changes

  • Make DecisionMaker lock-free (and PohRecorder free!)
  • Allow mutation of shared working bank, tick, leader_first_tick for tests
    • allows us to remove PohRecorder from few tests completely. which also means no more blockstore, or temporary directory.

Fixes #

@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2025

Codecov Report

❌ Patch coverage is 77.87611% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.2%. Comparing base (5cad0f1) to head (6d1d104).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7372   +/-   ##
=======================================
  Coverage    83.2%    83.2%           
=======================================
  Files         800      800           
  Lines      362793   362735   -58     
=======================================
- Hits       302162   302134   -28     
+ Misses      60631    60601   -30     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@apfitzge apfitzge force-pushed the decision_maker_no_lock branch from e069858 to 6d1d104 Compare August 11, 2025 16:47
@apfitzge apfitzge marked this pull request as ready for review August 11, 2025 18:07
@apfitzge apfitzge requested review from bw-solana and jstarry August 11, 2025 18:07
BufferedPacketsDecision::ForwardAndHold
} else if let Some(first_leader_tick_height) = self.shared_leader_first_tick_height.load() {
let current_tick_height = self.shared_tick_height.load();
let ticks_until_leader = first_leader_tick_height.saturating_sub(current_tick_height);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we used to explicitly check that current_tick_height<=first_leader_tick_height. Are we intentionally removing this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope i was just being dumb when writing it, and apparently it was an untested case. I'll add the case to tests and fix. It SHOULD only take this path for a future tick_height, not one that we're past.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In writing this out, I think I made a mistake in #7077.

This comment: https://github.com/anza-xyz/agave/pull/7077/files#r2222813775

explicitly stated that would_be_leader_shortly_fn returns true if we are within 2 slots of leader OR are leader but don't have a bank yet.
So I think I probably broke that second part with #7077; now it'd actually potentially be dropping packets between banks (very bad!).

The change in this PR would fix the dropping between banks.
I need to look again now at how the internal first leader tick is updated - is it possible for it remain Some past tick_height for a long period?

Copy link
Author

@apfitzge apfitzge Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be recalculated each time we clear or reset poh bank. so this PR fixes the bug i introduced in #7077 wrt intra-bank dropping of packets.

i.e. current tick past first leader tick is now treated same as within 2 slots until leader.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think it works.

  1. Consume the packets if we have a bank.
  2. Hold the packets if we're within N ticks of being leader. This could include being the actual leader.
  3. And once we finish leader span, we should immediately set leader_first_tick_height to our next leader slot (and stop holding packets if it's far in the future)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in more detail what bug was introduced in #7077 ?

Copy link

@jstarry jstarry Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bw-solana did you mean to say the "last" leader tick height? I see a check for

self.tick_height <= self.leader_last_tick_height

but not for

current_tick_height<=first_leader_tick_height

AFAICT there was no bug introduced in #7077 because before these lock-free decision maker changes, we weren't accessing the tick height before the leader first tick height being updated. So either the first tick height is none or set to the next leader slot's tick. So would_be_leader should always be correct, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should learn to read one of these days. You're correct. it checks against the last leader tick directly, so no bug.

Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

BufferedPacketsDecision::ForwardAndHold
} else if let Some(first_leader_tick_height) = self.shared_leader_first_tick_height.load() {
let current_tick_height = self.shared_tick_height.load();
let ticks_until_leader = first_leader_tick_height.saturating_sub(current_tick_height);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think it works.

  1. Consume the packets if we have a bank.
  2. Hold the packets if we're within N ticks of being leader. This could include being the actual leader.
  3. And once we finish leader span, we should immediately set leader_first_tick_height to our next leader slot (and stop holding packets if it's far in the future)

@apfitzge apfitzge merged commit 6c26728 into anza-xyz:master Aug 13, 2025
40 checks passed
@apfitzge apfitzge deleted the decision_maker_no_lock branch August 13, 2025 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants