@mergify mergify bot commented Apr 18, 2024

Problem

I've observed that many skipped slots occur when a leader "A" tries to skip the previous leader "B" while producing its blocks, even though leader B had already started broadcasting shreds for its blocks. In this case, the forks of leaders A and B race each other, and leader B's fork will almost certainly be confirmed, because its block will likely be finished and replayed by the cluster before leader A's block, which has only just started being produced.

Summary of Changes

Currently when leader A skips all of leader B's blocks, it doesn't use grace ticks when deciding when to start building its block. After this PR, when a leader skips all of the previous leader's blocks and has ticked to its leader slot (without grace ticks) it will first check if it has received any shreds for a potential new reset bank. If it has received some shreds, it will apply grace ticks to wait a bit longer to allow time for the corresponding bank for those shreds to be frozen (or marked dead). If it hasn't received any shreds, it can go ahead with producing its block without waiting for grace ticks as before.

  • Added a new hidden validator CLI arg --delay-leader-block-for-pending-fork, which allows validators to opt into this new behavior.
  • Added a new metric poh_recorder-detected_pending_fork, which reports when a pending fork was detected and whether or not the validator yielded.
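
The decision described above can be sketched roughly as follows. This is only an illustration of the rule for a leader that has skipped all of the previous leader's blocks, not the code from this PR: the function name, parameter names, and tick values are hypothetical, and the real logic lives in poh/src/poh_recorder.rs.

```rust
/// Hypothetical sketch: decide whether a leader that skipped all of the
/// previous leader's blocks should start building its own block now.
/// All names here are illustrative assumptions, not the PoH recorder API.
fn should_start_block(
    current_tick: u64,
    leader_first_tick: u64,
    grace_ticks: u64,
    // Mirrors the opt-in behavior behind --delay-leader-block-for-pending-fork.
    delay_for_pending_fork: bool,
    // True if shreds for a potential new reset bank have been received.
    pending_fork_shreds_received: bool,
) -> bool {
    // The leader slot has not been reached yet: keep ticking.
    if current_tick < leader_first_tick {
        return false;
    }
    // A pending fork was detected and the validator opted in: wait out the
    // grace ticks so the pending bank has time to be frozen or marked dead.
    if delay_for_pending_fork && pending_fork_shreds_received {
        return current_tick >= leader_first_tick.saturating_add(grace_ticks);
    }
    // No pending fork (or the flag is off): produce without grace ticks,
    // which is the behavior before this change.
    true
}
```

With the flag enabled and shreds already received, the sketch holds the leader back until the grace ticks have elapsed; in every other case it behaves as before and starts as soon as the leader slot is reached.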

Fixes #


This is an automatic backport of pull request #794 done by [Mergify](https://mergify.com).

* Use poh grace ticks when new reset bank is pending

* feedback

* make it hidden

(cherry picked from commit 1c1b4c3)

# Conflicts:
#	core/src/validator.rs
#	local-cluster/src/validator_configs.rs
#	validator/src/main.rs
@mergify mergify bot added the conflicts label Apr 18, 2024

mergify bot commented Apr 18, 2024

Cherry-pick of 1c1b4c3 has failed:

On branch mergify/bp/v1.18/pr-794
Your branch is up to date with 'origin/v1.18'.

You are currently cherry-picking commit 1c1b4c3e28.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   core/src/replay_stage.rs
	modified:   poh/src/poh_recorder.rs
	modified:   validator/src/cli.rs

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   core/src/validator.rs
	both modified:   local-cluster/src/validator_configs.rs
	both modified:   validator/src/main.rs

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@jstarry jstarry requested review from AshwinSekar and carllin April 19, 2024 02:15
@codecov-commenter

Codecov Report

Attention: Patch coverage is 94.82759%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.6%. Comparing base (d7eb8f9) to head (a518a55).

Additional details and impacted files
@@           Coverage Diff            @@
##            v1.18     #886    +/-   ##
========================================
  Coverage    81.5%    81.6%            
========================================
  Files         827      827            
  Lines      224848   224903    +55     
========================================
+ Hits       183453   183565   +112     
+ Misses      41395    41338    -57     

jstarry commented May 29, 2024

@AshwinSekar @carllin I'm still waiting on a review here

@AshwinSekar AshwinSekar left a comment

LGTM, sorry for the delay - this slipped through my GitHub notifications.

jstarry commented May 30, 2024

v1.18 is basically prod now that mainnet beta is upgrading so I'm going to close this

@jstarry jstarry closed this May 30, 2024
@mergify mergify bot deleted the mergify/bp/v1.18/pr-794 branch May 30, 2024 23:09
@AshwinSekar

> v1.18 is basically prod now that mainnet beta is upgrading so I'm going to close this

If we want to speed up experimentation, I think it could be worth it to acquire a couple of staked canaries to run with this flag. That would give us a couple months of confidence so that we can make this the default choice in 2.0.0.

jstarry commented Jun 8, 2024

I've been running this on a staked node for months already and it works great

@AshwinSekar

> I've been running this on a staked node for months already and it works great

oh awesome. If that's the case, can we backport the metric only? That should give us enough data to do a comparison between 1.18 and your staked node.

If the comparison looks good we can make this the default choice in 2.0, no flag necessary.

apfitzge pushed a commit to apfitzge/agave that referenced this pull request Aug 26, 2025
only reroute if relayer connected (anza-xyz#123)
feat: add client tls config (anza-xyz#121)
remove extra val (anza-xyz#129)
fix clippy (anza-xyz#130)
copy all binaries to docker-output (anza-xyz#131)
Ledger tool halts at slot passed to create-snapshot (anza-xyz#118)
update program submodule (anza-xyz#133)
quick fix for tips and clearing old bundles (anza-xyz#135)
update submodule to new program (anza-xyz#136)
Improve stake-meta-generator usability (anza-xyz#134)
pinning submodule head (anza-xyz#140)
Use BundleAccountLocker when handling tip txs (anza-xyz#147)
Add metrics for relayer + block engine proxy (anza-xyz#149)
Build claim-mev in docker (anza-xyz#141)
Rework bundle receiving and add metrics (anza-xyz#152) (anza-xyz#154)
update submodule + dev files (anza-xyz#158)
Deterministically find tip amounts, add meta to stake info, and cleanup pubkey/strings in MEV tips (anza-xyz#159)
update jito-programs submodule (anza-xyz#160)
Separate MEV tip related workflow (anza-xyz#161)
Add block builder fee protos (anza-xyz#162)
fix jito programs (anza-xyz#163)
update submodule so autosnapshot exits out of ledger tool early (anza-xyz#164)
Pipe through block builder fee (anza-xyz#167)
pull in new snapshot code (anza-xyz#171)
block builder bug (anza-xyz#172)

Pull in new slack autosnapshot submodule (anza-xyz#174)

sort stake meta json and use int math (anza-xyz#176)

add accountsdb conn submod (anza-xyz#169)

Update tip distribution parameters (anza-xyz#177)

new submodules (anza-xyz#180)

Add buildkite link for jito CI (anza-xyz#183)

Fixed broken links to repositories (anza-xyz#184)

Changed from ssh to https transfer for clone

Seg/update submods (anza-xyz#187)

fix tests (anza-xyz#190)

rm geyser submod (anza-xyz#192)

rm dangling geyser references (anza-xyz#193)

fix syntax err (anza-xyz#195)

use deterministic req ids in batch calls (anza-xyz#199)

update jito-programs

revert cargo

update Cargo lock

update with path fix

fix cargo

update autosnapshot with block lookback (anza-xyz#201)

[JIT-460] When claiming mev tips, skip accounts that won't have min rent exempt amount after claiming (anza-xyz#203)

Add logging for sol balance desired (anza-xyz#205)

* add logging

* add logging

* update msg

* tweak vars

update submodule (anza-xyz#204)

use efficient data structures when calling batch_simulate_bundles (anza-xyz#206)

[JIT-504] Add low balance check in uploading merkle roots (anza-xyz#209)

add config to simulate on top of working bank (anza-xyz#211)

rm frozen bank check

simulate_bundle rpc bugfixes (anza-xyz#214)

rm frozen bank check in simulate_bundle rpc method

[JIT-519] Store ClaimStatus address in merkle-root-json (anza-xyz#210)

* add files

* switch to include bump

update submodule (anza-xyz#217)

add amount filter (anza-xyz#218)

update autosnapshot (anza-xyz#222)

Print TX error in Bundles (anza-xyz#223)

add new args to support single relayer and block-engine endpoints (anza-xyz#224)

point to new jito-programs submod and invoke updated init tda instruction (anza-xyz#228)

fix clippy errors (anza-xyz#230)

fix validator start scripts (anza-xyz#232)

Point README to gitbook (anza-xyz#237)

use packaged cargo bin to build (anza-xyz#239)

Add validator identity pubkey to StakeMeta (anza-xyz#226)

The vote account associated with a validator is not a permanent link, so log the validator identity as well.

bugfix: conditionally compile with debug flags (anza-xyz#240)

Seg/tip distributor master (anza-xyz#242)

* validate tree nodes

* fix unit tests

* pr feedback

* bump jito-programs submod

Simplify bootstrapping (anza-xyz#241)

* startup without precompile

* update spacing

* use release mode

* spacing

fix validation

rm validation skip

Account for block builder fee when generating excess tip balance (anza-xyz#247)

Improve docker caching

delay constructing claim mev txs (anza-xyz#253)

fix stake meta tests from bb fee (anza-xyz#254)

fix tests

Buffer bundles that exceed cost model (anza-xyz#225)

* buffer bundles that exceed cost model

clear qos failed bundles buffer if not leader soon (anza-xyz#260)

update Cargo.lock to correct solana versions in jito-programs submodule (anza-xyz#265)

fix simulate_bundle client and better error handling (anza-xyz#267)

update submod (anza-xyz#272)

Preallocate Bundle Cost (anza-xyz#238)

fix Dockerfile (anza-xyz#278)

Fix Tests (anza-xyz#279)

Fix Tests (anza-xyz#281)

* fix tests

update jito-programs submod (anza-xyz#282)

add reclaim rent workflow (anza-xyz#283)

update jito-programs submod

fix clippy errs

rm wrong assertion and swap out file write fn call (anza-xyz#292)

Remove security.md (anza-xyz#293)

demote frequent relayer_stage-stream_error to warn (anza-xyz#275)

account for case where TDA exists but not allocated (anza-xyz#295)

implement better retries for tip-distributor workflows (anza-xyz#297)

limit number of concurrent rpc calls (anza-xyz#298)

Discard Empty Packet Batches (anza-xyz#299)

Identity Hotswap (anza-xyz#290)

small fixes (anza-xyz#305)

Set backend config from admin rpc (anza-xyz#304)

Admin Shred Receiver Change (anza-xyz#306)

Seg/rm bundle UUID (anza-xyz#309)

Fix github workflow to recursively clone (anza-xyz#327)

Add recursive checkout for downstream-project-spl.yaml (anza-xyz#341)

Use cluster info functions for tpu (anza-xyz#345)

Use git rev-parse for git sha

Remove blacklisted tx from message_hash_to_transaction (anza-xyz#374)

Updates bootstrap and start scripts needed for local dev. (anza-xyz#384)

Remove Deprecated Cli Args (anza-xyz#387)

Master Rebase

improve simulate_bundle errors and response (anza-xyz#404)

derive Clone on accountoverrides (anza-xyz#416)

Add upsert to AccountOverrides (anza-xyz#419)

update jito-programs (anza-xyz#430)

[JIT-1661] Faster Autosnapshot (anza-xyz#436)

Reverts simulate_transaction result calls to upstream (anza-xyz#446)

Don't unlock accounts in TransactionBatches used during simulation (anza-xyz#449)

first pass at wiring up jito-plugin (anza-xyz#428)

[JIT-1713] Fix bundle's blockspace preallocation (anza-xyz#489)

[JIT-1708] Fix TOC TOU condition for relayer and block engine config (anza-xyz#491)

[JIT-1710] - Optimize Bundle Consumer Checks (anza-xyz#490)

Add Blockhash Metrics to Bundle Committer (anza-xyz#500)

add priority fee ix to mev-claim (anza-xyz#520)

Update Autosnapshot (anza-xyz#548)

Run MEV claims + reclaiming rent-exempt amounts in parallel. (anza-xyz#582)

Update CI (anza-xyz#584)
- Add recursive submodule checkouts.
- Re-add solana-secondary step

Add more release fixes (anza-xyz#585)

Fix more release urls (anza-xyz#588)

[JIT-1812] Fix blocking mutexs (anza-xyz#495)

 [JIT-1711] Compare the unprocessed transaction storage BundleStorage against a constant instead of VecDeque::capacity() (anza-xyz#587)

Automatically rebase Jito-Solana on a periodic basis. Send message on slack during any failures or success.

Fix periodic rebase anza-xyz#594

Fixes the following bugs in the periodic rebase:
Sends multiple messages on failure instead of one
Cancels entire job if one branch fails

Ignore buildkite curl errors for rebasing and try to keep curling until job times out (anza-xyz#597)

Sleep longer waiting for buildkite to start (anza-xyz#598)

correctly initialize account overrides (anza-xyz#595)

Fix: Ensure set contact info to UDP port instead of QUIC (anza-xyz#603)

Add fast replay branch to daily rebase (anza-xyz#607)

take a snapshot of all bundle accounts before sim (anza-xyz#13) (anza-xyz#615)

update jito-programs submodule

Add 2.0 to daily rebase (anza-xyz#626)

Export agave binaries during docker build (anza-xyz#627)

Buffer bundles that exceed processing time and make the allowed processing time longer (anza-xyz#611)

Publish releases to S3 and GCS (anza-xyz#633)

Rebase from different repos (anza-xyz#637)

Point SECURITY.md to immunefi (anza-xyz#671)

Loosen requirements on tip accounts touchable in BankingStage (anza-xyz#683)

Separate out broadcast + retransmit shredstream (anza-xyz#703)

Add packet flag for staked node (anza-xyz#705)

Add auto-rebase to v2.1 (anza-xyz#739)

Fix release github (anza-xyz#745)

Move block_cost_limit tracking to BankingStage in preparation for SIMD-0207 (anza-xyz#753)

Add precompile checks in BundleStage (anza-xyz#787)

Add auto-rebase to v2.2 (anza-xyz#818)

Add better error handling around missing transaction signatures for bundle id generation (anza-xyz#860)

Remove unwrap from authentication (anza-xyz#861)

Revert Jito-Solana WorkingBankEntry changes (anza-xyz#873)

BP anza-xyz#885: Add libclang to Dockerfile (anza-xyz#886)

Remove the tip distributor code (anza-xyz#888)

Rebase: Update anchor to not use deprecated crates