Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete
#6689
Conversation
Changed the title from "ActiveLeavesUpdate and BlockFinalized events in overseer until major sunc is complete" to "ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete"
8b59b56 to 447698d
447698d to 2244cff
Overseer won't generate any events to the subsystems until initial full sync is complete.
2244cff to 70d031d
node/overseer/src/lib.rs
Outdated
		self.handle_external_request(request);
	}
	Event::MsgToSubsystem { msg, origin } => self.handle_msg_to_subsystem(msg, origin).await?,
	Event::Stop => return self.handle_stop().await,
A concern: I think with the current implementation, Ctrl-C before major sync is complete should not stall/deadlock subsystems, but it would be nice to double-check that.
How should this stall the subsystems?
I've added some tests to cover the cases we care about, plus some corner ones. The final step is to refactor the loops so that code duplication is minimal 😟
node/overseer/src/lib.rs
Outdated
// In theory we can receive `BlockImported` during initial major sync. In this case the
// update will be empty.
let span = match self.span_per_active_leaf.get(&block.hash) {
	Some(span) => span.clone(),
	None => {
		// This should never happen.
		gum::warn!(
			target: LOG_TARGET,
			?block.hash,
			?block.number,
			"Span for active leaf not found. This is not expected"
		);
		let span = Arc::new(jaeger::Span::new(block.hash, "leaf-activated"));
		span
	},
};
let update = ActiveLeavesUpdate::start_work(ActivatedLeaf {
	hash: block.hash,
	number: block.number,
	status: LeafStatus::Fresh,
	span,
});
self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
I don't get why we need this branch. When we are receiving a block import notification while doing major sync, we should not even land in the match block here.
If the finality has stalled and no `BlockImported` events are generated, the subsystems in polkadot won't start, because they initiate work on `ActiveLeavesUpdate`.
Right now this is solved by issuing an `ActiveLeavesUpdate` on startup with the leaves from the DB. This is causing problems though, because subsystems start work for old leaves whose state is usually already pruned, and some subsystems generate a bunch of logs on startup.
Then we decided to send this initial `ActiveLeavesUpdate` after the major sync is done. But here I see two cases:
1. Finality is stalled or we don't get `BlockImported` events for some reason. -> In this case we won't issue `ActiveLeavesUpdate` and the subsystems will be 'inactive'.
2. Everything is fine and we get `BlockImported` at some point after the initial major sync. We don't need any special handling in this case.

So the branch you are referring to tries to solve case 1. We just got `BlockFinalized`, the initial major sync is complete, and I'm trying to guess if an `ActiveLeavesUpdate` will be generated.
What I do:
- Generate an artificial `BlockImported` event by calling `self.block_imported(..)`. It is supposed to generate a non-empty `ActiveLeaves` in 99% of the cases, but what if it doesn't?
- The fat `else` handles this 'what if it doesn't' case, with all the stuff that can go wrong.

I wrote tests to cover some of the cases and today I'm thinking about how to simplify this. I don't like it either :)
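For illustration, a minimal, self-contained sketch of the flow described above. `Overseer`, `BlockInfo` and the method names are simplified stand-ins invented for this example, not the real overseer types; it only shows the "synthesize a `BlockImported`, fall back if it produces nothing" idea.

```rust
/// Simplified stand-in for block metadata (not the real overseer type).
#[derive(Clone, Debug, PartialEq)]
struct BlockInfo {
	hash: [u8; 32],
	number: u32,
}

/// Simplified stand-in for the overseer state relevant to this discussion.
#[derive(Default)]
struct Overseer {
	known_leaves: Vec<BlockInfo>,
}

impl Overseer {
	/// Regular block-import handling: returns the newly activated leaves.
	/// May return nothing, e.g. if the block is already known.
	fn block_imported(&mut self, block: BlockInfo) -> Vec<BlockInfo> {
		if self.known_leaves.contains(&block) {
			return Vec::new();
		}
		self.known_leaves.push(block.clone());
		vec![block]
	}

	/// First `BlockFinalized` after the initial major sync: try to make sure
	/// the subsystems receive at least one active-leaves update.
	fn first_finalized_after_sync(&mut self, finalized: BlockInfo) -> Vec<BlockInfo> {
		// Synthesize an artificial block import for the finalized block.
		let activated = self.block_imported(finalized.clone());
		if !activated.is_empty() {
			activated
		} else {
			// The "fat else": the artificial import produced nothing, so
			// activate the finalized block directly as a last resort.
			vec![finalized]
		}
	}
}

fn main() {
	let mut overseer = Overseer::default();
	let finalized = BlockInfo { hash: [1; 32], number: 100 };
	let leaves = overseer.first_finalized_after_sync(finalized);
	println!("activated leaves: {leaves:?}");
}
```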
> If the finality has stalled and no `BlockImported` events are generated, the subsystems in polkadot won't start, because they initiate work on `ActiveLeavesUpdate`.
I don't get the connection? When finality stalls, we will still import new blocks?
> 1. Finality is stalled or we don't get `BlockImported` events for some reason. -> In this case we won't issue `ActiveLeavesUpdate` and the subsystems will be 'inactive'.
This shouldn't happen.
I mean, in general you could track when major sync is finished and then send the `ActiveLeavesUpdate` for all current leaves. However, I'm not really sure that we need this, as there will always be some block import (yes, with stalled finality it will be slower, but there will be blocks imported).
> I don't get the connection? When finality stalls, we will still import new blocks?
For me 'finality has stalled' means 'no new blocks are generated'. I'm misunderstanding something here - can you explain why this is not true?
We are never not producing new blocks. This is not going to happen; we may slow down block production, but not more. There is block production and then there is finality. Block production is about building new blocks and putting them on top of the chain. Finality is the process of saying that a certain chain up to a certain point cannot be modified anymore, because between the last finalized block and the tip of the chain you can have forks, re-orgs, and you can also go back to older blocks to produce on top of them. However, the moment we have finalized a block, we cannot go back before this block.
> We are never not producing new blocks.
I see... so this means after the initial sync we'll always have BlockImported events and all the precautions in my PR make no sense?
Maybe 👀
I mean, it can happen that we don't produce any blocks, but then we have much bigger problems than the Parachain subsystems not working properly 😬
Ok, what is even the point of acting on a leaf read from the DB? The worst thing that can happen is that a single node does not immediately work on a leaf after a quick restart, which should be no problem at all. Hence, I think this can be simplified by simply not triggering `ActiveLeavesUpdate` on startup for a leaf read from the db, but only for regular block import events.
Subsystems which do care about older blocks already traverse the ancestry.
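A rough sketch of that simplification, again with made-up stand-in types rather than the real overseer API: leaves read from the DB at startup are only recorded, and an update is broadcast solely on regular block imports.

```rust
use std::collections::HashSet;

/// Illustrative block hash type.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Hash([u8; 32]);

/// Simplified stand-in for the overseer.
struct Overseer {
	active_leaves: HashSet<Hash>,
	broadcasts: Vec<Hash>,
}

impl Overseer {
	/// On startup: remember the leaves read from the DB, but do NOT broadcast
	/// an `ActiveLeavesUpdate` for them.
	fn on_startup(db_leaves: Vec<Hash>) -> Self {
		Overseer { active_leaves: db_leaves.into_iter().collect(), broadcasts: Vec::new() }
	}

	/// On a regular block import: the only place that broadcasts.
	fn on_block_imported(&mut self, leaf: Hash) {
		if self.active_leaves.insert(leaf.clone()) {
			// Stand-in for `broadcast_signal(OverseerSignal::ActiveLeaves(..))`.
			self.broadcasts.push(leaf);
		}
	}
}

fn main() {
	let mut overseer = Overseer::on_startup(vec![Hash([0; 32])]);
	assert!(overseer.broadcasts.is_empty()); // nothing broadcast for DB leaves
	overseer.on_block_imported(Hash([1; 32]));
	assert_eq!(overseer.broadcasts.len(), 1); // broadcast only on a real import
}
```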
It's far more elegant this way!
Co-authored-by: Bastian Köcher <[email protected]>
e680a77 to 6bdb390
6ead261 to f8467d8
The CI pipeline was cancelled due to the failure of one of the required jobs.
a813802 to 2e14f46
eskimor left a comment
I am confused. Haven't we just agreed that all that is needed is removal of the initial artificial ActiveLeavesUpdate?
		return Ok(());
	}
},
Event::BlockImported(block) if self.sync_oracle.is_syncing() => {
Why do we still need the oracle? If I understand correctly, we don't get any BlockImported events anyway during sync.
I'm closing this one in favor of #6727. Let's try the simpler solution first.
Adds a major sync oracle to the overseer. The oracle is used to detect when the node has completed its initial full sync.
The PR also changes the behavior of `ActiveLeaves` and `BlockFinalized` event generation: they are not sent until the full initial sync is complete.

Partially addresses paritytech/polkadot-sdk#793. Fixes #6694.
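A minimal sketch of the gating idea, assuming a simple oracle interface (the trait, method and event names here are illustrative, not the exact overseer or `SyncOracle` API): chain events are dropped while the node is still major-syncing, and handled normally afterwards.

```rust
/// Illustrative oracle interface; the real one would be backed by the node's sync state.
trait SyncOracle {
	fn is_major_syncing(&self) -> bool;
}

/// Toy oracle that reports the sync as already finished.
struct AlwaysSynced;
impl SyncOracle for AlwaysSynced {
	fn is_major_syncing(&self) -> bool {
		false
	}
}

/// Illustrative subset of overseer events.
enum Event {
	BlockImported(u32),
	BlockFinalized(u32),
	Stop,
}

fn handle_event(oracle: &impl SyncOracle, event: Event) {
	match event {
		// While major sync is ongoing, chain events are ignored so that
		// subsystems don't start work on stale or transient leaves.
		Event::BlockImported(_) | Event::BlockFinalized(_) if oracle.is_major_syncing() => {},
		Event::BlockImported(n) => println!("broadcast ActiveLeaves for block {n}"),
		Event::BlockFinalized(n) => println!("broadcast BlockFinalized for block {n}"),
		Event::Stop => println!("shutting down"),
	}
}

fn main() {
	let oracle = AlwaysSynced;
	handle_event(&oracle, Event::BlockImported(42));
	handle_event(&oracle, Event::BlockFinalized(41));
	handle_event(&oracle, Event::Stop);
}
```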
ActiveLeavesandBlockFinalizedevents generation. Now they are not sent until the full initial sync is complete.Partially addresses paritytech/polkadot-sdk#793. Fixes #6694.