This repository was archived by the owner on Nov 15, 2023. It is now read-only.
Closed
Changes from 1 commit
Fix initial sync loop
tdimitrov committed Feb 13, 2023
commit 27b0cf53f1b5de5116c851cce8c90c44cc28d7c6
72 changes: 39 additions & 33 deletions node/overseer/src/lib.rs
@@ -804,40 +804,46 @@ where
Event::Stop => return self.handle_stop().await,
A concern: I think that with the current impl, Ctrl-C before the major sync is complete should not stall/deadlock subsystems, but it would be nice to double-check that.

Member
How should this stall the subsystems?

Event::BlockImported(block) => _ = self.block_imported(block).await,
Event::BlockFinalized(block) if self.sync_oracle.finished_syncing() => {
	// Send initial `ActiveLeaves`
	let update = self.block_imported(block.clone()).await;
	if !update.is_empty() {
		self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
	} else {
		// In theory we can receive `BlockImported` during initial major sync. In this case the
		// update will be empty.
		let span = match self.span_per_active_leaf.get(&block.hash) {
			Some(span) => span.clone(),
			None => {
				// This should never happen.
				gum::warn!(
					target: LOG_TARGET,
					?block.hash,
					?block.number,
					"Span for active leaf not found. This is not expected"
				);
				let span = Arc::new(jaeger::Span::new(block.hash, "leaf-activated"));
				span
			}
		};
		let update = ActiveLeavesUpdate::start_work(ActivatedLeaf {
			hash: block.hash,
			number: block.number,
			status: LeafStatus::Fresh,
			span,
		});
		self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
Member
I don't get why we need this branch? When we are receiving a block import notification while doing major sync, we should not even land in the match block here.

Contributor Author
If finality has stalled and no BlockImported events are generated, the subsystems in polkadot won't start, because they initiate work on ActiveLeavesUpdate.

Right now this is solved by issuing an ActiveLeavesUpdate on startup with the leaves from the DB. This is causing problems though, because subsystems start work for old leaves whose state is usually pruned, and some subsystems generate a bunch of logs on startup.

Then we decided to send this initial ActiveLeavesUpdate after the major sync is done. But here I see two cases:

  1. Finality is stalled or we don't get BlockImported events for some reason. -> In this case we won't issue ActiveLeavesUpdate and the subsystems will be 'inactive'.
  2. Everything is fine and we get BlockImported at some point after the initial major sync. We don't need any special handling in this case.

So the branch you are referring to tries to solve case 1. We just got BlockFinalized, the initial major sync is complete, and I'm trying to guess whether an ActiveLeavesUpdate will be generated.

What I do:

  • Generate an artificial BlockImported event by calling self.block_imported(..). It is supposed to generate a non-empty ActiveLeaves in 99% of the cases, but what if it doesn't?
  • The fat else handles this 'what if it doesn't' case, with all the stuff that can go wrong.

I wrote tests to cover some of the cases, and today I'm thinking about how to simplify this. I don't like it either :)

Member
> If the finality has stalled and no BlockImported events are generated the subsystems in polkadot won't start because they initiate work on ActiveLeavesUpdate.

I don't get the connection? When finality stalls, we will still import new blocks?

> 1. Finality is stalled or we don't get BlockImported events for some reason. -> In this case we won't issue ActiveLeavesUpdate and the subsystems will be 'inactive'.

This shouldn't happen.

I mean, in general you could track when the major sync is finished and then send the ActiveLeavesUpdate for all current leaves. However, I'm not really sure that we need this, as there will always be some block import (yes, with stalled finality it will be slower, but there will be blocks imported).

Contributor Author
> I don't get the connection? When finality stalls, we will still import new blocks?

For me, 'finality has stalled' means 'no new blocks are generated'. I must be misunderstanding something here - can you explain why this is not true?

Member
We are never not producing any new blocks. This is not going to happen; we may slow down block production, but not more. There is block production, and then there is finality. Block production is for building new blocks and putting them on top of the chain. Finality is the process of saying that a certain chain up to a certain point cannot be modified anymore, because between the last finalized block and the tip of the chain you can have forks and re-orgs, and you can also go back to older blocks to produce on them. However, the moment we have finalized a block, we cannot go before this block.

Contributor Author
> We are never not producing any new blocks.

I see... so this means after the initial sync we'll always have BlockImported events and all the precautions in my PR make no sense?

Member
Maybe 👀

I mean, it can happen that we don't produce any blocks, but then we have much bigger problems than Parachain subsystems not working properly 😬

Member
Ok, what is even the point of acting on a leaf read from the DB? The worst thing that can happen is that a single node does not start work on a leaf immediately after a quick restart, which should be no problem at all. Hence, I think this can be simplified by simply not triggering ActiveLeavesUpdate on startup for a leaf read from the DB, but only for regular block import events.

Subsystems which do care about older blocks already traverse the ancestry.

Contributor Author
It's far more elegant this way!

	}

	// Initial sync is complete
	self.block_finalized(&block).await;
	// Send initial `ActiveLeaves`
	for (hash, number) in self.active_leaves.clone().into_iter() {
		let span = match self.span_per_active_leaf.get(&hash) {
			Some(span) => span.clone(),
			None => {
				// This should never happen. Spans are generated in `on_head_activated`
				// which is called from `block_imported`. Despite not sending a signal
				// `BlockImported` events are handled so a span should exist for each
				// active leaf.
				gum::error!(
					target: LOG_TARGET,
					?hash,
					?number,
					"Span for active leaf not found. This is not expected"
				);
				let span = Arc::new(jaeger::Span::new(hash, "leaf-activated"));
				self.span_per_active_leaf.insert(hash, span.clone());
				span
			}
		};

		let update = ActiveLeavesUpdate::start_work(ActivatedLeaf {
			hash,
			number,
			status: LeafStatus::Fresh,
			span,
		});
		self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
	}
	// Send initial `BlockFinalized`
	self.broadcast_signal(OverseerSignal::BlockFinalized(block.hash, block.number)).await?;
	break
	let update = self.block_finalized(&block).await;
	if !update.is_empty() {
		self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
	}

	// Send initial `BlockFinalized`
	self.broadcast_signal(OverseerSignal::BlockFinalized(block.hash, block.number)).await?;
	break
},
Event::BlockFinalized(block) => _ = self.block_finalized(&block).await,
Event::ExternalRequest(request) => self.handle_external_request(request)