Skip to content

Conversation

@clangenb
Copy link
Contributor

@clangenb clangenb commented Nov 15, 2023

The problem behind the OOM error was that we had synced the whole parentchain before we imported the blocks into the light client. This PR changes the behaviour to trigger the import of the parentchain blocks after each chunk in the sync_parentchain_and_import_blocks function. The behaviour of sync_parentchain has not been changed.

This is because it is either used by non-primary workers, which have only small syncs, as they receive a light-client-db from the primary worker at startup, or it is used in the teeracle/offchain worker modes, which directly import the parentchain blocks when they are synced.

However, naively fixing this lead to another suboptimal behaviour: During the initial sync of the parentchain, we always put 1000 blocks and their corresponding events into the queue, just to be popped immediately afterwards by a subsequent ffi-ecall. This was so inefficient that the syncing became very slow. Hence, I introduced an is_syncing flag into the sync_parentchain ecall, which bypasses the queuing process and directly imports it, this also made the trigger_parentchain_block_import ecall obsolete, thus and I removed it.

Another minor fix is included that:

@clangenb clangenb changed the title Cl/fix sidechain oom Fix OOM error in large parentchain syncs with sidechain feature. Nov 15, 2023
@clangenb clangenb added A0-core Affects a core part A3-sidechain B1-releasenotes C3-medium 📣 Elevates a release containing this PR to "medium priority" E0-breaksnothing labels Nov 15, 2023
# Conflicts:
#	service/src/parentchain_handler.rs
# Conflicts:
#	core/parentchain/block-import-dispatcher/src/lib.rs
#	core/parentchain/block-import-dispatcher/src/triggered_dispatcher.rs
#	service/src/main_impl.rs
@clangenb clangenb requested a review from brenzi November 30, 2023 00:21
Copy link
Collaborator

@brenzi brenzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I manually tested our privacy sidechain demo (two rococo-local parachains, one sidechain worker) with this and all works

@brenzi brenzi merged commit 597c787 into master Nov 30, 2023
parentchain_id
);
self.block_importer
.import_parentchain_blocks(blocks, events)
Copy link
Collaborator

@brenzi brenzi Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this simple is_syncing bool flag is not enough

  • It needs to be clear to other validateers what parentchain block is the first one which must be considered for indirect invocation (in SCV and OCW mode). therefore, that "first relevant parentchain block should be stored in the STF state. Most likely, we should define this as the block when the primary worker enclave got registered for the first time, because the shard state will only be written to once the primary worker is in sync and starts to process parentchain blocks. caveat: if the shardid differs from mrenclave and the shardid gets registered a lot later, we will be importing too many blocks - but I don't think this is critical
  • callin it "is_syncing" is misleading. we're always syncing. I think it should be called "is_initial_sync" or similar
  • for the initial sync, the events and indirect invocations shouldn't even be searched for and they shouldn't be validated using proofs to accelerate the process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with all points. I think there was a reason why it has been implemented the way it is now, but I forgot, and I don't see a reason why this is currently like that.

  • Substrate introduced the term is_major_sync, which is true whenever a node is far behind the best block. This concept does not really apply in our case, as we will send the light-client db to an outdated worker. However, what I am trying to say is that we stay somewhat in the substrate-jargon if we use is_initial_sync.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be more descriptive to call the flag ignore_invocations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, if this is the only thing that this flag is going to be used for yes, but I am not 100% if it is. Regardless, as long as we keep it like that, I agree. It could also be something like: verify_only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we could also use this as some point maybe for non-authoring sidechain rpc-nodes, if we ever want to have them.

@clangenb clangenb deleted the cl/fix-sidechain-oom branch December 18, 2023 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A0-core Affects a core part A3-sidechain B1-releasenotes C3-medium 📣 Elevates a release containing this PR to "medium priority" E0-breaksnothing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

startup error: spam the loop to get parentchain blocks until it has finalized the sync target. sidechain oom & preserve db

3 participants