-
Notifications
You must be signed in to change notification settings - Fork 46
Fix OOM error in large parentchain syncs with sidechain feature. #1493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…e arg to a borrow
# Conflicts: # service/src/parentchain_handler.rs
# Conflicts: # core/parentchain/block-import-dispatcher/src/lib.rs # core/parentchain/block-import-dispatcher/src/triggered_dispatcher.rs # service/src/main_impl.rs
brenzi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I manually tested our privacy sidechain demo (two rococo-local parachains, one sidechain worker) with this and all works
| parentchain_id | ||
| ); | ||
| self.block_importer | ||
| .import_parentchain_blocks(blocks, events) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this simple is_syncing bool flag is not enough
- It needs to be clear to other validateers what parentchain block is the first one which must be considered for indirect invocation (in SCV and OCW mode). therefore, that "first relevant parentchain block should be stored in the STF state. Most likely, we should define this as the block when the primary worker enclave got registered for the first time, because the shard state will only be written to once the primary worker is in sync and starts to process parentchain blocks. caveat: if the shardid differs from mrenclave and the shardid gets registered a lot later, we will be importing too many blocks - but I don't think this is critical
- callin it "is_syncing" is misleading. we're always syncing. I think it should be called "is_initial_sync" or similar
- for the initial sync, the events and indirect invocations shouldn't even be searched for and they shouldn't be validated using proofs to accelerate the process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with all points. I think there was a reason why it has been implemented the way it is now, but I forgot, and I don't see a reason why this is currently like that.
- Substrate introduced the term
is_major_sync, which is true whenever a node is far behind the best block. This concept does not really apply in our case, as we will send the light-client db to an outdated worker. However, what I am trying to say is that we stay somewhat in the substrate-jargon if we useis_initial_sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be more descriptive to call the flag ignore_invocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, if this is the only thing that this flag is going to be used for yes, but I am not 100% if it is. Regardless, as long as we keep it like that, I agree. It could also be something like: verify_only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we could also use this as some point maybe for non-authoring sidechain rpc-nodes, if we ever want to have them.
The problem behind the OOM error was that we had synced the whole parentchain before we imported the blocks into the light client. This PR changes the behaviour to trigger the import of the parentchain blocks after each chunk in the sync_parentchain_and_import_blocks function. The behaviour of sync_parentchain has not been changed.
This is because it is either used by non-primary workers, which have only small syncs, as they receive a light-client-db from the primary worker at startup, or it is used in the teeracle/offchain worker modes, which directly import the parentchain blocks when they are synced.
However, naively fixing this lead to another suboptimal behaviour: During the initial sync of the parentchain, we always put 1000 blocks and their corresponding events into the queue, just to be popped immediately afterwards by a subsequent ffi-ecall. This was so inefficient that the syncing became very slow. Hence, I introduced an
is_syncingflag into thesync_parentchainecall, which bypasses the queuing process and directly imports it, this also made thetrigger_parentchain_block_importecall obsolete, thus and I removed it.Another minor fix is included that: