liamaharon
Collaborator

@liamaharon liamaharon commented Aug 7, 2025

Phase 2 in #1887

Phase 2 additions:

  • Applies the OTF Patched Polkadot SDK
  • Added code to automatically sync Aura keystore with the Babe keystore
  • Added an -otf suffix to subtensor crates that share a name with a polkadot-sdk crate. This prevents name collisions when applying patches and ensures we don't accidentally use the wrong crate.
  • Fixed this vulnerability raised by @shamil-gadelshin
  • Fixed edge case issue where node would not automatically switch from Babe to Aura service
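
To make the keystore sync item above more concrete, here is a minimal, self-contained sketch of mirroring keys from one key type to another. It is not the code from this PR: the `Keystore` struct, its `mirror` method, and the Aura-to-Babe copy direction are all assumptions made for illustration, and a real keystore holds secret key material rather than just public keys. The 4-byte identifiers `aura` and `babe` follow the key type naming used in the Polkadot SDK.

```rust
use std::collections::{BTreeSet, HashMap};

/// 4-byte key type identifier (modelled on the SDK's key type IDs).
pub type KeyTypeId = [u8; 4];
pub const AURA: KeyTypeId = *b"aura";
pub const BABE: KeyTypeId = *b"babe";

/// Both Aura and Babe are modelled here with 32-byte public keys.
pub type PublicKey = [u8; 32];

/// Hypothetical in-memory keystore that tracks public keys only.
#[derive(Default)]
pub struct Keystore {
    by_type: HashMap<KeyTypeId, BTreeSet<PublicKey>>,
}

impl Keystore {
    /// Inserts a key; returns `true` if it was not already present.
    pub fn insert(&mut self, key_type: KeyTypeId, public: PublicKey) -> bool {
        self.by_type.entry(key_type).or_default().insert(public)
    }

    /// Returns all keys stored under `key_type`, in sorted order.
    pub fn keys(&self, key_type: KeyTypeId) -> Vec<PublicKey> {
        self.by_type
            .get(&key_type)
            .map(|set| set.iter().copied().collect())
            .unwrap_or_default()
    }

    /// Copies every key stored under `from` to `to`.
    /// Returns the number of keys that were newly added under `to`.
    pub fn mirror(&mut self, from: KeyTypeId, to: KeyTypeId) -> usize {
        let source = self.keys(from);
        let target = self.by_type.entry(to).or_default();
        source.into_iter().filter(|key| target.insert(*key)).count()
    }
}
```

Calling `mirror(AURA, BABE)` repeatedly is idempotent in this model: the second call adds nothing, which is the property an automatic sync at node startup would want.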

Update Aug 26

I discovered an issue when warp syncing a chain that contains an Aura to Babe migration: the warp sync would succeed up until the first Babe block, and then revert to a regular sync.

Root Cause

Polkadot SDK does not allow starting a service with warp sync if the database is in a partially synced state: https://github.com/opentensor/polkadot-sdk/blob/d13f915d8a1f55af53fd51fdb4544c47badddc7e/substrate/client/network/sync/src/strategy/warp.rs#L235-L252

In the initial implementation of this PR, while a node is warp syncing, it will restart part-way through the sync when it detects the first Babe block.

At the time of the service restart, the node has a partially synced database. This causes the previously referenced code to terminate the warp sync, and fall back to a regular sync.
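
The fallback described above can be modelled in a few lines. This is a simplified sketch, not the code in `warp.rs`: the `SyncMode` enum, the `ChainInfo` struct, and the use of "finalized state is present" as the test for a partially synced database are assumptions made for illustration.

```rust
/// Sync mode requested by the node operator (simplified model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncMode {
    Warp,
    Full,
}

/// Hypothetical, trimmed-down view of the client's chain info.
#[derive(Debug, Clone, Copy)]
pub struct ChainInfo {
    /// `Some(block_number)` once the local database holds finalized state.
    pub finalized_state: Option<u64>,
}

/// Returns the sync mode the service will actually use when it starts.
pub fn effective_sync_mode(requested: SyncMode, info: &ChainInfo) -> SyncMode {
    match (requested, info.finalized_state) {
        // Warp sync was requested, but the database is already partially
        // synced: the service falls back to a regular (full) sync.
        (SyncMode::Warp, Some(_)) => SyncMode::Full,
        // Otherwise the requested mode is honoured.
        (mode, _) => mode,
    }
}
```

In this model a fresh node gets `SyncMode::Warp`, while a node that restarts mid-sync (with state already in its database) gets `SyncMode::Full`, which is the behaviour the restart on the first Babe block triggered.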

Solution

To resolve this issue, we must warp sync the entire chain (Aura AND Babe blocks) without restarting the service.

This is achieved in two parts:

First, I prevented the Aura service from switching to a Babe service while the node is syncing. This was achieved by adding new code here:

let syncing = sync_service
    .status()
    .await
    .is_ok_and(|status| status.warp_sync.is_some() || status.state_sync.is_some());
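
For readers who want to see how that flag gates the service switch, here is a self-contained sketch. The `SyncStatus` struct is a stand-in with only the two fields used above, and `should_switch_to_babe` is a hypothetical helper, not a function from this PR. Note that, as in the original line, a failed status query is treated as "not syncing".

```rust
/// Stand-in for the sync service status, reduced to the two fields
/// inspected by the check. The field types are placeholders.
#[derive(Debug, Default, Clone)]
pub struct SyncStatus {
    pub warp_sync: Option<String>,
    pub state_sync: Option<String>,
}

/// Mirrors the `is_ok_and` check: the node counts as syncing only if the
/// status query succeeded and a warp sync or state sync is in progress.
pub fn is_syncing(status: Result<SyncStatus, ()>) -> bool {
    status.is_ok_and(|s| s.warp_sync.is_some() || s.state_sync.is_some())
}

/// Hypothetical gate: switch from the Aura service to the Babe service
/// only once a Babe block has been seen and the node is no longer syncing.
pub fn should_switch_to_babe(babe_block_seen: bool, status: Result<SyncStatus, ()>) -> bool {
    babe_block_seen && !is_syncing(status)
}
```

`Result::is_ok_and` requires Rust 1.70 or later.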

Second, I added support for the node to import Babe blocks while running an Aura service. This was achieved by replacing the AuraWrappedImportQueue with a HybridImportQueue that contains:

  • A HybridBlockImport that contains inner full implementations of AuraBlockImport and BabeBlockImport
  • A HybridVerifier that contains inner full implementations of the AuraVerifier and BabeVerifier
  • An import_queue function that builds an ImportQueue implementation capable of completely importing both Aura and Babe blocks.
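
The composition listed above can be illustrated with a small model of a verifier that wraps two inner verifiers. This is a sketch only: the `Verifier` trait, `Block` struct, and stub verifiers are simplified stand-ins, and dispatching on the block's consensus engine ID is an assumption, since the PR description does not say how the real HybridVerifier chooses between its inner implementations. The engine ID values `aura` and `BABE` match the constants used in the Polkadot SDK.

```rust
/// 4-byte consensus engine identifier.
pub type ConsensusEngineId = [u8; 4];
pub const AURA_ENGINE_ID: ConsensusEngineId = *b"aura";
pub const BABE_ENGINE_ID: ConsensusEngineId = *b"BABE";

/// Simplified block: only the engine ID of its pre-runtime digest matters here.
pub struct Block {
    pub number: u64,
    pub pre_runtime_engine: ConsensusEngineId,
}

/// Simplified verifier interface; returns the name of the engine that accepted the block.
pub trait Verifier {
    fn verify(&self, block: &Block) -> Result<&'static str, String>;
}

/// Stub standing in for a full Aura verifier.
pub struct AuraVerifier;
impl Verifier for AuraVerifier {
    fn verify(&self, _block: &Block) -> Result<&'static str, String> {
        Ok("aura")
    }
}

/// Stub standing in for a full Babe verifier.
pub struct BabeVerifier;
impl Verifier for BabeVerifier {
    fn verify(&self, _block: &Block) -> Result<&'static str, String> {
        Ok("babe")
    }
}

/// Holds both inner verifiers and forwards each block to the matching one.
pub struct HybridVerifier<A, B> {
    pub aura: A,
    pub babe: B,
}

impl<A: Verifier, B: Verifier> Verifier for HybridVerifier<A, B> {
    fn verify(&self, block: &Block) -> Result<&'static str, String> {
        match block.pre_runtime_engine {
            AURA_ENGINE_ID => self.aura.verify(block),
            BABE_ENGINE_ID => self.babe.verify(block),
            other => Err(format!("unknown consensus engine: {:?}", other)),
        }
    }
}
```

Because both inner verifiers stay fully functional, a single service instance can verify an Aura block followed by a Babe block without being rebuilt, which is what lets the warp sync run to completion without a restart.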

The Aura service is required to construct a BabeConfiguration to pass to the hybrid import_queue, so it can import the first Babe block it encounters. This required me to pull some Babe runtime configuration from #1708 into this PR, specifically the BABE_GENESIS_EPOCH_CONFIG and EPOCH_DURATION_IN_SLOTS.

With these runtime constants, we are able to construct what our initial Babe configuration will be while running an Aura service:

/// Returns what the Babe configuration is expected to be at the first Babe block.
///
/// This is required for the hybrid import queue, so it is ready to validate the first encountered
/// babe block(s) before switching to Babe consensus.
fn get_expected_babe_configuration<B: BlockT, C>(
    client: &C,
) -> sp_blockchain::Result<BabeConfiguration>
where
    C: AuxStore + ProvideRuntimeApi<B> + UsageProvider<B>,
    C::Api: AuraApi<B, AuraAuthorityId>,
{
    let at_hash = if client.usage_info().chain.finalized_state.is_some() {
        client.usage_info().chain.best_hash
    } else {
        client.usage_info().chain.genesis_hash
    };
    let runtime_api = client.runtime_api();
    let authorities = runtime_api
        .authorities(at_hash)?
        .into_iter()
        .map(|a| (BabeAuthorityId::from(a.into_inner()), 1))
        .collect();
    let slot_duration = runtime_api.slot_duration(at_hash)?.as_millis();
    let epoch_config = node_subtensor_runtime::BABE_GENESIS_EPOCH_CONFIG;
    let config = sp_consensus_babe::BabeConfiguration {
        slot_duration,
        epoch_length: node_subtensor_runtime::EPOCH_DURATION_IN_SLOTS,
        c: epoch_config.c,
        authorities,
        randomness: Default::default(),
        allowed_slots: epoch_config.allowed_slots,
    };
    Ok(config)
}
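
Two steps in the function above can be isolated into a runnable sketch: choosing which block to query, and converting the Aura authority set into a Babe authority set where every authority has weight 1. The types below (`PublicKey`, `Hash`, `ChainInfo`) and both helper functions are simplified stand-ins introduced for illustration, not types or functions from the PR.

```rust
/// Stand-in for a 32-byte public key shared by the Aura and Babe authority IDs.
pub type PublicKey = [u8; 32];
/// Babe authority weight.
pub type BabeAuthorityWeight = u64;
/// Stand-in for a block hash.
pub type Hash = [u8; 32];

/// Hypothetical, trimmed-down view of the client's chain info.
pub struct ChainInfo {
    pub finalized_state: Option<Hash>,
    pub best_hash: Hash,
    pub genesis_hash: Hash,
}

/// Picks the block to query the runtime at: the best block when state is
/// available locally, otherwise genesis.
pub fn config_query_hash(info: &ChainInfo) -> Hash {
    if info.finalized_state.is_some() {
        info.best_hash
    } else {
        info.genesis_hash
    }
}

/// Maps each Aura authority to a Babe authority with weight 1,
/// preserving the original order.
pub fn aura_to_babe_authorities(aura: &[PublicKey]) -> Vec<(PublicKey, BabeAuthorityWeight)> {
    aura.iter().map(|key| (*key, 1)).collect()
}
```

Giving every authority the same weight means each one is equally likely to be assigned a slot in this model.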

Summary

With the two changes described in "Solution", when a node is warp syncing, it will warp sync the entire chain (all Aura and Babe blocks) and import the state entirely before it switches to running a Babe service. This resolves the root cause of the issue, which is that the node would restart mid-warp-sync.

It is important this phase is merged prior to phase 3, so node operators have time to upgrade in advance of the runtime upgrade.

Steps to simulate Aura -> Babe migration with finney state

  1. Set up https://github.com/opentensor/baedeker-for-subtensor. Ask Greg if you have questions.
  2. Build Babe NPoS runtime from Permissioned Babe NPoS Runtime #1708
     $ git checkout node-decentralization
     $ cargo b -r -p node-subtensor && cp ./target/release/wbuild/node-subtensor-runtime/node_subtensor_runtime.compact.compressed.wasm ./babe-npos.wasm
  3. Build node from this branch
     $ git checkout hybrid-node
     $ rm ./target/release/node-subtensor && cargo b -r -p node-subtensor
  4. Run Baedeker using node from this branch
     $ cd ../baedeker-for-subtensor
     $ ./localnet-baedeker.sh
  5. Upgrade to Babe NPoS runtime. Ask Liam if you have questions.

QA Checklist

  • --initial-consensus aura gracefully switches to Babe post-upgrade
  • --initial-consensus babe gracefully switches to Aura pre-upgrade
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Babe upgrade works when enacted in early blocks
    • devnet-ready state/runtime
    • hybrid-node state/runtime
  • Babe upgrade works when enacted in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime
  • Warp sync Aura & Babe chain works when switch happened in early blocks
    • devnet-ready state/runtime
    • hybrid-node state/runtime
  • Warp sync Aura & Babe chain works when switch happened in later blocks (where era would be >1)
    • devnet-ready state/runtime
    • hybrid-node state/runtime
    • baedeker-finney state/runtime

@liamaharon liamaharon mentioned this pull request Aug 7, 2025
15 tasks
@liamaharon liamaharon force-pushed the hybrid-node branch 2 times, most recently from 7d8246c to 483729e Compare August 7, 2025 23:27
@liamaharon liamaharon changed the title from "Hybrid Consensus Node + Full Support for Aura -> Babe Runtime Upgrades" to "Support for Aura -> Babe Runtime Upgrades" Aug 7, 2025
@liamaharon liamaharon marked this pull request as ready for review August 7, 2025 23:43
@liamaharon liamaharon marked this pull request as draft August 7, 2025 23:45
@liamaharon liamaharon changed the title from "Support for Aura -> Babe Runtime Upgrades" to "Node Support for Aura -> Babe Runtime Upgrades" Aug 8, 2025
@liamaharon liamaharon marked this pull request as ready for review August 8, 2025 04:38
@liamaharon liamaharon requested a review from sam0x17 August 8, 2025 04:40
@liamaharon liamaharon force-pushed the hybrid-node branch 3 times, most recently from eaba58d to 5d2d0e2 Compare August 8, 2025 05:17
update Cargo.lock
@shamil-gadelshin
Collaborator

shamil-gadelshin commented Sep 3, 2025

The conflict is that these crate names are identical with crates in polkadot-sdk.
This means when we patch the Parity polkadot-sdk crates with the OTF polkadot-sdk, there is a risk that we will accidentally overwrite our crates with the polkadot-sdk versions.
By making the names of our crates unique, it is impossible for us to accidentally patch them.
I can change the name to put -subtensor- in the middle instead of the -otf suffix if you prefer.

After the conversation with @gztensor, we confirmed that we use "-subtensor-" for such cases. @liamaharon, please rename the crates when you have time.

@liamaharon
Collaborator Author

Thanks @gregzaitsev and @shamil-gadelshin, I've replaced the -otf with -subtensor- as suggested.

Collaborator

@shamil-gadelshin shamil-gadelshin left a comment

Thank you!

Consider merging the latest devnet-ready to counter conflicts created by the benchmarking bot.

@liamaharon
Collaborator Author

Thanks @shamil-gadelshin, I have merged devnet-ready and also updated the Polkadot SDK fork to include your sc-transaction-pool change.

gztensor
gztensor previously approved these changes Sep 8, 2025
@sam0x17
Contributor

sam0x17 commented Sep 8, 2025

just some more conflicts, btw will prob want to hold off on deploying this until next week because we are doing 2 quick deploys this week that need to be low-risk

@sam0x17 sam0x17 added the skip-cargo-audit This PR fails cargo audit but needs to be merged anyway label Sep 28, 2025
@sam0x17 sam0x17 merged commit 40945a9 into devnet-ready Sep 28, 2025
63 of 64 checks passed
4 participants