| | |
| --- | --- |
| Start Date | 03 November 2023 |
| Description | An evenly-distributing indirection layer between availability chunks and validators. |
| Authors | Alin Dima |
Propose a way of randomly permuting the availability chunk indices assigned to validators for a given core and relay chain block, in the context of recovering available data from systematic chunks, with the purpose of fairly distributing network bandwidth usage.
Currently, the `ValidatorIndex` is always identical to the `ChunkIndex`. Since the validator array is only shuffled once per session, naively using the `ValidatorIndex` as the `ChunkIndex` would place unreasonable stress on the first N/3 validators during an entire session, when favouring availability recovery from systematic chunks.
Therefore, the relay chain node needs a deterministic way of evenly distributing the first ~(N_VALIDATORS / 3) systematic availability chunks to different validators, based on the session, relay chain block number and core. The main purpose is to ensure fair distribution of network bandwidth usage for availability recovery in general and in particular for systematic chunk holders.
Relay chain node core developers.
An erasure coding algorithm is considered systematic if it preserves the original unencoded data as part of the resulting code. The implementation of the erasure coding algorithm used for polkadot's availability data is systematic. Roughly speaking, the first N_VALIDATORS/3 chunks of data can be cheaply concatenated to retrieve the original data, without running the resource-intensive and time-consuming reconstruction algorithm.
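The "first N_VALIDATORS/3 chunks" is more precisely the systematic recovery threshold: the largest power of two not exceeding `(n_validators - 1) / 3`. A minimal sketch of the helper functions the pseudocode in this document relies on (assumed shapes that mirror the pseudocode, not the actual erasure-coding crate API):

```rust
/// True if `n` is a power of two (zero is not).
fn is_power_of_two(n: usize) -> bool {
    n != 0 && (n & (n - 1)) == 0
}

/// Largest power of two that is <= `n` (for n >= 1).
fn next_lower_power_of_2(n: usize) -> usize {
    1usize << (usize::BITS - 1 - n.leading_zeros())
}

/// Number of chunks needed for systematic recovery: the largest power of
/// two not exceeding (n_validators - 1) / 3.
fn systematic_threshold(n_validators: usize) -> usize {
    let mut threshold = (n_validators - 1) / 3;
    if !is_power_of_two(threshold) {
        threshold = next_lower_power_of_2(threshold);
    }
    threshold
}
```

For example, with 1000 validators the threshold is 256 (the largest power of two below 333), so 256 chunks suffice for the cheap recovery path.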
Here's the concatenation procedure of systematic chunks for polkadot's erasure coding algorithm (minus error handling, for brevity):
```rust
pub fn reconstruct_from_systematic<T: Decode>(
    n_validators: usize,
    chunks: Vec<&[u8]>,
) -> T {
    let mut threshold = (n_validators - 1) / 3;
    if !is_power_of_two(threshold) {
        threshold = next_lower_power_of_2(threshold);
    }

    let shard_len = chunks.iter().next().unwrap().len();
    let mut systematic_bytes = Vec::with_capacity(shard_len * threshold);

    // The encoding interleaves the original data across the first
    // `threshold` chunks, two bytes at a time.
    for i in (0..shard_len).step_by(2) {
        for chunk in chunks.iter().take(threshold) {
            systematic_bytes.push(chunk[i]);
            systematic_bytes.push(chunk[i + 1]);
        }
    }

    Decode::decode(&mut &systematic_bytes[..]).expect("systematic bytes should decode")
}
```

At this moment, the availability recovery process differs based on the estimated size of the available data:
(a) for small PoVs (up to 128 KiB), sequentially try requesting the unencoded data from the backing group, in a random order. If this fails, fall back to option (b).
(b) for large PoVs (over 128 KiB), launch N parallel requests for the erasure-coded chunks (currently, N has an upper limit of 50), until enough chunks have been recovered. Validators are tried in a random order. Then, reconstruct the original data.
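This size-based dispatch can be sketched as follows. The constant, type, and function names here are illustrative placeholders, not the actual subsystem's API:

```rust
// Hypothetical sketch of the size-based strategy selection described above.
const SMALL_POV_LIMIT: usize = 128 * 1024; // 128 KiB

#[derive(Debug, PartialEq)]
enum RecoveryStrategy {
    /// Sequentially request the full unencoded data from backers.
    FullFromBackers,
    /// Launch parallel chunk requests and run reconstruction.
    ChunksFromValidators,
}

fn pick_strategy(estimated_pov_size: usize) -> RecoveryStrategy {
    if estimated_pov_size <= SMALL_POV_LIMIT {
        RecoveryStrategy::FullFromBackers
    } else {
        RecoveryStrategy::ChunksFromValidators
    }
}
```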
As part of the effort of increasing polkadot's resource efficiency, scalability and performance, work is under way to modify the Availability Recovery subsystem by leveraging systematic chunks. See this comment for preliminary performance results.
In this scheme, the relay chain node will first attempt to retrieve the ~N/3 systematic chunks from the validators that should hold them, before falling back to recovering from regular chunks, as before.
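A toy sketch of that systematic-first decision, ignoring the two-byte interleaving shown earlier and using hypothetical names: if any of the systematic chunks is missing, the caller must fall back to full reconstruction from any `threshold` chunks.

```rust
/// Toy systematic-first recovery: if all `threshold` systematic chunks are
/// present, concatenate them (the cheap path); otherwise return None to
/// signal that the caller must fall back to regular reconstruction.
fn recover_systematic(chunks: &[Option<Vec<u8>>], threshold: usize) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for chunk in chunks.iter().take(threshold) {
        // `?` bails out on the first missing systematic chunk.
        out.extend_from_slice(chunk.as_deref()?);
    }
    Some(out)
}
```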
The function that decides the chunk index for a validator should be parameterized by at least
`(validator_index, session_index, block_number, core_index)`
and have the following properties:
- deterministic
- pseudo-random
- relatively quick to compute and resource-efficient
- when considering the other params besides `validator_index` as fixed, the function should describe a random permutation of the chunk indices
- considering `session_index` and `block_number` as fixed arguments, the validators that map to the first N/3 chunk indices should have as little overlap as possible for different cores

In other words, we want a uniformly distributed, deterministic mapping from `ValidatorIndex` to `ChunkIndex` per block per core.
Pseudocode:

```rust
pub fn get_chunk_index(
    n_validators: u32,
    validator_index: ValidatorIndex,
    session_index: SessionIndex,
    block_number: BlockNumber,
    core_index: CoreIndex
) -> ChunkIndex {
    let threshold = systematic_threshold(n_validators); // Roughly n_validators / 3
    let seed = derive_seed(session_index, block_number);
    let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);

    let mut chunk_indices: Vec<ChunkIndex> = (0..n_validators).map(Into::into).collect();
    chunk_indices.shuffle(&mut rng);

    let core_start_pos = threshold * core_index.0;
    chunk_indices[((core_start_pos + validator_index.0) % n_validators) as usize]
}
```

The function should be implemented as a runtime API, because:
- it enables further atomic changes to the shuffling algorithm.
- it enables alternative client implementations (in other languages) to use it.
- considering how critical it is for parachain consensus that all validators have a common view of the Validator->Chunk mapping, it mitigates the problem of third-party libraries changing the implementations of `ChaCha8Rng` or of `rand::shuffle` in future versions. Such a change would stall parachains if only a portion of validators upgraded the node.
Additionally, so that client code is able to efficiently get the mapping from the runtime, another API will be added for retrieving chunk indices in bulk for all validators at a given block and core:
```rust
pub fn get_chunk_indices(
    n_validators: u32,
    session_index: SessionIndex,
    block_number: BlockNumber,
    core_index: CoreIndex
) -> Vec<ChunkIndex> {
    let threshold = systematic_threshold(n_validators); // Roughly n_validators / 3
    let seed = derive_seed(session_index, block_number);
    let mut rng: ChaCha8Rng = SeedableRng::from_seed(seed);

    let mut chunk_indices: Vec<ChunkIndex> = (0..n_validators).map(Into::into).collect();
    chunk_indices.shuffle(&mut rng);

    let core_start_pos = threshold * core_index.0;
    chunk_indices
        .into_iter()
        .cycle()
        .skip(core_start_pos as usize)
        .take(n_validators as usize)
        .collect()
}
```

Considering that the Validator->Chunk mapping is critical to parachain consensus, the change needs to be enacted atomically via governance, only after all validators have upgraded the node to a version that is aware of this mapping. It needs to be explicitly stated that, after the runtime upgrade and governance enactment, validators that run older client versions that don't support this mapping will not be able to participate in parachain consensus.
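To illustrate why the bulk API and the per-validator API must agree, here is a dependency-free toy model of the shuffle-and-rotate scheme. An inline xorshift64 step stands in for `ChaCha8Rng`, and all names are illustrative, so the actual output will differ from the real implementation; the point is the invariant that element `i` of the bulk result equals the per-validator lookup for `validator_index = i`.

```rust
/// Advance a xorshift64 state and return the new value (toy RNG only).
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Fisher-Yates shuffle of 0..n_validators, driven by the toy RNG.
fn shuffled_indices(seed: u64, n_validators: u32) -> Vec<u32> {
    let mut v: Vec<u32> = (0..n_validators).collect();
    let mut state = seed | 1; // avoid the all-zero xorshift state
    for i in (1..v.len()).rev() {
        let j = (xorshift64(&mut state) % (i as u64 + 1)) as usize;
        v.swap(i, j);
    }
    v
}

/// Per-validator lookup, mirroring `get_chunk_index`.
fn chunk_index(seed: u64, n_validators: u32, threshold: u32, core: u32, validator: u32) -> u32 {
    let v = shuffled_indices(seed, n_validators);
    let start = threshold * core;
    v[((start + validator) % n_validators) as usize]
}

/// Bulk lookup, mirroring `get_chunk_indices`: the shuffled list rotated
/// left by the core's starting position.
fn chunk_indices(seed: u64, n_validators: u32, threshold: u32, core: u32) -> Vec<u32> {
    let v = shuffled_indices(seed, n_validators);
    let start = (threshold * core) as usize;
    v.into_iter().cycle().skip(start).take(n_validators as usize).collect()
}
```

Because both functions shuffle with the same seed and only differ in how they read out the rotation, the bulk result is always a permutation of `0..n_validators` that agrees element-wise with the single-index API.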
Availability recovery can currently be triggered by the following steps in the polkadot protocol:
- During the approval voting process.
- By other collators of the same parachain.
- During disputes.
The `core_index` refers to the index of the core that the candidate was occupying while it was pending availability (from backing to inclusion).
Getting the right core index for a candidate can prove troublesome. Here's a breakdown of how different parts of the
node implementation can get access to this data:
- The approval-voting process for a candidate begins after observing that the candidate was included. Therefore, the node has easy access to the block where the candidate got included (and also to the core that it occupied).
- The `pov_recovery` task of the collators starts availability recovery in response to noticing a candidate getting backed, which enables easy access to the core index the candidate started occupying.
- Disputes may be initiated on a number of occasions:
  - 3.a. A dispute is initiated by the validator as a result of finding an invalid candidate while participating in the approval-voting protocol. In this case, availability recovery is not needed, since the validator already issued their vote.
  - 3.b. A dispute is initiated by the validator noticing dispute votes recorded on-chain. In this case, we can safely assume that the backing event for that candidate has been recorded and kept in memory.
  - 3.c. A dispute is initiated as a result of getting a dispute statement from another validator. It is possible that the dispute is happening on a fork that was not yet imported by this validator, so the subsystem may not have seen this candidate being backed.
As a solution to 3.c, a new version of the disputes request-response networking protocol will be added. This message type will include the relay block hash where the candidate was included. This information will be used to query the runtime API and retrieve the core index that the candidate was occupying.
The usage of the systematic data availability recovery feature will also be conditional on all nodes using the V2 disputes networking protocol.
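A rough sketch of what the extended request might carry. All field and type names here are hypothetical; the actual protocol types live in the polkadot networking code:

```rust
// Hypothetical shape of the V2 dispute request, extended with the hash of
// the block where the candidate was included. Names are illustrative only.
type Hash = [u8; 32];
type SessionIndex = u32;

struct DisputeRequestV2 {
    /// The disputed candidate's hash.
    candidate_hash: Hash,
    /// The session in which the candidate appeared.
    session_index: SessionIndex,
    /// Hash of the relay chain block where the candidate was included.
    /// The receiver uses it to query the runtime API for the core index
    /// the candidate occupied, enabling systematic recovery.
    included_block_hash: Hash,
}
```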
As an alternative to `core_index`, the `ParaId` could be used. It has the advantage of being readily available in the `CandidateReceipt`, which would enable the dispute communication protocol to remain unchanged and would simplify the implementation.
However, in the context of CoreJam, `ParaId`s will no longer exist (at least not in their current form).
Using the candidate hash as a random seed for a shuffle is another option.
This approach involves a fair amount of technical complexity:
- It introduces another runtime API that is going to be queried by multiple subsystems. With adequate client-side caching, this should be acceptable.
- It requires a networking protocol upgrade on the disputes request-response protocol.
Extensive testing will be conducted, both automated and manual. This proposal doesn't affect security or privacy.
This is a necessary DA optimisation, as Reed-Solomon erasure coding has proven to be a top consumer of CPU time in polkadot as we scale up the parachain block size and number of availability cores.
With this optimisation, preliminary performance results show that CPU time used for Reed-Solomon coding can be halved and total PoV recovery time can decrease by 80% for large PoVs. See more here.
Not applicable.
This is a breaking change. See "upgrade path" section above. All validators need to have upgraded their node versions before the feature will be enabled via a runtime upgrade and governance call.
See comments on the tracking issue and the in-progress PR
- Is it a future-proof idea to utilise the core index as a parameter of the chunk index compute function? Is there a better alternative that avoids complicating the implementation?
- Is there a better upgrade path that would preserve backwards compatibility?
This enables future optimisations for the performance of availability recovery, such as retrieving batched systematic chunks from backers/approval-checkers.