-
Notifications
You must be signed in to change notification settings - Fork 253
SIMD-0072: Feature Gate Threshold Automation #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
SIMD-0072: Feature Gate Threshold Automation #72
Conversation
ripatel-fd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this SIMD - I'd like to ask about some clarification.
On first read, it wasn't immediately clear to me whether this change will install more human approval or less. Would it be possible to clarify in the summary whether:
- Validator software will automatically signal all of its supported features when started, without further prompt from the operator.
- Validator operators will now have to manually arm for feature gates via a validator CLI command.
The summary mentions that the process is automated and manual human action is removed, which would imply 1. But further below it mentions that the operator runbook now includes the new transaction step, which would imply 2.
FWIW, I oppose mechanism 1 because it'd make it virtually impossible to deploy a Labs-independent client safely. For various reasons, I expect Firedancer (validator software) to take a bit longer to implement new feature gates, and have a small minority of stake for the forseeable future. I'm worried that in mechanism 1, Labs nodes would just overrule other nodes all the time, and activate features that are not supported by all implementations.
@ripatel-fd : However, the more fundamental principle of this proposal is that actual activation of the feature is automatic based on percentage of stake support, default 95%. While the Labs client no longer represents anywhere near 95% of stake, it's true that no one entity or client implementation (including Firedancer) would be able to block activation en masse until they represent 5+% of stake. Even if this SIMD is approved and implemented, feature-gate key-holders can continue to negotiate with all the clients to determine when to queue the feature for activation, as we do now -- and we envision separate SIMDs around formal governance on approving/queuing proposed features. So this SIMD wouldn't prevent the level of human control that currently exists. |
lheeger-jump
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this proposal is a much needed one. As I have mentioned though, in my comments, I think we need a separate SIMD for the new program and everything about how it will work.
I also think some consideration as to how this can tie into a governance-based mechanism would be good. For example, governance could approve the creation of new features, the description of such a vote would specify exactly what the new feature will do.
Good stuff so far!
|
We (@ripatel-fd and I) spent some more time discussing this proposal and we had several more comments on it:
We propose that there is 3 phase system for feature gate ratification:
We would be happy to collaborate on proposal(s)/specification(s) which go in the above direction. As it stands, it introduces the aforementioned bootstrapping issue, which would give no recourse for the community to dissent to new features. This could prolong the development of clients with less resources. More importantly than our own concerns though (which are admittedly self-interested) this proposal would further Solana Labs' hegemony on the chain via their control of client features. This may be against the wants of validators who actually control the blockspace of Solana. It should not be the version of the Solana Labs client or the clients implicit capabilities which control the blockspace, it must be the people. |
@lheeger-jump , thanks for your thoughts. To clarify, the system we envision is 2 phases:
To me, these two steps are synonymous to the Version Availability and Activation steps you describe, with the same essential stake and participation requirements, just in the opposite order. Can you please expand on how changing the order of these steps changes the amount of control any particular client wields? |
|
Thanks for clarifying @CriesofCarrots. I should have read the SIMD more carefully. I think we are ideologically aligned in that this is only a "feature support discovery" mechanism, and that the actual activation logic can be specified in a separate PR. This is definitely a step in the right direction. The above criticism regarding risk to another clients that me and @lheeger-jump posted is therefore invalid. We had some thoughts about how the permissioned part can be removed (adding new features to the queue). It would involve changing the negotiation API like so:
Example API enum FeatureInstruction {
/// Initializes a feature account
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
Initialize,
/// Closes a feature account without nominating it
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
ClosesFeature,
/// Advertise support for feature. Can only be called by current leader, once per rotation.
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` Current leader
LeaderAdvertise,
/// Nominates a feature for a governance vote.
/// # Account references
/// 0. `[writable]` The feature account
/// 1. `[signer]` The feature authority
Nominate,
}The disadvantage of this change is that it increases the amount of transactions and data required to activate a feature. |
This should be more clear in the SIMD and should be very explicit. Thanks for the clarification. Nonetheless, the arming mechanism described is relevant to the discussion. |
The 3 steps I described are to allow anyone to create a new feature, but not to allow them to be able to be the activator (that would be governance). I think what I am saying is that anyone can do creation (i.e. anyone can propose new features) and that is separate from activation which should not be voted on. I think you basically say that here:
By splitting up the first step into creation and then activation after version availability is satisfied, you disallow the feature owner from ramming the feature on and the choice to enable is explicit. Less about client control. The client control is more about a lack of governance on feature activation. |
546b3ab to
d1817fa
Compare
ca3d674 to
2497aee
Compare
|
Hey everyone! It's time to dust this one off again now that we have #88 and #89 merged! We've introduced a completely revised version of this proposal, re-written from scratch. It should assess all of the discussion points raised above in previous threads, as well as numerous concerns raised in various offline discussions. I'd love it if we can open back up the review process for this one, and start working toward a new feature activation process! @CriesofCarrots @lheeger-jump @ripatel-fd P.S. I've included a link to the Feature Gate program's reference implementation in the PR description. |
|
@ripatel-fd and I can take another look in the next few days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the general approach!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, added some clarifying comments/questions, the answers to which would be good to have in the SIMD text.
Basically trying to be explicit about all assumptions/details, so that a client can implement this SIMD using just this SIMD.
Nice one!
a643bdf to
177b45a
Compare
CriesofCarrots
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One substantive question, otherwise mostly suggestions for clarity.
| Transactions sent too late (< 4500 slots before the epoch end) will be rejected | ||
| by the Feature Gate program. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need any deadline/restriction? I see the program cache being alluded to, but I don't see any specification that changes to the Staged Features PDA must begin or conclude at some point earlier than the epoch boundary. It's not obvious why the Feature Gate program can't continue to modify the Staged Features PDA up to the epoch boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to explicitly explain the reasoning for this constraint in the SIMD, or are you asking why we actually need such a constraint? In either case, I agree it should be more explicit in the proposal.
It seems to me that @Lichtso intends to increase the cache preparation phase's initiation to be sooner in the epoch (comment), so we probably need a more concrete explanation as to why the Feature Gate program should have this constraint.
In this proposal, it was initially set to 128 slots remaining in the epoch because the cache preparation stage begins at a maximum of 128 slots before the end of the epoch, as computed here.
It's worth calling out that if we configure this constraint in the Feature Gate program and then later need to adjust it for an earlier cache preparation start, we'll be bound to using a feature gate for that change to the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm asking why we need a constraint at all. The cache being referenced is supposedly an agave-specific implementation, not part of the runtime protocol. If feature activation needs to work around any aspect of cache behavior, that seems like (a) a problem; and (b) a thing that definitely needs to be specified for all clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's really nothing stopping the cache from being able to anticipate the upcoming feature environment by querying the PDA under the Feature Gate program that would tell it what the current stake support is for each feature. It's actually arguably equally or more deterministic than the way it's done now.
The cache could run the calculation at its current 128 slots before epoch boundary and probably get a pretty good understanding of which features have enough stake support. Increasing that slot buffer beyond 128 could just mean increasing the likelihood that not enough stake has been signaled for a feature. Perhaps this is a compromise the cache implementation should consider? After all, I believe recompilation is an optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a commit removing the "slots remaining" constraint, since I think the above is a sensible direction to head in. If there's explicit opposition, it can be added back in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my imprecise formulation, I meant scheduled for activation.
If a feature is scheduled for activation in some slot close to the boundary, then it is possible that the TX is not yet rooted. Thus, some validator nodes will not have it scheduled, and will not activate it on the boundary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain how this is different from the cluster coming to consensus on any other tx/block on which other logic depends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For features which only affect their own fork this is irrelevant. But for features which change validator global behavior this would break. And I am not sure if there is much to be gained from making exceptions or harving different feature types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding an additional point of clarity here, if it helps.
Right now, in Agave, the cache is calling compute_active_feature_set, which will compute the feature set based on the on-chain accounts that have been created under Feature11111... (ie. "activated"). The cache does this ~128 slots before the epoch boundary to prepare the cache entries for the anticipated upcoming environment.
With the proposed system in place, the cache can still call compute_active_feature_set the same way it does now. The function will just compute the anticipated feature set from the proposed PDA (with stake support) rather than just the presence of on-chain accounts.
In almost every way it's the same as what we have now. The subtle difference is that someone can no longer activate a feature after the cache has done its feature-set computation, but nodes could still cast votes (signals) on a feature, bringing it to "eligible" status. This edge case exists today and just causes cache preparation/recompilation to be moot.
What @CriesofCarrots is suggesting is that we don't explicitly architect this mechanism around the cache, which is sensible, especially considering the cache preparation may change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lichtso do you still have concerns here? I'd like to move this SIMD along.
4d2b500 to
5ca6da6
Compare
Co-authored-by: Tyera <[email protected]>
b14cad2 to
50f2142
Compare
|
@buffalojoec @Lichtso It looks like you're in agreement on the last discussion. Is there anything outstanding on the PR? |
| The signal account is a PDA derived from the validator vote account address | ||
| to prevent double-signaling. | ||
|
|
||
| ``` | ||
| "support_signal" + vote_address | ||
| ``` | ||
|
|
||
| ```rust | ||
| /// Feature support signal state. | ||
| /// | ||
| /// All `u64` integers are represented in little-endian. | ||
| struct SupportSignal { | ||
| /// The epoch this node last submitted a support signal. | ||
| epoch: u64, | ||
| } | ||
| ``` | ||
|
|
||
| The `epoch` value serves as a reference point for preventing double-signaling | ||
| by ensuring the `epoch` value is less than the current Clock epoch at the time | ||
| of signaling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to throw a spanner in at the last minute, but I was just thinking about how to make it easy for each validator to check that they have sent a support signal for a feature, and this has successfully landed on-chain.
I think it would be super easy if the pda was actually "support_signal" + vote_address + "/" + feature_gate_pubkey. Then each validator, at the start of each slot, can simply look up the PDA for each inactive feature and send any missing support signals.
This would also make it easy to inspect the system, as we can see what's going on on-chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense and does make it easier to inspect if you have already voted for that feature.
One thought, though, is in the support_signal + vote_address case, we have only one PDA support signal account for multiple features that potentially we have to keep updating with the new epoch, but now we have a PDA per feature gate, and if multiple are scheduled, we will be sending a vote for each feature gate per epoch until accepted. We will unlikely have multiple feature gates enabled per epoch, but we do up the potential number of on-chain txs needed.
Having said that, support_signal + vote_address + / + feature account is a neater design imho than having one support signal for multiple feature gates. We could then use epoch from that PDA to mean they support it from that point onwards, and then they do not need to reconfirm each epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah my main concern with that would be an increase in fees - but that's only if we move to activate more than one feature per epoch. 🙈
If one day we start to make features smaller and activate more often, we'd be looking at a redesign if we wanted to avoid paying tx fee + account rent for each feature.
What about just storing the signal itself in the PDA alongside the epoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean having a list of features that you are signaling for inside the struct SupportSignal right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even just the bitmask that was sent should do. A node querying can just check that against the stage PDA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@buffalojoec do you think it's likely that we start activating multiple features per epoch? I think it's unlikely that the feature cadence increases much more than it is today. Also, rent is being lowered so I don't think the fees should be that much of a concern. This could actually result in less work being done - a validator only has to send one support signal per feature, not per epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I didn't realize you were suggesting that they wouldn't have to track per-epoch. That's not bad!
So you just signal for a feature once and you never have to do it again, and if you ever want to revoke your signal you can. That's pretty smooth for the nodes.
The tallying would still work the same way, it just wouldn't track epochs at all, just tallies per feature ID, right? We may need to adjust the data structures involved, but it would be mechanically similar, just simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the feature is active, we would just tally it towards every future epoch until the feature is activated/revoked or the signal is revoked. That's a neat design.
We also would need to adjust SignalSupportForStagedFeatures() to now pass in all the PDAs, right? I wonder if we should have the feature name or pubkey inside the SupportSignal struct, so that it is easier to match up the PDA support signal and feature gates, or I guess it's not too bad to do the computation on-chain
| feature accounts after the first three required accounts. | ||
|
|
||
| ``` | ||
| Instruction: SignalSupportForStagedFeatures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these SignalSupportForStagedFeatures transactions be treated as regular ones or fast-tracked in a similar way to vote transactions? I assume the former
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, former. I had no intention of special-casing them.
|
Hey @buffalojoec, this PR has been inactive for 60 days and is now marked as stale. If you're still working on this, please comment below to let us know. Otherwise, this PR will be automatically closed in 7 days. To keep this PR active, simply leave a comment explaining the current status or any blockers you're facing. |

This is SIMD 3/4 expected for Multi-Client Feature Gates. See #76
Summary
This SIMD outlines a proposal for automating the feature activation process
based on a stake-weighted support threshold, rather than manual human action.
With this new process, contributors no longer have to assess stake support for a
feature before activation. Instead, the assessment is done by the runtime.
A reference implementation for the Feature Gate program introduced in #89 and mentioned in this SIMD can be found here.