-
Notifications
You must be signed in to change notification settings - Fork 530
CORS-3440: IngressController subnet selection at installation #1634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORS-3440: IngressController subnet selection at installation #1634
Conversation
|
@gcs278: This pull request references CORS-3440 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.17.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
84a2ab7 to
d6a259a
Compare
6a9976a to
12b9995
Compare
|
@gcs278: This pull request references CORS-3440 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.17.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
12b9995 to
8e71d45
Compare
|
/assign |
|
/assign @miheer |
|
|
||
| This enhancement also deprecates the existing install-config field `platform.aws.subnets` | ||
| in favor for a more flexible configuration field that handles both IngressController subnet | ||
| selection (ingress subnets) and cluster resource subnet selection (cluster subnets). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have the latter as a use case. Why is it being covered in this proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed in slack, but for completeness: it's an existing feature, the platform.aws.subnets field in the install-config. I'm giving this type of subnet a proper name for clarity.
Note that I also also changed the "cluster resource subnet" / "cluster subnet" to be "cluster-role subnets", and "ingress subnets" to "IngressController-role subnets" and have included a new ## Definitions and Terminology section early in the EP to define these, as well as my motivation for why I am call them as such.
| AZs than the cluster subnets, the default IngressController may map to subnets | ||
| in the AZs that are not part of the cluster. This creates a security risk | ||
| because the load balancer is mapping to subnets that may belong to other clusters. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| AZs than the cluster subnets, the default IngressController may map to subnets | |
| in the AZs that are not part of the cluster. This creates a security risk | |
| because the load balancer is mapping to subnets that may belong to other clusters. This | |
| AZs than the cluster subnets, there is no mechanism for the default IngressController | |
| to know which subnets and AZs are a part of the cluster. | |
| This lack of discoverable information creates a security risk | |
| because the load balancer may map to subnets that may belong to other clusters. This |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a bit more nuanced than there is no mechanism for the default IngressController to know which subnets and AZs are a part of the cluster.
There is a mechanism, it's the kubernetes.io/cluster/<cluster-id> tag for the subnet. However, it's a combination of the two facts:
- The cluster-admins may not have the ability to change this subnet tag.
- The AWS CCM is very generous in it's selection subnets. If there is no
kubernetes.io/cluster/<cluster-id>, it selects it. Only when there is an non-matchingkubernetes.io/cluster/<cluster-id>for another cluster, it will ignore it.
These two reasons may lead to a situation where the IngressController/LoadBalancer is picking up unwanted subnets, and those unwanted subnets could be public subnets when you have a internal/private IngressController.
I realize this subnet selection info is probably worth adding to the EP somewhere, and I'll try to capture your suggestion to be more specific.
|
Very detailed enhancement. Minor comments inline. |
gcs278
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @patrickdillon @JoelSpeed @sadasu, comments should be addressed and ready for re-review.
| A user-created IngressController is a custom IngressController that is created after installation (Day 2). | ||
| It represents all IngressControllers except for the `default` IngressController. | ||
|
|
||
| ### Defining Subnets Roles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed in slack.
I now see @patrickdillon's point that separating the two into external & internal, it can provide the opportunity for encoding the desired scope of the subnet for future installer-provisioned VPCs (not BYO) use cases. Though, as we discussed, we still may want to have the user explicitly provide the scope of the subnet via platform.aws.vpc.subnets.type for installer-provisioned subnets.
Either way, I think we have options for the future, and considering these scenarios is a good way to test extensibility of the current design proposal.
| - It selects subnets that lack the `kubernetes.io/cluster/<cluster-id>` tag ([OCPBUGS-17432](https://issues.redhat.com/browse/OCPBUGS-17432)) | ||
| - It doesn't support use cases such as [Dedicated Load Balancer Subnets User Story](#dedicated-load-balancer-subnets-user-story) | ||
|
|
||
| The goal of this enhancement is not to completely bypass or replace subnet discovery by hardcoding the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I agree, but technically it is already within the motivation section as a sub-section. My intention was to introduce subnet discovery, then talk about the goals/motivation for it, let me know if that doesn't make sense.
| It's important to note that while the installer will reject additional untagged subnets during installation, | ||
| a cluster admin can still add untagged subnets after installation (Day 2), which may cause IngressControllers | ||
| to automatically use these new subnets. This proposal does not attempt to prevent this type of misconfiguration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When considering the problems with installer validation of this, one thing that came to mind was whether the ingress controller could emit some sort of alert in this case. I don't know much about this topic, so not sure if it's viable/appropriate.
@patrickdillon Yea it's an interesting idea. I think the alert is the right direction, but I'm not sure where the right place to put the logic. Ingress Operator kinda makes sense, but it'd need to reconcile AWS Subnets, which feels a bit of a stretch given it's current responsibilities. Not impossible. I think this something we'd punt on until we have a need.
Is this accurate?
@sadasu Correct, that's the goal.
JoelSpeed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing blocking this merging from my perspective, LGTM
Picked up one thought while doing a final pass
| [Reject BYO VPC Installations that Contain Untagged Subnets](#reject-byo-vpc-installations-that-contain-untagged-subnets), | ||
| are not inadvertently applied to the old API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this feature is introduced in a y-stream, do we consider this behaviour a breaking change still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question, but I'm not sure if there is a clear answer. From my perspective it depends on how customers would interpret the validation, where they could say either:
a. I've been using this VPC in prior versions and am surprised to discover ingress LBs are getting sent to subnets I didn't specify; or,
b. I've been using this VPC to install prior versions and it's always worked, now it's broken. I don't care where the ingress controller goes. More work! 😭
I don't have any insight which way the responses would fall. I think we have a good justification for enforcing the validation on the old field, but the hesitation is that we "break" existing workflows. Personally I have a weakly held opinion that the slow introduction of the validation through the new field is all we need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @patrickdillon response. If we did enforce it on the "old" API (platform.aws.subnets), compatibility is a grey area. Though there is a workaround (aka tag the subnets), we still would be changing installer behavior, so strictly speaking, it doesn't feel like a compatible update.
But to be clear, I wasn't intending for this validation to happen when the old (platform.aws.subnets) field is specified, only when the new field (platform.aws.vpc.subnets) is specified. So, as the EP is currently written, I don't see this validation as a breaking change in any x-y-z stream update because it's only enforced with the new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh totally, as the EP stands today there's no breaking changes, everything is very safe. Contrary to normal, I was actually wondering if we are being overly cautious here. It's fine as it is, but I don't think it would cause too much outrage if we did enforce the new validation (and therefore risk fewer bugs in the future?) on the older API too, as long as it's a Y stream change.
We have made much larger Y stream changes in the installer in the past (see Terraform to CAPI)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine as it is, but I don't think it would cause too much outrage if we did enforce the new validation (and therefore risk fewer bugs in the future?) on the older API too, as long as it's a Y stream change.
Fair point. Though I like the safety of requiring users to "opt in" to the new validation via the new API, but I guess your point is that Y-streams already require some extra care from our customers. I can go either way if anyone feels strongly.
|
/approve Great job @gcs278! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/unassign @miheer |
mtulio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gcs278 great work!!! 🥇 I left a few comments to increase consistency in edge zones, feel free if you want to change in a follow up PR or now, anyway LGTM. Holding just to prevent auto-merge before your decision, feel free to drop it. :)
/lgtm
/hold
| All public and private subnets specified in the `platform.aws.vpc.subnets` field, regardless of their role, must be | ||
| tagged with the `kubernetes.io/cluster/<cluster-id>: shared` tag. Currently, edge subnets are not tagged by the | ||
| installer and should continue to be excluded. These tags assist the AWS CCM in identifying which subnets to use | ||
| during [AWS Subnet Discovery](#aws-subnet-discovery) for both the cluster being installed and any other clusters | ||
| that may be installed within the same VPC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, edge subnets are not tagged by the installer and should continue to be excluded.
Well, well, this is a good call! :)
Why not we are tagging kube cluster tag as shared edge subnets in BYO VPC deployments?! Just validated that indeed it isn't tagged as shared in BYO VPC, and full IPI is tagged by owned. Trying to rescue in top of my mind if there was restrictions in CCM, but I can't recall, specially only BYO VPC - IIRC we are skipping edge in the CCM discovery as network-based LB isn't supported: https://github.com/kubernetes/cloud-provider-aws/pull/499/files#diff-97a31d0f193ea0b4e2c538595daf93405351054377458053a5010f6fdb97f4dcR3547-R3551
I just file a bug to review that: https://issues.redhat.com/browse/OCPBUGS-48827
I believe we can keep this text here as this is the current behavior, if you want to reference the OCPBUGS would be nice. :)
fwiw @patrickdillon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow interesting way to expose a potential bug. Good catch!
Updated with a link.
| - type: IngressControllerLB | ||
| - type: ControlPlaneExternalLB | ||
| - type: Bootstrap | ||
| - id: subnet-0fcf8e0392f0910e0 # Private / Local Zone us-east-dca-1a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if dca Local Zone exists, suggest to keep a valid just for consistency:
| - id: subnet-0fcf8e0392f0910e0 # Private / Local Zone us-east-dca-1a | |
| - id: subnet-0fcf8e0392f0910e0 # Private / Local Zone us-east-1-dfw-1a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the public edge subnets is also valid, the installer will not use it when creating machineset manifests, but it can be used in day-2 to deploy ingress (eg using ALBO/ALBC), or even deploying nodes in public subnets when patching manifests[2] (useful in edge zones where ALB isn't supported yet). Commenting here just to make sure we'll not apply restrictions/validations to private-only on edge subnets:
- id: subnet-0fcf8e0392f0910e1 # Public / Local Zone us-east-1-dfw-1a
roles:
- type: EdgeNodeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I wasn't aware that we allowed edge nodes to use public subnets currently. I will add a note about that.
Adds enhancements/installer/aws-lb-subnet-selection.md for specifying IngressController's LoadBalancer-type Service subnets at installation time. This enhancement also deprecates the platform.aws.subnets field for a more flexible and extensible API structure.
416bd1f to
30f44ee
Compare
gcs278
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @mtulio. I've also squashed commits since we seem like reviews are slowing down.
| All public and private subnets specified in the `platform.aws.vpc.subnets` field, regardless of their role, must be | ||
| tagged with the `kubernetes.io/cluster/<cluster-id>: shared` tag. Currently, edge subnets are not tagged by the | ||
| installer and should continue to be excluded. These tags assist the AWS CCM in identifying which subnets to use | ||
| during [AWS Subnet Discovery](#aws-subnet-discovery) for both the cluster being installed and any other clusters | ||
| that may be installed within the same VPC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow interesting way to expose a potential bug. Good catch!
Updated with a link.
| [Reject BYO VPC Installations that Contain Untagged Subnets](#reject-byo-vpc-installations-that-contain-untagged-subnets), | ||
| are not inadvertently applied to the old API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine as it is, but I don't think it would cause too much outrage if we did enforce the new validation (and therefore risk fewer bugs in the future?) on the older API too, as long as it's a Y stream change.
Fair point. Though I like the safety of requiring users to "opt in" to the new validation via the new API, but I guess your point is that Y-streams already require some extra care from our customers. I can go either way if anyone feels strongly.
| - type: IngressControllerLB | ||
| - type: ControlPlaneExternalLB | ||
| - type: Bootstrap | ||
| - id: subnet-0fcf8e0392f0910e0 # Private / Local Zone us-east-dca-1a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I wasn't aware that we allowed edge nodes to use public subnets currently. I will add a note about that.
|
/hold cancel |
|
Opps sorry @patrickdillon, I wanted to make sure @candita got a look. I should have communicated that here. /hold |
|
@gcs278: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| The installer must not allow **any** public IngressControllerLB subnets for private clusters (`installconfig.publish: Internal`) | ||
| as this is a security risk as discussed in the [Do Not Use Untagged Subnets User Stories](#do-not-use-untagged-subnets-user-stories) section. | ||
|
|
||
| Conversely, the installer must now allow **any** private IngressControllerLB subnets for public clusters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gcs278 Adding this for future reference and removing the hold to signal my review. If I'm wrong about this, please let me know.
| Conversely, the installer must now allow **any** private IngressControllerLB subnets for public clusters | |
| Conversely, the installer must not allow **any** private IngressControllerLB subnets for public clusters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, yea that was a typo, good catch. If I have time tomorrow I might follow up with a tiny pr to fix this.
|
Thanks for another stellar enhancement proposal @gcs278! |
Adds enhancements/installer/aws-lb-subnet-selection.md for specifying IngressController's LoadBalancer-type Service subnets at installation time.
This enhancement also deprecates the platform.aws.subnets field for a more flexible and extensible API structure.