OCPBUGS-65675: externaloidc: return errors when node statuses cannot be used to determine oidc state #801
Conversation
@liouk: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Added HasSynced gating and stricter kube-apiserver node CurrentRevision validation to OIDC availability checks; introduced a reusable test fake informer with configurable HasSynced; and wired AuthConfigChecker informers into multiple controllers and the operator workload wiring.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
71-78: LGTM! Logic correctly filters invalid revisions. The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:

```diff
-	numNodesWithEmptyRevision += 1
+	numNodesWithEmptyRevision++
```

The `++` operator is more idiomatic in Go for simple increments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (1 hunks)
- pkg/controllers/common/external_oidc_test.go (1 hunks)
🔇 Additional comments (6)
pkg/controllers/common/external_oidc.go (3)
80-82: Good validation: catch missing node status data early. Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.
84-86: Excellent validation: reject partial/invalid node data. Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.
88-90: Approve defensive check, though technically unreachable. This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

- If `len(kas.Status.NodeStatuses) == 0`, lines 80-82 return early.
- If all nodes have `CurrentRevision <= 0`, lines 84-86 return early.
- If any nodes have `CurrentRevision > 0`, `observedRevisions` will have entries.

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.
pkg/controllers/common/external_oidc_test.go (3)
35-36: LGTM! Test correctly expects error for missing node statuses. The updated expectation aligns with the new validation in `OIDCAvailable()` that returns an error when no node statuses are found.
37-47: LGTM! Test coverage for partial zero revisions. This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.

48-58: LGTM! Test coverage for all zero revisions. This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.
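The validation behavior these three test cases exercise can be sketched as a stand-alone helper (hypothetical names and structure, not the operator's actual function):

```go
package main

import (
	"errors"
	"fmt"
)

// validateRevisions sketches the node-status checks under test: error when
// there are no node statuses at all, and error when any node reports a
// non-positive CurrentRevision. Hypothetical helper for illustration only.
func validateRevisions(revisions []int32) error {
	if len(revisions) == 0 {
		return errors.New("no node statuses found")
	}
	for _, r := range revisions {
		if r <= 0 {
			return errors.New("some nodes do not have a valid CurrentRevision")
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRevisions([]int32{11, 0}))  // partial zero revisions: error
	fmt.Println(validateRevisions([]int32{0, 0}))   // all zero revisions: error
	fmt.Println(validateRevisions([]int32{11, 12})) // all valid: <nil>
}
```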
/test e2e-oidc-techpreview
```go
if len(kas.Status.NodeStatuses) == 0 {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
}
```
Could we move this before the for loop that iterates through the node statuses?
```go
observedRevisions := sets.New[int32]()
numNodesWithEmptyRevision := 0
```
Do we need to track this with a counter-like variable?
Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?
> Do we need to track this with a counter-like variable?

We can also use a bool; the only reason was to add it to the log message, but I guess this doesn't add any really useful information. I'll drop this then 👍

> Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?

It's not, because `observedRevisions` tracks unique revisions (it's a set), and this condition would fail if there are nodes on the same revision.
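A quick illustration of why the subtraction breaks down when several nodes share a revision (a plain map stands in for `sets.Set[int32]`; names are illustrative):

```go
package main

import "fmt"

// uniqueRevisions mirrors what observedRevisions tracks: the distinct,
// positive CurrentRevision values across all node statuses.
func uniqueRevisions(revisions []int32) map[int32]struct{} {
	set := map[int32]struct{}{}
	for _, r := range revisions {
		if r > 0 {
			set[r] = struct{}{}
		}
	}
	return set
}

func main() {
	// Three healthy nodes, all settled on revision 7.
	revisions := []int32{7, 7, 7}
	set := uniqueRevisions(revisions)
	// len(revisions)-len(set) == 2 even though no node has an empty
	// revision: the subtraction conflates set deduplication with
	// invalid data, which is the failure mode described above.
	fmt.Println(len(revisions) - len(set)) // prints 2
}
```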
71dfa10 to 4d280bd (force-push)
```go
nodesWithEmptyRevision := false
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision > 0 {
		observedRevisions.Insert(nodeStatus.CurrentRevision)
	} else {
		nodesWithEmptyRevision = true
	}
}

if nodesWithEmptyRevision {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
}
```
If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?
As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.
Suggested change:

```go
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision <= 0 {
		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
	}
	observedRevisions.Insert(nodeStatus.CurrentRevision)
}
```

This replaces the `nodesWithEmptyRevision` flag, the loop, and the post-loop check.
Of course -- now that we don't use the count this is much better 👍
Did you still want to take this suggestion?
It looks like this is still outstanding.
Of course! This one slipped through. Fixed it now.
This PR is to solve the separate issue I saw in another test #798 (comment). Pre-merge tested this and PR #801 together within the cluster-bot. #800 is already Step 1 At 10:51:14, the upgrade completed: Step 3 So the verification fails. @liouk
Added debug logging to investigate the issue found by @xingxingxia.

/hold
90f2f82 to 702bf57 (force-push)
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
79-120: Use a verbose log level for the new debug statements. These `[debug-801]` messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. `klog.V(4)`) or add an explicit verbosity check.

```diff
-	klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+	klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
-	klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+	klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
-	klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+	klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (3 hunks)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go
/jira refresh
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
45ba4f8 to 3eba97f (force-push)
@xingxingxia I've provided a fix for the observed behavior; the issue was that some controllers (the ones that aren't managing any resources, but rather running checks) were not tracking the informers needed to check for OIDC configuration availability. As a result, during upgrade, the informers were being used before having synced. Originally this was done on purpose, in order to avoid the overhead of tracking and reacting to changes in those informers: as these controllers are not actively managing any operands, relying on their next sync was supposedly sufficient. However, I had not anticipated this edge case. Since these informers aren't expected to get changes frequently (two cluster singletons, one configmap informer for the kas namespace), I believe being consistent with synced caches is more important than this overhead. Hence the fix in 3eba97f.
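The gating described above can be sketched in miniature: rather than reading possibly-empty caches during an upgrade, fail fast with an error until every informer reports synced. The names below are illustrative stand-ins, not the operator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// hasSynced is a stand-in for the HasSynced method exposed by a
// cache.SharedIndexInformer; the real checker holds three such informers
// (authentications, kubeapiservers, and KAS-namespace configmaps).
type hasSynced interface{ HasSynced() bool }

type stubInformer struct{ synced bool }

func (s stubInformer) HasSynced() bool { return s.synced }

// oidcAvailable sketches the fix: return an explicit error (instead of a
// silently wrong "false") while any informer cache is still unsynced.
func oidcAvailable(informers ...hasSynced) (bool, error) {
	for _, inf := range informers {
		if !inf.HasSynced() {
			return false, errors.New("informer caches have not synced yet")
		}
	}
	// ...inspect the (now consistent) caches to determine OIDC state...
	return true, nil
}

func main() {
	_, err := oidcAvailable(stubInformer{true}, stubInformer{false})
	fmt.Println(err != nil) // an unsynced cache surfaces as an error, not "false"
}
```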
/retest-required
@liouk: The following test failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
everettraven
left a comment
Looks like there is still one outstanding suggestion.
Other than that, this LGTM.
```go
nodesWithEmptyRevision := false
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision > 0 {
		observedRevisions.Insert(nodeStatus.CurrentRevision)
	} else {
		nodesWithEmptyRevision = true
	}
}

if nodesWithEmptyRevision {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
}
```
Did you still want to take this suggestion?
It looks like this is still outstanding.
49f961c to d6af55f (force-push)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controllers/common/external_oidc_test.go (1)
285-303: Likely typo: duplicateconfig-11configmap in “two nodes ready” scenarioIn the
"oidc getting disabled, rollout in progress, two nodes ready"case theconfigMapsslice contains"config-11"twice and no"config-12":cm("config-11", "config.yaml", kasConfigJSONWithOIDC), cm("config-11", "config.yaml", kasConfigJSONWithOIDC), cm("config-13", "config.yaml", kasConfigJSONWithoutOIDC),Because the indexer keys by name/namespace, the second
"config-11"overwrites the first, and this scenario won’t actually exercise a distinctconfig-12revision despite the surrounding tests and node statuses implying 11/12/13 should all be present. This weakens coverage for the “two nodes ready” disabling rollout.Suggest correcting the second entry to
config-12:- cm("config-11", "config.yaml", kasConfigJSONWithOIDC), + cm("config-12", "config.yaml", kasConfigJSONWithOIDC),
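The overwrite behavior behind this review note is easy to demonstrate with a name-keyed store standing in for the test indexer (the entries below mirror the fixture; the helper is illustrative):

```go
package main

import "fmt"

// addAll mimics a store keyed by object name, like the test indexer:
// a second object with the same key replaces the first instead of
// coexisting with it.
func addAll(entries [][2]string) map[string]string {
	store := map[string]string{}
	for _, e := range entries {
		store[e[0]] = e[1]
	}
	return store
}

func main() {
	store := addAll([][2]string{
		{"config-11", "with OIDC"},
		{"config-11", "with OIDC"}, // presumably meant to be config-12
		{"config-13", "without OIDC"},
	})
	// The duplicated key silently collapses: only 2 entries survive,
	// so the intended third revision is never exercised by the test.
	fmt.Println(len(store)) // prints 2
}
```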
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
94-96: Unreachable code: this check can never be true. With the current logic:

- Line 82 returns if `len(kas.Status.NodeStatuses) == 0`.
- Lines 88-90 return if any `CurrentRevision <= 0`.
- Otherwise, line 91 inserts into `observedRevisions`.

So after the loop, `observedRevisions.Len() >= 1` is guaranteed. This condition can never trigger. Consider removing the dead code:

```diff
-	if observedRevisions.Len() == 0 {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no observed revisions found")
-	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (11)
- pkg/controllers/common/external_oidc.go (2 hunks)
- pkg/controllers/common/external_oidc_test.go (18 hunks)
- pkg/controllers/deployment/deployment_controller.go (2 hunks)
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1 hunks)
- pkg/controllers/ingressstate/ingress_state_controller.go (1 hunks)
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (3 hunks)
- pkg/controllers/proxyconfig/proxyconfig_controller.go (1 hunks)
- pkg/controllers/readiness/wellknown_ready_controller.go (1 hunks)
- pkg/controllers/routercerts/controller_test.go (2 hunks)
- pkg/operator/starter.go (1 hunks)
- test/library/informer.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
- pkg/controllers/readiness/wellknown_ready_controller.go
- pkg/controllers/routercerts/controller_test.go
- pkg/operator/starter.go
- test/library/informer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/controllers/ingressstate/ingress_state_controller.go
- pkg/controllers/deployment/deployment_controller.go
- pkg/controllers/common/external_oidc.go
- pkg/controllers/proxyconfig/proxyconfig_controller.go
- pkg/controllers/common/external_oidc_test.go
🧬 Code graph analysis (3)
pkg/controllers/ingressstate/ingress_state_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/common/external_oidc_test.go (1)
- test/library/informer.go (1): `NewFakeSharedIndexInformerWithSync` (13-18)
🔇 Additional comments (7)
pkg/controllers/ingressstate/ingress_state_controller.go (1)

63-69: Informer wiring correctly gates OIDC checks on synced caches. Hooking `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` into the controller factory's `WithInformers` set cleanly ensures the OIDC-related informers are tracked and must report `HasSynced` before `sync` runs. This aligns with the PR's goal of avoiding upgrade-time races with unsynced caches, without altering existing control flow.

pkg/controllers/common/external_oidc.go (3)
59-69: Appropriate HasSynced guards for the upgrade race condition fix. The upfront sync checks correctly ensure the informer caches are consistent before proceeding, which addresses the root cause of the upgrade-time race described in the PR objectives.
82-90: Node status validation looks correct and addresses prior review feedback. The empty node statuses check is now before the loop, and the early return on invalid `CurrentRevision` terminates the loop immediately as previously suggested.

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
61-61: Correctly wires AuthConfigChecker informers to the controller factory. This ensures the factory waits for the authentication, kubeapiservers, and configmaps informers to sync before invoking `sync()`, which complements the `HasSynced` checks added in `OIDCAvailable()`.

pkg/controllers/common/external_oidc_test.go (2)
23-71: Sync-flagged scenarios and error/availability expectations look correct. The added `authInformerSynced`, `kasInformerSynced`, and `cmInformerSynced` flags, plus the new cases for unsynced informers and invalid/zero node revisions, line up well with the intended behavior: failing fast with errors when you can't reliably infer OIDC state, and otherwise driving availability off the rollout state. No issues from a correctness or maintainability standpoint here.

Also applies to: 243-362

371-399: Informer wiring with `NewFakeSharedIndexInformerWithSync` is sound. Switching the KAS, auth, and configmap informers over to `test.NewFakeSharedIndexInformerWithSync(...)` and updating the indexer keyfuncs to `func(obj any) (string, error)` matches the new informer interfaces and accurately injects `HasSynced` behavior into the tests. This is a clean, maintainable way to reproduce the original upgrade-time race in a controlled manner.

pkg/controllers/deployment/deployment_controller.go (1)
116-133: Informer wiring for AuthConfigChecker looks correct and aligns with PR goals. Factoring cluster-scoped informers into `clusterScopedInformers` and appending `AuthConfigCheckerInformers` cleanly ensures the workload controller now waits on all relevant caches (ingress, proxy, nodes, and OIDC-related informers) before use. This directly addresses the race around unsynced informers without introducing extra complexity or obvious regressions.
Latest push reorganizes some code, no effective change on functionality; verification stands.

/verified by @xingxingxia

@liouk: This PR has been marked as verified by

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Also, make the check fail if informers are not synced to avoid false negatives.
d6af55f to 3265312 (force-push)
/verified by @xingxingxia

@liouk: This PR has been marked as verified by

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1)
56-64: AuthConfigChecker informers correctly wired; consider using the controller field to avoid a duplicate copy. Adding `WithInformers(common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...)` ensures this controller won't call `OIDCAvailable()` before the underlying informers are synced, which addresses the upgrade race you're fixing.

You now have two copies of `AuthConfigChecker` here (the struct field and the local value whose address is passed to `AuthConfigCheckerInformers`). It's safe because both copies hold references to the same underlying informers, but if `AuthConfigChecker` ever gains mutable state, the field and the pointer could diverge. Consider switching the call to use `&controller.authConfigChecker` instead, to keep a single canonical instance.

Please double-check that no future code intends to mutate `AuthConfigChecker` state; if so, updating all similar constructors to pass the struct field pointer will avoid subtle bugs.
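The divergence risk the reviewer describes is a general Go value-semantics point, shown here with a hypothetical mutable field (today's `AuthConfigChecker` has no such field; this is purely illustrative):

```go
package main

import "fmt"

// checker stands in for AuthConfigChecker; today it only holds informer
// references, but suppose it gained a mutable field.
type checker struct{ observedErrors int }

// diverge shows how a struct field and a separate local copy drift apart
// once one of them is mutated through a pointer.
func diverge() (fieldVal, copyVal int) {
	var field checker // the copy stored on the controller struct
	local := field    // the extra copy whose address is passed around
	ptr := &local

	ptr.observedErrors = 1 // mutate through the local copy only

	return field.observedErrors, ptr.observedErrors
}

func main() {
	f, c := diverge()
	fmt.Println(f, c) // prints "0 1": the two copies no longer agree
}
```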
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (11)
pkg/controllers/common/external_oidc.go(2 hunks)pkg/controllers/common/external_oidc_test.go(18 hunks)pkg/controllers/deployment/deployment_controller.go(2 hunks)pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go(1 hunks)pkg/controllers/ingressstate/ingress_state_controller.go(1 hunks)pkg/controllers/oauthendpoints/oauth_endpoints_controller.go(3 hunks)pkg/controllers/proxyconfig/proxyconfig_controller.go(1 hunks)pkg/controllers/readiness/wellknown_ready_controller.go(1 hunks)pkg/controllers/routercerts/controller_test.go(2 hunks)pkg/operator/starter.go(1 hunks)test/library/informer.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/controllers/readiness/wellknown_ready_controller.go
- pkg/controllers/routercerts/controller_test.go
- pkg/controllers/common/external_oidc_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/controllers/proxyconfig/proxyconfig_controller.go
- pkg/operator/starter.go
- pkg/controllers/common/external_oidc.go
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
- test/library/informer.go
- pkg/controllers/deployment/deployment_controller.go
- pkg/controllers/ingressstate/ingress_state_controller.go
🧬 Code graph analysis (5)
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/operator/starter.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (2)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go (1): `NewEndpointAccessibleController` (38-67)

pkg/controllers/deployment/deployment_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/ingressstate/ingress_state_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)
🔇 Additional comments (7)
pkg/operator/starter.go (1)
133-137: OIDC informers now consistently tracked by workload, static resources, and APIService controllers. Wiring `AuthConfigCheckerInformers` into:

- the static resource controller via `AddInformer`,
- the OAuth API server workload via the `WithWorkloadController` extra informer slice, and
- the APIService controller via the trailing variadic informers,

makes the controllers that consult `oidcAvailable`/`OIDCAvailable()` wait for the same caches before syncing. This aligns with the race fix described in the PR and should prevent transient misclassification of OIDC state during upgrades.

The `append([]factory.Informer{authenticationOperatorClient.Informer()}, common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...)...` pattern also keeps the operator client informer in the trigger set while extending it cleanly.

Please re-run the external OIDC and upgrade tests (e.g. the e2e-oidc suites you referenced in the PR) to confirm there are no new transient Degraded/Available flips in these controllers now that they depend on the synced AuthConfigChecker informers.
Also applies to: 172-174, 479-499, 583-584
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
56-63: ProxyConfig checker correctly gated on AuthConfigChecker informers. Including `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` in the controller's informers set matches the new `OIDCAvailable()` behavior: the factory won't invoke `sync` until these informers report `HasSynced()`, so the proxy checker no longer risks reporting errors based on partially populated OIDC/KAS/configmap caches.

Please ensure existing proxy configuration tests (especially around NO_PROXY and external OIDC) still pass, since errors from `OIDCAvailable()` will now surface as Degraded via `WithSyncDegradedOnError`.
OIDCAvailable()will now surface as Degraded viaWithSyncDegradedOnError.Also applies to: 75-81
pkg/controllers/deployment/deployment_controller.go (1)
116-122: Cluster-scoped informer slice refactor + OIDC informers looks solid. Creating `clusterScopedInformers` and appending `AuthConfigCheckerInformers` before passing them into `workload.NewController` both improves readability (one place to see all cluster-wide triggers) and guarantees that the oauth-server workload controller waits for the OIDC/KAS/configmap caches it queries via `authConfigChecker`.

No functional issues stand out; the slice composition and variadic use are idiomatic.

It would be good to confirm that any tests asserting oauth-server deployment behavior around OIDC transitions (e.g., when disabling the integrated OAuth server) still behave as expected with this extended informer set.
Also applies to: 123-135
pkg/controllers/ingressstate/ingress_state_controller.go (1)
63-71: IngressState controller now correctly tracks OIDC-related informers. Adding `AuthConfigCheckerInformers` to the controller's informer set aligns with the stricter `OIDCAvailable()` implementation. The controller will only attempt the OIDC-based short-circuit (and subsequent ingress endpoints health evaluation) once the authentication, kube-apiserver, and relevant configmap caches are in sync, which should eliminate the transient upgrade-time status flips you were seeing.

Please verify that the ingress endpoints-related Degraded conditions during upgrade behave as expected now (i.e., no spurious flips when caches are still warming).
Also applies to: 83-91
pkg/controllers/common/external_oidc.go (1)
46-52: OIDCAvailable now robustly validates informer sync and kube-apiserver node status. The tightened `OIDCAvailable()` logic looks correct and aligns with the upgrade-race fix:

- `HasSynced()` checks for all three informers (Authentications, KubeAPIServers, and KAS namespace ConfigMaps) ensure you never evaluate OIDC state against partially populated caches.
- Explicitly erroring when `kas.Status.NodeStatuses` is empty and when any node has `CurrentRevision <= 0` prevents ambiguous "false but healthy" outcomes when the apiserver rollout has not yet produced valid node status entries.
- The existing per-revision checks (auth-config and config ConfigMaps plus config.yaml content) now operate only on validated revisions, which keeps the success path semantically unchanged while removing the old silent-success-on-empty-observed-set behavior.

This should turn the previous transient misclassifications into clear, actionable errors while the new informer wiring prevents those errors from appearing once caches are actually synced.

Given the broader behavior change (errors instead of silent `false` in several edge cases), please confirm that all controllers and tests consuming `OIDCAvailable()` (including the endpoint-accessible controllers and any helper like `oidcAvailable(...)` in this package) correctly surface these errors as Degraded or log them as intended, and that you have coverage for:

- empty `NodeStatuses`,
- nodes with non-positive `CurrentRevision`,
- informers not yet synced.
Also applies to: 59-69, 77-92, 94-118
pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (1)
51-75: OAuth endpoint health controllers correctly extended with AuthConfigChecker informers. For all three controllers (`NewOAuthRouteCheckController`, `NewOAuthServiceCheckController`, and `NewOAuthServiceEndpointsCheckController`), constructing a local `[]factory.Informer` with the existing route/service/endpoints/configmap triggers and then appending `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` cleanly ties the endpoint-accessible controllers to the same caches that `OIDCAvailable()` inspects.

Combined with the stricter `OIDCAvailable()` preconditions, this should eliminate the previous situation where endpoint health checks ran against unsynced authentication/KAS/configmap state and briefly flipped conditions during upgrades.

Please confirm that the endpoint-accessible controllers still behave as expected when OIDC is disabled (checks run) vs enabled (checks short-circuit via `endpointCheckDisabledFunc`), especially around upgrades where informers are catching up.
endpointCheckDisabledFunc), especially around upgrades where informers are catching up.Also applies to: 84-106, 116-138
test/library/informer.go (1)
8-38: Test helper cleanly exposes configurable HasSynced for informers. `FakeSharedIndexInformerWithSync` is a straightforward way to decouple the lister from an informer whose `HasSynced()` behavior can be controlled in tests, while still reusing `v1helpers.NewFakeSharedIndexInformer()` for the underlying implementation. This should make it much easier to exercise the new `OIDCAvailable()` sync gating logic without impacting production code paths.

When you wire this helper into tests for `AuthConfigChecker` and the controllers using `AuthConfigCheckerInformers`, please ensure you cover both `hasSynced = false` (expecting errors / no sync) and `hasSynced = true` (normal behavior) to validate the new race protections.
everettraven
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: everettraven, liouk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@liouk: Jira Issue Verification Checks: Jira Issue OCPBUGS-65675

Jira Issue OCPBUGS-65675 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira backport release-4.20
@xingxingxia: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-ci-robot: new pull request created: #814

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No description provided.