OCPBUGS-65675: externaloidc: return errors when node statuses cannot be used to determine oidc state #801
Conversation
@liouk: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Added HasSynced gating and stricter kube-apiserver node CurrentRevision validation to OIDC availability checks; introduced a reusable test fake informer with configurable HasSynced; and wired AuthConfigChecker informers into multiple controllers and the operator workload wiring.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
71-78: LGTM! Logic correctly filters invalid revisions. The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:

```diff
-	numNodesWithEmptyRevision += 1
+	numNodesWithEmptyRevision++
```

The `++` operator is more idiomatic in Go for simple increments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (1 hunks)
- pkg/controllers/common/external_oidc_test.go (1 hunks)
🔇 Additional comments (6)
pkg/controllers/common/external_oidc.go (3)
80-82: Good validation: catch missing node status data early. Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.
84-86: Excellent validation: reject partial/invalid node data. Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.
88-90: Approve defensive check, though technically unreachable. This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

- If `len(kas.Status.NodeStatuses) == 0`, lines 80-82 return early.
- If all nodes have `CurrentRevision <= 0`, lines 84-86 return early.
- If any nodes have `CurrentRevision > 0`, `observedRevisions` will have entries.

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.
pkg/controllers/common/external_oidc_test.go (3)
35-36: LGTM! Test correctly expects error for missing node statuses. The updated expectation aligns with the new validation in `OIDCAvailable()` that returns an error when no node statuses are found.
37-47: LGTM! Test coverage for partial zero revisions. This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.

48-58: LGTM! Test coverage for all zero revisions. This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.
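The validation behavior these three test cases exercise can be sketched as a stand-alone helper (hypothetical names and structure, not the operator's actual function):

```go
package main

import (
	"errors"
	"fmt"
)

// validateRevisions sketches the node-status checks under test: error when
// there are no node statuses at all, and error when any node reports a
// non-positive CurrentRevision. Hypothetical helper for illustration only.
func validateRevisions(revisions []int32) error {
	if len(revisions) == 0 {
		return errors.New("no node statuses found")
	}
	for _, r := range revisions {
		if r <= 0 {
			return errors.New("some nodes do not have a valid CurrentRevision")
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRevisions([]int32{11, 0}))  // partial zero revisions: error
	fmt.Println(validateRevisions([]int32{0, 0}))   // all zero revisions: error
	fmt.Println(validateRevisions([]int32{11, 12})) // all valid: <nil>
}
```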
/test e2e-oidc-techpreview
```go
if len(kas.Status.NodeStatuses) == 0 {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
}
```
Could we move this before the for loop that iterates through the node statuses?
```go
observedRevisions := sets.New[int32]()
numNodesWithEmptyRevision := 0
```
Do we need to track this with a counter-like variable?
Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?
> Do we need to track this with a counter-like variable?

We can also use a bool; the only reason was to add it to the log message, but I guess this doesn't add any really useful information. I'll drop this then 👍

> Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?

It's not, because `observedRevisions` tracks unique revisions (it's a set), and this condition would fail if there are nodes on the same revision.
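A quick illustration of why the subtraction breaks down when several nodes share a revision (a plain map stands in for `sets.Set[int32]`; names are illustrative):

```go
package main

import "fmt"

// uniqueRevisions mirrors what observedRevisions tracks: the distinct,
// positive CurrentRevision values across all node statuses.
func uniqueRevisions(revisions []int32) map[int32]struct{} {
	set := map[int32]struct{}{}
	for _, r := range revisions {
		if r > 0 {
			set[r] = struct{}{}
		}
	}
	return set
}

func main() {
	// Three healthy nodes, all settled on revision 7.
	revisions := []int32{7, 7, 7}
	set := uniqueRevisions(revisions)
	// len(revisions)-len(set) == 2 even though no node has an empty
	// revision: the subtraction conflates set deduplication with
	// invalid data, which is the failure mode described above.
	fmt.Println(len(revisions) - len(set)) // prints 2
}
```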
71dfa10 to 4d280bd (force-push)
```go
nodesWithEmptyRevision := false
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision > 0 {
		observedRevisions.Insert(nodeStatus.CurrentRevision)
	} else {
		nodesWithEmptyRevision = true
	}
}

if nodesWithEmptyRevision {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
}
```
If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?
As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.
Suggested change:

```go
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision <= 0 {
		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
	}
	observedRevisions.Insert(nodeStatus.CurrentRevision)
}
```

This replaces the `nodesWithEmptyRevision` flag, the loop, and the post-loop check.
Of course -- now that we don't use the count this is much better 👍
Did you still want to take this suggestion?
It looks like this is still outstanding.
Of course! This one slipped through. Fixed it now.
This PR is to solve the separate issue I saw in another test #798 (comment). Pre-merge tested this and PR #801 together within the cluster-bot. #800 is already Step 1 At 10:51:14, the upgrade completed: Step 3 So the verification fails. @liouk
Added debug logging to investigate the issue found by @xingxingxia.

/hold
90f2f82 to 702bf57 (force-push)
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
79-120: Use a verbose log level for the new debug statements. These `[debug-801]` messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. `klog.V(4)`) or add an explicit verbosity check.

```diff
-	klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+	klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
-	klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+	klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
-	klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+	klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (3 hunks)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go
/jira refresh
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
45ba4f8 to 3eba97f (force-push)
@xingxingxia I've provided a fix for the observed behavior; the issue was that some controllers (the ones that aren't managing any resources, but rather running checks) were not tracking the informers needed to check for OIDC configuration availability. As a result, during upgrade, the informers were being used before having synced. Originally this was done on purpose, in order to avoid the overhead of tracking and reacting to changes in those informers: as these controllers are not actively managing any operands, relying on their next sync was supposedly sufficient. However, I had not anticipated this edge case. Since these informers aren't expected to get changes frequently (two cluster singletons, one configmap informer for the kas namespace), I believe being consistent with synced caches is more important than this overhead. Hence the fix in 3eba97f.
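The gating described above can be sketched in miniature: rather than reading possibly-empty caches during an upgrade, fail fast with an error until every informer reports synced. The names below are illustrative stand-ins, not the operator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// hasSynced is a stand-in for the HasSynced method exposed by a
// cache.SharedIndexInformer; the real checker holds three such informers
// (authentications, kubeapiservers, and KAS-namespace configmaps).
type hasSynced interface{ HasSynced() bool }

type stubInformer struct{ synced bool }

func (s stubInformer) HasSynced() bool { return s.synced }

// oidcAvailable sketches the fix: return an explicit error (instead of a
// silently wrong "false") while any informer cache is still unsynced.
func oidcAvailable(informers ...hasSynced) (bool, error) {
	for _, inf := range informers {
		if !inf.HasSynced() {
			return false, errors.New("informer caches have not synced yet")
		}
	}
	// ...inspect the (now consistent) caches to determine OIDC state...
	return true, nil
}

func main() {
	_, err := oidcAvailable(stubInformer{true}, stubInformer{false})
	fmt.Println(err != nil) // an unsynced cache surfaces as an error, not "false"
}
```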
/retest-required
@liouk: The following test failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@liouk: This pull request references Jira Issue OCPBUGS-65675, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
everettraven
left a comment
Looks like there is still one outstanding suggestion.
Other than that, this LGTM.
```go
nodesWithEmptyRevision := false
for _, nodeStatus := range kas.Status.NodeStatuses {
	if nodeStatus.CurrentRevision > 0 {
		observedRevisions.Insert(nodeStatus.CurrentRevision)
	} else {
		nodesWithEmptyRevision = true
	}
}

if nodesWithEmptyRevision {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
}
```
Did you still want to take this suggestion?
It looks like this is still outstanding.
49f961c to d6af55f (force-push)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controllers/common/external_oidc_test.go (1)
285-303: Likely typo: duplicateconfig-11configmap in “two nodes ready” scenarioIn the
"oidc getting disabled, rollout in progress, two nodes ready"case theconfigMapsslice contains"config-11"twice and no"config-12":cm("config-11", "config.yaml", kasConfigJSONWithOIDC), cm("config-11", "config.yaml", kasConfigJSONWithOIDC), cm("config-13", "config.yaml", kasConfigJSONWithoutOIDC),Because the indexer keys by name/namespace, the second
"config-11"overwrites the first, and this scenario won’t actually exercise a distinctconfig-12revision despite the surrounding tests and node statuses implying 11/12/13 should all be present. This weakens coverage for the “two nodes ready” disabling rollout.Suggest correcting the second entry to
config-12:- cm("config-11", "config.yaml", kasConfigJSONWithOIDC), + cm("config-12", "config.yaml", kasConfigJSONWithOIDC),
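The overwrite behavior behind this review note is easy to demonstrate with a name-keyed store standing in for the test indexer (the entries below mirror the fixture; the helper is illustrative):

```go
package main

import "fmt"

// addAll mimics a store keyed by object name, like the test indexer:
// a second object with the same key replaces the first instead of
// coexisting with it.
func addAll(entries [][2]string) map[string]string {
	store := map[string]string{}
	for _, e := range entries {
		store[e[0]] = e[1]
	}
	return store
}

func main() {
	store := addAll([][2]string{
		{"config-11", "with OIDC"},
		{"config-11", "with OIDC"}, // presumably meant to be config-12
		{"config-13", "without OIDC"},
	})
	// The duplicated key silently collapses: only 2 entries survive,
	// so the intended third revision is never exercised by the test.
	fmt.Println(len(store)) // prints 2
}
```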
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
94-96: Unreachable code: this check can never be true. With the current logic:

- Line 82 returns if `len(kas.Status.NodeStatuses) == 0`.
- Lines 88-90 return if any `CurrentRevision <= 0`.
- Otherwise, line 91 inserts into `observedRevisions`.

So after the loop, `observedRevisions.Len() >= 1` is guaranteed. This condition can never trigger. Consider removing the dead code:

```diff
-	if observedRevisions.Len() == 0 {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no observed revisions found")
-	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (11)
- pkg/controllers/common/external_oidc.go (2 hunks)
- pkg/controllers/common/external_oidc_test.go (18 hunks)
- pkg/controllers/deployment/deployment_controller.go (2 hunks)
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1 hunks)
- pkg/controllers/ingressstate/ingress_state_controller.go (1 hunks)
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (3 hunks)
- pkg/controllers/proxyconfig/proxyconfig_controller.go (1 hunks)
- pkg/controllers/readiness/wellknown_ready_controller.go (1 hunks)
- pkg/controllers/routercerts/controller_test.go (2 hunks)
- pkg/operator/starter.go (1 hunks)
- test/library/informer.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
- pkg/controllers/readiness/wellknown_ready_controller.go
- pkg/controllers/routercerts/controller_test.go
- pkg/operator/starter.go
- test/library/informer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/controllers/ingressstate/ingress_state_controller.go
- pkg/controllers/deployment/deployment_controller.go
- pkg/controllers/common/external_oidc.go
- pkg/controllers/proxyconfig/proxyconfig_controller.go
- pkg/controllers/common/external_oidc_test.go
🧬 Code graph analysis (3)
pkg/controllers/ingressstate/ingress_state_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/common/external_oidc_test.go (1)
- test/library/informer.go (1): `NewFakeSharedIndexInformerWithSync` (13-18)
🔇 Additional comments (7)
pkg/controllers/ingressstate/ingress_state_controller.go (1)

63-69: Informer wiring correctly gates OIDC checks on synced caches. Hooking `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` into the controller factory's `WithInformers` set cleanly ensures the OIDC-related informers are tracked and must report `HasSynced` before `sync` runs. This aligns with the PR's goal of avoiding upgrade-time races with unsynced caches, without altering existing control flow.

pkg/controllers/common/external_oidc.go (3)
59-69: Appropriate HasSynced guards for the upgrade race condition fix. The upfront sync checks correctly ensure the informer caches are consistent before proceeding, which addresses the root cause of the upgrade-time race described in the PR objectives.
82-90: Node status validation looks correct and addresses prior review feedback. The empty node statuses check is now before the loop, and the early return on invalid `CurrentRevision` terminates the loop immediately as previously suggested.

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
61-61: Correctly wires AuthConfigChecker informers to the controller factory. This ensures the factory waits for the authentication, kubeapiservers, and configmaps informers to sync before invoking `sync()`, which complements the `HasSynced` checks added in `OIDCAvailable()`.

pkg/controllers/common/external_oidc_test.go (2)
23-71: Sync-flagged scenarios and error/availability expectations look correct. The added `authInformerSynced`, `kasInformerSynced`, and `cmInformerSynced` flags, plus the new cases for unsynced informers and invalid/zero node revisions, line up well with the intended behavior: failing fast with errors when you can't reliably infer OIDC state, and otherwise driving availability off the rollout state. No issues from a correctness or maintainability standpoint here.

Also applies to: 243-362

371-399: Informer wiring with `NewFakeSharedIndexInformerWithSync` is sound. Switching the KAS, auth, and configmap informers over to `test.NewFakeSharedIndexInformerWithSync(...)` and updating the indexer keyfuncs to `func(obj any) (string, error)` matches the new informer interfaces and accurately injects `HasSynced` behavior into the tests. This is a clean, maintainable way to reproduce the original upgrade-time race in a controlled manner.

pkg/controllers/deployment/deployment_controller.go (1)
116-133: Informer wiring for AuthConfigChecker looks correct and aligns with PR goals. Factoring cluster-scoped informers into `clusterScopedInformers` and appending `AuthConfigCheckerInformers` cleanly ensures the workload controller now waits on all relevant caches (ingress, proxy, nodes, and OIDC-related informers) before use. This directly addresses the race around unsynced informers without introducing extra complexity or obvious regressions.
Latest push reorganizes some code, no effective change on functionality; verification stands.

/verified by @xingxingxia

@liouk: This PR has been marked as verified by

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Also, make the check fail if informers are not synced to avoid false negatives.
d6af55f to 3265312 (force-push)
/verified by @xingxingxia

@liouk: This PR has been marked as verified by

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1)
56-64: AuthConfigChecker informers correctly wired; consider using the controller field to avoid a duplicate copy. Adding `WithInformers(common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...)` ensures this controller won't call `OIDCAvailable()` before the underlying informers are synced, which addresses the upgrade race you're fixing.

You now have two copies of `AuthConfigChecker` here (the struct field and the local value whose address is passed to `AuthConfigCheckerInformers`). It's safe because both copies hold references to the same underlying informers, but if `AuthConfigChecker` ever gains mutable state, the field and the pointer could diverge. Consider switching the call to use `&controller.authConfigChecker` instead, to keep a single canonical instance.

Please double-check that no future code intends to mutate `AuthConfigChecker` state; if so, updating all similar constructors to pass the struct field pointer will avoid subtle bugs.
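The divergence risk the reviewer describes is a general Go value-semantics point, shown here with a hypothetical mutable field (today's `AuthConfigChecker` has no such field; this is purely illustrative):

```go
package main

import "fmt"

// checker stands in for AuthConfigChecker; today it only holds informer
// references, but suppose it gained a mutable field.
type checker struct{ observedErrors int }

// diverge shows how a struct field and a separate local copy drift apart
// once one of them is mutated through a pointer.
func diverge() (fieldVal, copyVal int) {
	var field checker // the copy stored on the controller struct
	local := field    // the extra copy whose address is passed around
	ptr := &local

	ptr.observedErrors = 1 // mutate through the local copy only

	return field.observedErrors, ptr.observedErrors
}

func main() {
	f, c := diverge()
	fmt.Println(f, c) // prints "0 1": the two copies no longer agree
}
```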
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (11)
pkg/controllers/common/external_oidc.go(2 hunks)pkg/controllers/common/external_oidc_test.go(18 hunks)pkg/controllers/deployment/deployment_controller.go(2 hunks)pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go(1 hunks)pkg/controllers/ingressstate/ingress_state_controller.go(1 hunks)pkg/controllers/oauthendpoints/oauth_endpoints_controller.go(3 hunks)pkg/controllers/proxyconfig/proxyconfig_controller.go(1 hunks)pkg/controllers/readiness/wellknown_ready_controller.go(1 hunks)pkg/controllers/routercerts/controller_test.go(2 hunks)pkg/operator/starter.go(1 hunks)test/library/informer.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/controllers/readiness/wellknown_ready_controller.go
- pkg/controllers/routercerts/controller_test.go
- pkg/controllers/common/external_oidc_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/controllers/proxyconfig/proxyconfig_controller.go
- pkg/operator/starter.go
- pkg/controllers/common/external_oidc.go
- pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
- pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
- test/library/informer.go
- pkg/controllers/deployment/deployment_controller.go
- pkg/controllers/ingressstate/ingress_state_controller.go
🧬 Code graph analysis (5)
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/operator/starter.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (2)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go (1): `NewEndpointAccessibleController` (38-67)

pkg/controllers/deployment/deployment_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)

pkg/controllers/ingressstate/ingress_state_controller.go (1)
- pkg/controllers/common/external_oidc.go (1): `AuthConfigCheckerInformers` (46-52)
🔇 Additional comments (7)
pkg/operator/starter.go (1)
133-137: OIDC informers now consistently tracked by workload, static resources, and APIService controllers. Wiring `AuthConfigCheckerInformers` into:

- the static resource controller via `AddInformer`,
- the OAuth API server workload via the `WithWorkloadController` extra informer slice, and
- the APIService controller via the trailing variadic informers,

makes the controllers that consult `oidcAvailable`/`OIDCAvailable()` wait for the same caches before syncing. This aligns with the race fix described in the PR and should prevent transient misclassification of OIDC state during upgrades.

The `append([]factory.Informer{authenticationOperatorClient.Informer()}, common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...)...` pattern also keeps the operator client informer in the trigger set while extending it cleanly.

Please re-run the external OIDC and upgrade tests (e.g. the e2e-oidc suites you referenced in the PR) to confirm there are no new transient Degraded/Available flips in these controllers now that they depend on the synced AuthConfigChecker informers.
Also applies to: 172-174, 479-499, 583-584
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
56-63: ProxyConfig checker correctly gated on AuthConfigChecker informers. Including `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` in the controller's informers set matches the new `OIDCAvailable()` behavior: the factory won't invoke `sync` until these informers report `HasSynced()`, so the proxy checker no longer risks reporting errors based on partially populated OIDC/KAS/configmap caches.

Please ensure existing proxy configuration tests (especially around NO_PROXY and external OIDC) still pass, since errors from `OIDCAvailable()` will now surface as Degraded via `WithSyncDegradedOnError`.
OIDCAvailable()will now surface as Degraded viaWithSyncDegradedOnError.Also applies to: 75-81
pkg/controllers/deployment/deployment_controller.go (1)
116-122: Cluster-scoped informer slice refactor + OIDC informers looks solid. Creating `clusterScopedInformers` and appending `AuthConfigCheckerInformers` before passing them into `workload.NewController` both improves readability (one place to see all cluster-wide triggers) and guarantees that the oauth-server workload controller waits for the OIDC/KAS/configmap caches it queries via `authConfigChecker`.

No functional issues stand out; the slice composition and variadic use are idiomatic.

It would be good to confirm that any tests asserting oauth-server deployment behavior around OIDC transitions (e.g., when disabling the integrated OAuth server) still behave as expected with this extended informer set.
Also applies to: 123-135
pkg/controllers/ingressstate/ingress_state_controller.go (1)
63-71: IngressState controller now correctly tracks OIDC-related informers. Adding `AuthConfigCheckerInformers` to the controller's informer set aligns with the stricter `OIDCAvailable()` implementation. The controller will only attempt the OIDC-based short-circuit (and subsequent ingress endpoints health evaluation) once the authentication, kube-apiserver, and relevant configmap caches are in sync, which should eliminate the transient upgrade-time status flips you were seeing.

Please verify that the ingress endpoints-related Degraded conditions during upgrade behave as expected now (i.e., no spurious flips when caches are still warming).
Also applies to: 83-91
pkg/controllers/common/external_oidc.go (1)
46-52: OIDCAvailable now robustly validates informer sync and kube-apiserver node status. The tightened `OIDCAvailable()` logic looks correct and aligns with the upgrade-race fix:

- `HasSynced()` checks for all three informers (Authentications, KubeAPIServers, and KAS namespace ConfigMaps) ensure you never evaluate OIDC state against partially populated caches.
- Explicitly erroring when `kas.Status.NodeStatuses` is empty and when any node has `CurrentRevision <= 0` prevents ambiguous "false but healthy" outcomes when the apiserver rollout has not yet produced valid node status entries.
- The existing per-revision checks (auth-config and config ConfigMaps plus config.yaml content) now operate only on validated revisions, which keeps the success path semantically unchanged while removing the old silent-success-on-empty-observed-set behavior.

This should turn the previous transient misclassifications into clear, actionable errors while the new informer wiring prevents those errors from appearing once caches are actually synced.

Given the broader behavior change (errors instead of silent `false` in several edge cases), please confirm that all controllers and tests consuming `OIDCAvailable()` (including the endpoint-accessible controllers and any helper like `oidcAvailable(...)` in this package) correctly surface these errors as Degraded or log them as intended, and that you have coverage for:

- empty `NodeStatuses`,
- nodes with non-positive `CurrentRevision`,
- informers not yet synced.
Also applies to: 59-69, 77-92, 94-118
pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (1)
51-75: OAuth endpoint health controllers correctly extended with AuthConfigChecker informers. For all three controllers (`NewOAuthRouteCheckController`, `NewOAuthServiceCheckController`, and `NewOAuthServiceEndpointsCheckController`), constructing a local `[]factory.Informer` with the existing route/service/endpoints/configmap triggers and then appending `common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)` cleanly ties the endpoint-accessible controllers to the same caches that `OIDCAvailable()` inspects.

Combined with the stricter `OIDCAvailable()` preconditions, this should eliminate the previous situation where endpoint health checks ran against unsynced authentication/KAS/configmap state and briefly flipped conditions during upgrades.

Please confirm that the endpoint-accessible controllers still behave as expected when OIDC is disabled (checks run) vs enabled (checks short-circuit via `endpointCheckDisabledFunc`), especially around upgrades where informers are catching up.
endpointCheckDisabledFunc), especially around upgrades where informers are catching up.Also applies to: 84-106, 116-138
test/library/informer.go (1)
8-38: Test helper cleanly exposes configurable HasSynced for informers. `FakeSharedIndexInformerWithSync` is a straightforward way to decouple the lister from an informer whose `HasSynced()` behavior can be controlled in tests, while still reusing `v1helpers.NewFakeSharedIndexInformer()` for the underlying implementation. This should make it much easier to exercise the new `OIDCAvailable()` sync gating logic without impacting production code paths.

When you wire this helper into tests for `AuthConfigChecker` and the controllers using `AuthConfigCheckerInformers`, please ensure you cover both `hasSynced = false` (expecting errors / no sync) and `hasSynced = true` (normal behavior) to validate the new race protections.
everettraven
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: everettraven, liouk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@liouk: Jira Issue Verification Checks: Jira Issue OCPBUGS-65675

Jira Issue OCPBUGS-65675 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira backport release-4.20
@xingxingxia: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@openshift-ci-robot: new pull request created: #814

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
No description provided.