@liouk
Member

@liouk liouk commented Oct 16, 2025

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 16, 2025
@openshift-ci-robot
Contributor

@liouk: This pull request explicitly references no jira issue.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2025
@coderabbitai

coderabbitai bot commented Oct 16, 2025

Walkthrough

Added HasSynced gating and stricter kube-apiserver node CurrentRevision validation to OIDC availability checks; introduced a reusable test fake informer with configurable HasSynced; and wired AuthConfigChecker informers into multiple controllers and the operator workload wiring.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OIDC validation & tests: pkg/controllers/common/external_oidc.go, pkg/controllers/common/external_oidc_test.go | Add upfront informer HasSynced checks (authentications, kubeapiservers, configmaps). Validate kas.Status.NodeStatuses exists and each node has CurrentRevision > 0; remove previous silent-success path for empty observed revisions. Tests updated to use synced fake shared informers, per-test HasSynced flags, and indexer callbacks changed to func(obj any). |
| Test informer helper: test/library/informer.go | Add generic FakeSharedIndexInformerWithSync[T any] with NewFakeSharedIndexInformerWithSync, Informer() and Lister() to produce informers whose HasSynced() is configurable for tests. |
| Deployment controller informer grouping: pkg/controllers/deployment/deployment_controller.go | Create clusterScopedInformers (Ingresses, Proxies, Nodes) and augment it with AuthConfigCheckerInformers; use this grouped slice when constructing the workload controller. |
| Controllers wired with AuthConfigChecker informers: pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go, pkg/controllers/ingressstate/ingress_state_controller.go, pkg/controllers/oauthendpoints/oauth_endpoints_controller.go, pkg/controllers/proxyconfig/proxyconfig_controller.go, pkg/controllers/readiness/wellknown_ready_controller.go, pkg/operator/starter.go | Wire common.AuthConfigCheckerInformers into multiple controllers and operator workload wiring by appending or using WithInformers(...); replace several hard-coded informer lists with augmented slices. No public function signatures changed. |
| Tests: routercerts & other tests: pkg/controllers/routercerts/controller_test.go, pkg/controllers/.../*_test.go (tests updated similarly) | Replace local fake informer wrappers with test.NewFakeSharedIndexInformerWithSync(...), remove legacy local fakeInformer types, and update test wiring to respect the configured HasSynced flag. |
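
To make the test helper described above more concrete, here is a minimal standard-library sketch of a fake informer whose HasSynced() result is fixed by the test. It is not the code from test/library/informer.go: the real helper presumably wraps client-go's shared index informers, while every type and function name below (syncedInformer, fakeInformer, fakeInformerWithSync, newFakeInformerWithSync) is a hypothetical stand-in.

```go
package main

import "fmt"

// syncedInformer is a hypothetical stand-in for the HasSynced part of an
// informer interface; it is not the client-go type.
type syncedInformer interface {
	HasSynced() bool
}

// fakeInformer reports whatever sync state the test configured.
type fakeInformer struct {
	synced bool
}

func (f *fakeInformer) HasSynced() bool { return f.synced }

// fakeInformerWithSync pairs a lister of type T with an informer whose
// HasSynced result is fixed at construction time.
type fakeInformerWithSync[T any] struct {
	lister   T
	informer *fakeInformer
}

// newFakeInformerWithSync builds the fake; synced is what HasSynced returns.
func newFakeInformerWithSync[T any](lister T, synced bool) *fakeInformerWithSync[T] {
	return &fakeInformerWithSync[T]{lister: lister, informer: &fakeInformer{synced: synced}}
}

func (f *fakeInformerWithSync[T]) Informer() syncedInformer { return f.informer }
func (f *fakeInformerWithSync[T]) Lister() T                { return f.lister }

func main() {
	// A test case that wants to exercise the "cache not synced yet" path.
	unsynced := newFakeInformerWithSync([]string{"config-11"}, false)
	fmt.Println(unsynced.Informer().HasSynced(), unsynced.Lister())
}
```

A per-test flag can then be passed straight through to the constructor, so the same table-driven test can cover both synced and unsynced caches.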

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Focus review on pkg/controllers/common/external_oidc.go for correctness of HasSynced gating, NodeStatuses presence check, and CurrentRevision logic and error messages.
  • Verify tests (pkg/controllers/common/external_oidc_test.go, pkg/controllers/routercerts/controller_test.go) correctly instantiate and use test.NewFakeSharedIndexInformerWithSync and that indexer callbacks declared as func(obj any) behave as expected.
  • Check test/library/informer.go to ensure the fake informer's Informer() and HasSynced() semantics align with test assumptions.
  • Scan controller wiring changes to ensure appending AuthConfigCheckerInformers does not unintentionally omit previously required informers.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request contains no description provided by the author. The pass criterion for this check requires that "the description is related in some way to the changeset," but an empty or absent description cannot satisfy this requirement as it provides no information to relate to the changes. Although the check is lenient regarding level of detail, it still requires some description to exist and be connected to the changeset, which is not present in this case. | Add a pull request description that explains the motivation and context for returning errors when node statuses cannot be used to determine OIDC state. Even a brief description that relates to the changeset would satisfy this check; for example, explaining the issue being addressed or the benefit of these error handling improvements would help reviewers understand the PR's purpose and context. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state" directly describes the primary changes in the pull request. The main modifications are to the external OIDC component to add validation that returns errors when node statuses are empty, contain zero revisions, or cannot be reliably used to determine OIDC state. The title is specific, clear, and accurately captures the core objective of the changeset. While secondary logging additions exist in the endpoint_accessible_controller component, they are minor compared to the main focus on OIDC error handling improvements. |


@openshift-ci openshift-ci bot requested a review from ibihim October 16, 2025 09:44
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)

71-78: LGTM! Logic correctly filters invalid revisions.

The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:

-			numNodesWithEmptyRevision += 1
+			numNodesWithEmptyRevision++

The ++ operator is more idiomatic in Go for simple increments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 215805c and fc58d2d.

📒 Files selected for processing (2)
  • pkg/controllers/common/external_oidc.go (1 hunks)
  • pkg/controllers/common/external_oidc_test.go (1 hunks)
🔇 Additional comments (6)
pkg/controllers/common/external_oidc.go (3)

80-82: Good validation: catch missing node status data early.

Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.


84-86: Excellent validation: reject partial/invalid node data.

Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.


88-90: Approve defensive check, though technically unreachable.

This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

  • If len(kas.Status.NodeStatuses) == 0, line 80-82 returns early
  • If all nodes have CurrentRevision <= 0, line 84-86 returns early
  • If any nodes have CurrentRevision > 0, observedRevisions will have entries

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.
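
The reachability argument above can be sketched as follows. This is a simplified, standard-library illustration of the three checks as they are described in this review, not the code from external_oidc.go; nodeStatus and observedRevisions are hypothetical names, and a plain map stands in for the set type.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatus is a hypothetical, trimmed-down stand-in for the per-node
// entries in kas.Status.NodeStatuses.
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32
}

// observedRevisions mirrors the order of the three checks discussed above.
func observedRevisions(nodes []nodeStatus) (map[int32]struct{}, error) {
	// Check 1: no node statuses at all.
	if len(nodes) == 0 {
		return nil, errors.New("no node statuses found")
	}

	revisions := make(map[int32]struct{})
	numNodesWithEmptyRevision := 0
	for _, n := range nodes {
		if n.CurrentRevision > 0 {
			revisions[n.CurrentRevision] = struct{}{}
		} else {
			numNodesWithEmptyRevision++
		}
	}

	// Check 2: at least one node without a valid revision.
	if numNodesWithEmptyRevision > 0 {
		return nil, fmt.Errorf("%d nodes do not have a valid CurrentRevision", numNodesWithEmptyRevision)
	}

	// Check 3: defensive only. Past checks 1 and 2 there is at least one
	// node and every node inserted a revision, so the set cannot be empty.
	if len(revisions) == 0 {
		return nil, errors.New("no observed revisions found")
	}
	return revisions, nil
}

func main() {
	cases := [][]nodeStatus{
		nil,
		{{"master-0", 11}, {"master-1", 0}},
		{{"master-0", 11}, {"master-1", 11}, {"master-2", 12}},
	}
	for _, nodes := range cases {
		revs, err := observedRevisions(nodes)
		fmt.Println(len(revs), err)
	}
}
```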

pkg/controllers/common/external_oidc_test.go (3)

35-36: LGTM! Test correctly expects error for missing node statuses.

The updated expectation aligns with the new validation in OIDCAvailable() that returns an error when no node statuses are found.


37-47: LGTM! Test coverage for partial zero revisions.

This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.


48-58: LGTM! Test coverage for all zero revisions.

This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.

@liouk
Member Author

liouk commented Oct 21, 2025

/test e2e-oidc-techpreview

Comment on lines +80 to +84
	if len(kas.Status.NodeStatuses) == 0 {
		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
	}
Contributor

Could we move this before the for loop that iterates through the node statuses?

}

observedRevisions := sets.New[int32]()
numNodesWithEmptyRevision := 0
Contributor

Do we need to track this with a counter-like variable?

Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

Member Author

@liouk liouk Oct 23, 2025

> Do we need to track this with a counter-like variable?

We could also use a bool; the only reason for the counter was to include it in the log message, but I guess that doesn't add any really useful information. I'll drop it then 👍

> Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

It's not, because observedRevisions tracks unique revisions (it's a set), so the two values differ whenever several nodes are on the same revision.
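
A small sketch of why the two quantities differ, using only the standard library. The helper names countInvalid and countUniqueValid are hypothetical, and a map stands in for the set:

```go
package main

import "fmt"

// countInvalid counts entries that are not a valid (> 0) revision.
func countInvalid(revisions []int32) int {
	n := 0
	for _, r := range revisions {
		if r <= 0 {
			n++
		}
	}
	return n
}

// countUniqueValid returns the size of the set of valid revisions.
func countUniqueValid(revisions []int32) int {
	set := make(map[int32]struct{})
	for _, r := range revisions {
		if r > 0 {
			set[r] = struct{}{}
		}
	}
	return len(set)
}

func main() {
	// Three nodes, all valid, two of them on the same revision.
	revisions := []int32{11, 11, 12}
	fmt.Println(countInvalid(revisions))                      // prints 0
	fmt.Println(len(revisions) - countUniqueValid(revisions)) // prints 1
}
```

With every node on a valid revision, the subtraction still yields a non-zero value as soon as two nodes share a revision, so it cannot replace an explicit invalid-revision check.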

@liouk liouk force-pushed the fix-oidc-available-condition branch 2 times, most recently from 71dfa10 to 4d280bd Compare October 23, 2025 09:14
@liouk liouk changed the title WIP: NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state Oct 23, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2025
Comment on lines 75 to 85
+	nodesWithEmptyRevision := false
 	for _, nodeStatus := range kas.Status.NodeStatuses {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
+		if nodeStatus.CurrentRevision > 0 {
+			observedRevisions.Insert(nodeStatus.CurrentRevision)
+		} else {
+			nodesWithEmptyRevision = true
+		}
 	}

+	if nodesWithEmptyRevision {
+		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
Contributor

If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?

As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.

Suggested change
-	nodesWithEmptyRevision := false
-	for _, nodeStatus := range kas.Status.NodeStatuses {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
-		if nodeStatus.CurrentRevision > 0 {
-			observedRevisions.Insert(nodeStatus.CurrentRevision)
-		} else {
-			nodesWithEmptyRevision = true
-		}
-	}
-	if nodesWithEmptyRevision {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+	for _, nodeStatus := range kas.Status.NodeStatuses {
+		if nodeStatus.CurrentRevision <= 0 {
+			return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+		}
+		observedRevisions.Insert(nodeStatus.CurrentRevision)
+	}

Member Author

Of course -- now that we don't use the count this is much better 👍

Contributor

Did you still want to take this suggestion?

It looks like this is still outstanding.

Member Author

Of course! This one slipped through. Fixed it now.

@xingxingxia
Contributor

This PR is meant to solve the separate issue I saw in another test: #798 (comment).

Pre-merge tested this and PR #801 together using cluster-bot. #800 is already /verified, as I commented in that PR.
For #801, I ran the pre-merge test as follows:

# Cluster-Bot payload 1
build 4.21.0-0.nightly-2025-10-24-233040,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

# Cluster-Bot payload 2
build 4.21.0-0.nightly-2025-10-25-063101,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

Step 1
Launched a cluster with payload 1 and configured external OIDC auth on it. The rollout completed after about 20 minutes. Checked oc and console logins, among others; all worked.
Step 2
At 09:47:45, started the upgrade to payload 2:

[xxia@2025-10-25 09:47:45 GMT my]$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest # payload 2
...
Requested update to release image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest
[xxia@2025-10-25 09:47:49 GMT my]$

At 10:51:14, the upgrade completed:

[xxia@2025-10-25 10:51:14 GMT my]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest   True        False         39s     Cluster version is 4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest
[xxia@2025-10-25 10:51:16 GMT my]$

Step 3
Checked the CAO logs. The issue still occurred twice during the upgrade, at 10:14:00 and 10:29:20:

[xxia@2025-10-25 10:52:59 GMT my]$ oc get event -n openshift-authentication-operator -o json > events-openshift-authentication-operator.json
[xxia@2025-10-25 10:53:04 GMT my]$ cat events-openshift-authentication-operator.json | jq -r '.items[] | select(.message | test ("Available changed from")) | "\(.firstTimestamp) \(.count) \(.message)"'
...
2025-10-25T10:14:00Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found"),status.relatedObjects changed from [{"route.openshift.io" "routes" "openshift-authentication" "oauth-openshift"} {"" "services" "openshift-authentication" "oauth-openshift"} {"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}] to [{"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}]
2025-10-25T10:14:01Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")
2025-10-25T10:29:20Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found")
2025-10-25T10:29:23Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")

So the verification fails. @liouk

@liouk
Member Author

liouk commented Nov 3, 2025

Added debug logging to investigate the issue found by @xingxingxia.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2025
@liouk liouk force-pushed the fix-oidc-available-condition branch 2 times, most recently from 90f2f82 to 702bf57 Compare November 6, 2025 09:11
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)

79-120: Use a verbose log level for the new debug statements.

These [debug-801] messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. klog.V(4)) or add an explicit verbosity check.

-			klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+			klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
-			klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+			klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
-			klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+			klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
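
For readers unfamiliar with verbosity gating, the idea behind the suggestion above can be sketched with the standard library alone. This is a hypothetical stand-in that only imitates the shape of a V(level).Infof call; it is not klog, and the logger, verbose, and render names are invented for the illustration.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// logger is a hypothetical leveled logger; verbosity is the configured level.
type logger struct {
	verbosity int
	out       io.Writer
}

// verbose remembers whether the requested level is enabled.
type verbose struct {
	enabled bool
	out     io.Writer
}

// V returns a handle that only logs when the configured verbosity is at
// least the requested level.
func (l logger) V(level int) verbose {
	return verbose{enabled: l.verbosity >= level, out: l.out}
}

// Infof writes the message only if the level is enabled.
func (v verbose) Infof(format string, args ...any) {
	if v.enabled {
		fmt.Fprintf(v.out, format+"\n", args...)
	}
}

// render emits one level-4 debug message and returns what was written.
func render(verbosity int) string {
	var sb strings.Builder
	log := logger{verbosity: verbosity, out: &sb}
	log.V(4).Infof("node '%s' is on revision %d", "master-0", 11)
	return sb.String()
}

func main() {
	fmt.Printf("%q\n", render(2)) // default-like verbosity: nothing is logged
	fmt.Printf("%q\n", render(4)) // debug verbosity: the message appears
}
```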
📥 Commits

Reviewing files that changed from the base of the PR and between 90f2f82 and 702bf57.

📒 Files selected for processing (2)
  • pkg/controllers/common/external_oidc.go (3 hunks)
  • pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/libs/endpointaccessible/endpoint_accessible_controller.go

@liouk liouk changed the title NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state OCPBUGS-65675: externaloidc: return errors when node statuses cannot be used to determine oidc state Nov 17, 2025
@liouk
Member Author

liouk commented Nov 17, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 17, 2025
@openshift-ci-robot
Contributor

@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot
Contributor

@liouk: This pull request references Jira Issue OCPBUGS-65675, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh


@liouk liouk force-pushed the fix-oidc-available-condition branch 2 times, most recently from 45ba4f8 to 3eba97f Compare November 19, 2025 13:43
@liouk
Member Author

liouk commented Nov 19, 2025

@xingxingxia I've provided a fix for the observed behavior; the issue was that some controllers (the ones that aren't managing any resources, but rather running checks) were not tracking the informers needed to check for OIDC configuration availability. As a result, during upgrade, the informers were being used before having synced.

Originally this was done on purpose, to avoid the overhead of tracking and reacting to changes in those informers: these controllers do not actively manage any operands, so relying on their next sync was supposedly sufficient. However, I had not anticipated this edge case.

Since these informers aren't expected to change frequently (two cluster singletons, one configmap informer for the kas namespace), I believe working with consistently synced caches is more important than this overhead. Hence the fix in 3eba97f.
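
The failure mode described here, reading from a cache before it has synced, and the guard against it can be sketched as below. This is a simplified illustration under stated assumptions, not the operator's code: informer, errNotSynced, and this oidcAvailable signature are hypothetical, and the actual OIDC check is reduced to a callback.

```go
package main

import (
	"errors"
	"fmt"
)

// informer is a hypothetical stand-in that only carries a name and a
// sync state.
type informer struct {
	name   string
	synced bool
}

func (i informer) HasSynced() bool { return i.synced }

var errNotSynced = errors.New("informer has not synced yet")

// oidcAvailable refuses to evaluate the check while any cache is unsynced,
// so an empty, not-yet-populated cache is reported as an error instead of
// being mistaken for "OIDC is not configured".
func oidcAvailable(informers []informer, check func() (bool, error)) (bool, error) {
	for _, inf := range informers {
		if !inf.HasSynced() {
			return false, fmt.Errorf("%s: %w", inf.name, errNotSynced)
		}
	}
	return check()
}

func main() {
	check := func() (bool, error) { return true, nil }

	duringUpgrade := []informer{
		{"authentications", true},
		{"kubeapiservers", true},
		{"configmaps", false},
	}
	fmt.Println(oidcAvailable(duringUpgrade, check))

	steadyState := []informer{
		{"authentications", true},
		{"kubeapiservers", true},
		{"configmaps", true},
	}
	fmt.Println(oidcAvailable(steadyState, check))
}
```

Returning an error rather than false lets the caller retry on its next sync instead of acting on a false negative.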

@xingxingxia
Contributor

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Nov 20, 2025

@liouk: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 702bf57 | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Nov 26, 2025
@openshift-ci-robot
Contributor

@liouk: This pull request references Jira Issue OCPBUGS-65675, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xingxingxia

Details

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 26, 2025
@openshift-ci openshift-ci bot requested a review from xingxingxia November 26, 2025 14:00
Contributor

@everettraven everettraven left a comment

Looks like there is still one outstanding suggestion.

Other than that, this LGTM.

@liouk liouk force-pushed the fix-oidc-available-condition branch from 49f961c to d6af55f Compare December 2, 2025 09:20
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Dec 2, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controllers/common/external_oidc_test.go (1)

285-303: Likely typo: duplicate config-11 configmap in “two nodes ready” scenario

In the "oidc getting disabled, rollout in progress, two nodes ready" case the configMaps slice contains "config-11" twice and no "config-12":

cm("config-11", "config.yaml", kasConfigJSONWithOIDC),
cm("config-11", "config.yaml", kasConfigJSONWithOIDC),
cm("config-13", "config.yaml", kasConfigJSONWithoutOIDC),

Because the indexer keys by name/namespace, the second "config-11" overwrites the first, and this scenario won’t actually exercise a distinct config-12 revision despite the surrounding tests and node statuses implying 11/12/13 should all be present. This weakens coverage for the “two nodes ready” disabling rollout.

Suggest correcting the second entry to config-12:

-               cm("config-11", "config.yaml", kasConfigJSONWithOIDC),
+               cm("config-12", "config.yaml", kasConfigJSONWithOIDC),
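
The overwrite behaviour behind this finding can be reproduced with a plain map keyed by namespace/name. This is only an illustration of the keying argument, not the client-go indexer; configMap, keyFunc, buildIndex, and the "kas-namespace" value are hypothetical.

```go
package main

import "fmt"

// configMap is a hypothetical, trimmed-down stand-in for the test objects.
type configMap struct {
	Namespace string
	Name      string
	Payload   string
}

// keyFunc builds a namespace/name key, as a name-based indexer would.
func keyFunc(cm configMap) string {
	return cm.Namespace + "/" + cm.Name
}

// buildIndex stores each object under its key; a later object with the
// same key silently replaces the earlier one.
func buildIndex(cms []configMap) map[string]configMap {
	idx := make(map[string]configMap, len(cms))
	for _, cm := range cms {
		idx[keyFunc(cm)] = cm
	}
	return idx
}

func main() {
	idx := buildIndex([]configMap{
		{"kas-namespace", "config-11", "with OIDC"},
		{"kas-namespace", "config-11", "with OIDC"}, // intended to be config-12
		{"kas-namespace", "config-13", "without OIDC"},
	})
	_, has12 := idx["kas-namespace/config-12"]
	fmt.Println(len(idx), has12) // prints: 2 false
}
```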
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)

94-96: Unreachable code: this check can never be true.

With the current logic:

  1. Line 82 returns if len(kas.Status.NodeStatuses) == 0
  2. Line 88-90 returns if any CurrentRevision <= 0
  3. Otherwise, line 91 inserts into observedRevisions

So after the loop, observedRevisions.Len() >= 1 is guaranteed. This condition can never trigger.

Consider removing the dead code:

-	if observedRevisions.Len() == 0 {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no observed revisions found")
-	}
📥 Commits

Reviewing files that changed from the base of the PR and between 49f961c and d6af55f.

📒 Files selected for processing (11)
  • pkg/controllers/common/external_oidc.go (2 hunks)
  • pkg/controllers/common/external_oidc_test.go (18 hunks)
  • pkg/controllers/deployment/deployment_controller.go (2 hunks)
  • pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1 hunks)
  • pkg/controllers/ingressstate/ingress_state_controller.go (1 hunks)
  • pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (3 hunks)
  • pkg/controllers/proxyconfig/proxyconfig_controller.go (1 hunks)
  • pkg/controllers/readiness/wellknown_ready_controller.go (1 hunks)
  • pkg/controllers/routercerts/controller_test.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • test/library/informer.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
  • pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
  • pkg/controllers/readiness/wellknown_ready_controller.go
  • pkg/controllers/routercerts/controller_test.go
  • pkg/operator/starter.go
  • test/library/informer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/controllers/ingressstate/ingress_state_controller.go
  • pkg/controllers/deployment/deployment_controller.go
  • pkg/controllers/common/external_oidc.go
  • pkg/controllers/proxyconfig/proxyconfig_controller.go
  • pkg/controllers/common/external_oidc_test.go
🧬 Code graph analysis (3)
pkg/controllers/ingressstate/ingress_state_controller.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/controllers/common/external_oidc_test.go (1)
test/library/informer.go (1)
  • NewFakeSharedIndexInformerWithSync (13-18)
🔇 Additional comments (7)
pkg/controllers/ingressstate/ingress_state_controller.go (1)

63-69: Informer wiring correctly gates OIDC checks on synced caches

Hooking common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker) into the controller factory’s WithInformers set cleanly ensures the OIDC-related informers are tracked and must HasSynced before sync runs. This aligns with the PR’s goal of avoiding upgrade-time races with unsynced caches, without altering existing control flow.

pkg/controllers/common/external_oidc.go (2)

59-69: Appropriate HasSynced guards for the upgrade race condition fix.

The upfront sync checks correctly ensure the informer caches are consistent before proceeding, which addresses the root cause of the upgrade-time race described in the PR objectives.


82-90: Node status validation looks correct and addresses prior review feedback.

The empty node statuses check is now before the loop, and the early return on invalid CurrentRevision terminates the loop immediately as previously suggested.

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)

61-61: Correctly wires AuthConfigChecker informers to the controller factory.

This ensures the factory waits for the authentication, kubeapiservers, and configmaps informers to sync before invoking sync(), which complements the HasSynced checks added in OIDCAvailable().

pkg/controllers/common/external_oidc_test.go (2)

23-71: Sync-flagged scenarios and error/availability expectations look correct

The added authInformerSynced, kasInformerSynced, and cmInformerSynced flags, plus the new cases for unsynced informers and invalid/zero node revisions, line up well with the intended behavior: failing fast with errors when you can’t reliably infer OIDC state, and otherwise driving availability off the rollout state. No issues from a correctness or maintainability standpoint here.

Also applies to: 243-362


371-399: Informer wiring with NewFakeSharedIndexInformerWithSync is sound

Switching the KAS, auth, and configmap informers over to test.NewFakeSharedIndexInformerWithSync(...) and updating the indexer keyfuncs to func(obj any) (string, error) matches the new informer interfaces and accurately injects HasSynced behavior into the tests. This is a clean, maintainable way to reproduce the original upgrade-time race in a controlled manner.

pkg/controllers/deployment/deployment_controller.go (1)

116-133: Informer wiring for AuthConfigChecker looks correct and aligns with PR goals

Factoring cluster-scoped informers into clusterScopedInformers and appending AuthConfigCheckerInformers cleanly ensures the workload controller now waits on all relevant caches (ingress, proxy, nodes, and OIDC-related informers) before use. This directly addresses the race around unsynced informers without introducing extra complexity or obvious regressions.

@liouk
Member Author

liouk commented Dec 2, 2025

Latest push reorganizes some code, no effective change on functionality; verification stands.

/verified by @xingxingxia

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 2, 2025
@openshift-ci-robot
Contributor

@liouk: This PR has been marked as verified by @xingxingxia.

Details

In response to this:

Latest push reorganizes some code, no effective change on functionality; verification stands.

/verified by @xingxingxia

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

liouk added 2 commits December 2, 2025 10:26
Also, make the check fail if informers are not synced to avoid false negatives.
@liouk liouk force-pushed the fix-oidc-available-condition branch from d6af55f to 3265312 Compare December 2, 2025 09:26
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Dec 2, 2025
@liouk
Member Author

liouk commented Dec 2, 2025

/verified by @xingxingxia

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 2, 2025
@openshift-ci-robot
Contributor

@liouk: This PR has been marked as verified by @xingxingxia.

Details

In response to this:

/verified by @xingxingxia

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1)

56-64: AuthConfigChecker informers correctly wired; consider using controller field to avoid duplicate copy

Adding WithInformers(common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...) ensures this controller won’t call OIDCAvailable() before the underlying informers are synced, which addresses the upgrade race you’re fixing.

You now have two copies of AuthConfigChecker here (the struct field and the local value whose address is passed to AuthConfigCheckerInformers). It’s safe because both copies hold references to the same underlying informers, but if AuthConfigChecker ever gains mutable state, the field and the pointer could diverge. Consider switching the call to use &controller.authConfigChecker instead, to keep a single canonical instance.

Please double‑check that no future code intends to mutate AuthConfigChecker state; if so, updating all similar constructors to pass the struct field pointer will avoid subtle bugs.
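To make the divergence concern concrete, here is a small sketch under stated assumptions: `checker` stands in for AuthConfigChecker and is given a hypothetical mutable `hits` field that the real type may not have, and `newController` only imitates the constructor shape described above (value copied into the field, pointer taken to the parameter).

```go
package main

import "fmt"

// checker is a hypothetical stand-in for AuthConfigChecker, with one piece of
// mutable state added purely for illustration.
type checker struct {
	hits int
}

// controller stores the checker by value, so it holds its own copy.
type controller struct {
	checker checker
}

// newController copies the parameter into the struct field and separately
// returns a pointer to the parameter itself, giving two distinct instances.
func newController(c checker) (*controller, *checker) {
	ctrl := &controller{checker: c}
	return ctrl, &c
}

func main() {
	ctrl, local := newController(checker{})
	local.hits++ // mutates the local copy only; the field is unaffected
	fmt.Println(ctrl.checker.hits, local.hits) // 0 1
}
```

Passing `&ctrl.checker` instead of `&c` would leave a single instance, which is the change the comment suggests.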

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d6af55f and 3265312.

📒 Files selected for processing (11)
  • pkg/controllers/common/external_oidc.go (2 hunks)
  • pkg/controllers/common/external_oidc_test.go (18 hunks)
  • pkg/controllers/deployment/deployment_controller.go (2 hunks)
  • pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go (1 hunks)
  • pkg/controllers/ingressstate/ingress_state_controller.go (1 hunks)
  • pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (3 hunks)
  • pkg/controllers/proxyconfig/proxyconfig_controller.go (1 hunks)
  • pkg/controllers/readiness/wellknown_ready_controller.go (1 hunks)
  • pkg/controllers/routercerts/controller_test.go (2 hunks)
  • pkg/operator/starter.go (1 hunks)
  • test/library/informer.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/controllers/readiness/wellknown_ready_controller.go
  • pkg/controllers/routercerts/controller_test.go
  • pkg/controllers/common/external_oidc_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/controllers/proxyconfig/proxyconfig_controller.go
  • pkg/operator/starter.go
  • pkg/controllers/common/external_oidc.go
  • pkg/controllers/ingressnodesavailable/ingress_nodes_available_controller.go
  • pkg/controllers/oauthendpoints/oauth_endpoints_controller.go
  • test/library/informer.go
  • pkg/controllers/deployment/deployment_controller.go
  • pkg/controllers/ingressstate/ingress_state_controller.go
🧬 Code graph analysis (5)
pkg/controllers/proxyconfig/proxyconfig_controller.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/operator/starter.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (2)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/libs/endpointaccessible/endpoint_accessible_controller.go (1)
  • NewEndpointAccessibleController (38-67)
pkg/controllers/deployment/deployment_controller.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
pkg/controllers/ingressstate/ingress_state_controller.go (1)
pkg/controllers/common/external_oidc.go (1)
  • AuthConfigCheckerInformers (46-52)
🔇 Additional comments (7)
pkg/operator/starter.go (1)

133-137: OIDC informers now consistently tracked by workload, static resources, and APIService controllers

Wiring AuthConfigCheckerInformers into:

  • the static resource controller via AddInformer,
  • the OAuth API server workload via the WithWorkloadController extra informer slice, and
  • the APIService controller via the trailing variadic informers,

makes the controllers that consult oidcAvailable/OIDCAvailable() wait for the same caches before syncing. This aligns with the race fix described in the PR and should prevent transient misclassification of OIDC state during upgrades.

The append([]factory.Informer{authenticationOperatorClient.Informer()}, common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker)...)... pattern also keeps the operator client informer in the trigger set while extending it cleanly.

Please re‑run the external OIDC and upgrade tests (e.g. the e2e‑oidc suites you referenced in the PR) to confirm there are no new transient Degraded/Available flips in these controllers now that they depend on the synced AuthConfigChecker informers.

Also applies to: 172-174, 479-499, 583-584

pkg/controllers/proxyconfig/proxyconfig_controller.go (1)

56-63: ProxyConfig checker correctly gated on AuthConfigChecker informers

Including common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker) in the controller’s informers set matches the new OIDCAvailable() behavior: the factory won’t invoke sync until these informers report HasSynced(), so the proxy checker no longer risks reporting errors based on partially populated OIDC/KAS/configmap caches.

Please ensure existing proxy configuration tests (especially around NO_PROXY and external OIDC) still pass, since errors from OIDCAvailable() will now surface as Degraded via WithSyncDegradedOnError.

Also applies to: 75-81

pkg/controllers/deployment/deployment_controller.go (1)

116-122: Cluster-scoped informer slice refactor + OIDC informers looks solid

Creating clusterScopedInformers and appending AuthConfigCheckerInformers before passing them into workload.NewController both improves readability (one place to see all cluster-wide triggers) and guarantees that the oauth-server workload controller waits for the OIDC/KAS/configmap caches it queries via authConfigChecker.

No functional issues stand out; the slice composition and variadic use are idiomatic.
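The slice composition and variadic hand-off referred to here follow a common Go idiom, sketched below. All names are placeholders: `informer` substitutes for factory.Informer, `checkerInformers` for a helper like AuthConfigCheckerInformers, and `newController` for a constructor that accepts variadic informers; none of them reproduce the real signatures.

```go
package main

import "fmt"

// informer is a hypothetical placeholder for factory.Informer.
type informer string

// checkerInformers stands in for a helper that returns the OIDC-related
// informers as a slice.
func checkerInformers() []informer {
	return []informer{"authentications", "kubeapiservers", "configmaps"}
}

// newController accepts its trigger informers as a variadic parameter and
// returns them so the example can inspect what was received.
func newController(name string, informers ...informer) []informer {
	return informers
}

func main() {
	// Group the cluster-scoped informers in one place, then extend the group.
	clusterScoped := []informer{"ingresses", "proxies", "nodes"}
	clusterScoped = append(clusterScoped, checkerInformers()...)

	// Spread the slice into the variadic parameter.
	got := newController("workload", clusterScoped...)
	fmt.Println(len(got), got) // 6 [ingresses proxies nodes authentications kubeapiservers configmaps]
}
```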

It would be good to confirm that any tests asserting oauth‑server deployment behavior around OIDC transitions (e.g., when disabling the integrated OAuth server) still behave as expected with this extended informer set.

Also applies to: 123-135

pkg/controllers/ingressstate/ingress_state_controller.go (1)

63-71: IngressState controller now correctly tracks OIDC-related informers

Adding AuthConfigCheckerInformers to the controller’s informer set aligns with the stricter OIDCAvailable() implementation. The controller will only attempt the OIDC‑based short‑circuit (and subsequent ingress endpoints health evaluation) once the authentication, kube‑apiserver, and relevant configmap caches are in sync, which should eliminate the transient upgrade‑time status flips you were seeing.

Please verify that the ingress endpoints–related Degraded conditions during upgrade behave as expected now (i.e., no spurious flips when caches are still warming).

Also applies to: 83-91

pkg/controllers/common/external_oidc.go (1)

46-52: OIDCAvailable now robustly validates informer sync and kube-apiserver node status

The tightened OIDCAvailable() logic looks correct and aligns with the upgrade-race fix:

  • HasSynced() checks for all three informers (Authentications, KubeAPIServers, and KAS namespace ConfigMaps) ensure you never evaluate OIDC state against partially populated caches.
  • Explicitly erroring when kas.Status.NodeStatuses is empty and when any node has CurrentRevision <= 0 prevents ambiguous “false but healthy” outcomes when the apiserver rollout has not yet produced valid node status entries.
  • The existing per-revision checks (auth-config and config ConfigMaps plus config.yaml content) now operate only on validated revisions, which keeps the success path semantically unchanged while removing the old silent-success-on-empty-observed-set behavior.

This should turn the previous transient misclassifications into clear, actionable errors, while the new informer wiring keeps sync from running until the caches have synced, so those errors should not surface in normal operation.

Given the broader behavior change (errors instead of silent false in several edge cases), please confirm that all controllers and tests consuming OIDCAvailable() (including the endpoint-accessible controllers and any helper like oidcAvailable(...) in this package) correctly surface these errors as Degraded or log them as intended, and that you have coverage for:

  • empty NodeStatuses,
  • nodes with non-positive CurrentRevision,
  • informers not yet synced.

Also applies to: 59-69, 77-92, 94-118

pkg/controllers/oauthendpoints/oauth_endpoints_controller.go (1)

51-75: OAuth endpoint health controllers correctly extended with AuthConfigChecker informers

For all three controllers (NewOAuthRouteCheckController, NewOAuthServiceCheckController, and NewOAuthServiceEndpointsCheckController), constructing a local []factory.Informer with the existing route/service/endpoints/configmap triggers and then appending common.AuthConfigCheckerInformers[factory.Informer](&authConfigChecker) cleanly ties the endpoint-accessible controllers to the same caches that OIDCAvailable() inspects.

Combined with the stricter OIDCAvailable() preconditions, this should eliminate the previous situation where endpoint health checks ran against unsynced authentication/KAS/configmap state and briefly flipped conditions during upgrades.

Please confirm that the endpoint-accessible controllers still behave as expected when OIDC is disabled (checks run) vs enabled (checks short-circuit via endpointCheckDisabledFunc), especially around upgrades where informers are catching up.
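The three cases to confirm (checks run, checks short-circuit, caches still catching up) can be pictured with the sketch below. It is an assumption-laden illustration: `runEndpointCheck`, `errNotReady`, and the `func() (bool, error)` shape of the disabled callback are hypothetical and are not taken from endpointCheckDisabledFunc's actual signature.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical error for caches that have not synced yet.
var errNotReady = errors.New("informers not synced")

// runEndpointCheck runs check unless disabled reports true. An error from
// disabled is propagated, because the "disabled" answer is then unknown.
func runEndpointCheck(disabled func() (bool, error), check func() error) error {
	if disabled != nil {
		skip, err := disabled()
		if err != nil {
			return fmt.Errorf("cannot decide whether checks are disabled: %w", err)
		}
		if skip {
			return nil // short-circuit: nothing to check
		}
	}
	return check()
}

func main() {
	check := func() error { fmt.Println("endpoint check ran"); return nil }

	// OIDC disabled: the check runs.
	_ = runEndpointCheck(func() (bool, error) { return false, nil }, check)

	// OIDC enabled: the check is skipped.
	_ = runEndpointCheck(func() (bool, error) { return true, nil }, check)

	// Caches still catching up: the error surfaces and the check is not run.
	err := runEndpointCheck(func() (bool, error) { return false, errNotReady }, check)
	fmt.Println(err) // cannot decide whether checks are disabled: informers not synced
}
```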

Also applies to: 84-106, 116-138

test/library/informer.go (1)

8-38: Test helper cleanly exposes configurable HasSynced for informers

FakeSharedIndexInformerWithSync is a straightforward way to decouple the lister from an informer whose HasSynced() behavior can be controlled in tests, while still reusing v1helpers.NewFakeSharedIndexInformer() for the underlying implementation. This should make it much easier to exercise the new OIDCAvailable() sync gating logic without impacting production code paths.

When you wire this helper into tests for AuthConfigChecker and the controllers using AuthConfigCheckerInformers, please ensure you cover both hasSynced = false (expecting errors / no sync) and hasSynced = true (normal behavior) to validate the new race protections.
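As a rough picture of how such a helper can be built, the sketch below overrides only HasSynced by embedding an existing informer and pairs it with a lister of any type. It is not the code in test/library/informer.go: `informer`, `baseInformer`, `syncOverride`, `fakeWithSync`, and `newFakeWithSync` are hypothetical names, and the two-method `informer` interface is a deliberately tiny substitute for the real informer interface.

```go
package main

import "fmt"

// informer is a hypothetical two-method slice of an informer interface.
type informer interface {
	HasSynced() bool
	Name() string
}

// baseInformer plays the role of an existing fake whose HasSynced is fixed.
type baseInformer struct{ name string }

func (b baseInformer) HasSynced() bool { return true }
func (b baseInformer) Name() string    { return b.name }

// syncOverride embeds an informer and overrides only HasSynced; every other
// method is promoted from the embedded value.
type syncOverride struct {
	informer
	hasSynced bool
}

func (s syncOverride) HasSynced() bool { return s.hasSynced }

// fakeWithSync bundles a lister of any type T with an informer whose
// HasSynced result is chosen by the test.
type fakeWithSync[T any] struct {
	lister T
	inf    informer
}

func newFakeWithSync[T any](lister T, base informer, hasSynced bool) *fakeWithSync[T] {
	return &fakeWithSync[T]{
		lister: lister,
		inf:    syncOverride{informer: base, hasSynced: hasSynced},
	}
}

func (f *fakeWithSync[T]) Informer() informer { return f.inf }
func (f *fakeWithSync[T]) Lister() T          { return f.lister }

func main() {
	f := newFakeWithSync([]string{"cluster"}, baseInformer{name: "authentications"}, false)
	fmt.Println(f.Informer().Name(), f.Informer().HasSynced(), f.Lister()) // authentications false [cluster]
}
```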

Contributor

@everettraven everettraven left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: everettraven, liouk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit e6c52f8 into openshift:master Dec 2, 2025
14 checks passed
@openshift-ci-robot
Contributor

@liouk: Jira Issue Verification Checks: Jira Issue OCPBUGS-65675
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-65675 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@xingxingxia
Contributor

/jira backport release-4.20

@openshift-ci-robot
Contributor

@xingxingxia: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.20

Details

In response to this:

/jira backport release-4.20

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #814

Details

In response to this:

@xingxingxia: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.20

In response to this:

/jira backport release-4.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria
