
Conversation

@wking (Member) commented Mar 30, 2019

When the CVO reports "98% complete" and similar, that's nice, but it would be useful to know what it was still working on (see https://bugzilla.redhat.com/show_bug.cgi?id=1691513#c17). For example, this would provide at least a starting point for figuring out what component(s) were blocking an upgrade or install.

The delete-pointer-while-preserving-order lines in FinishTask() are the Delete recipe from the Go wiki's SliceTricks page (https://github.com/golang/go/wiki/SliceTricks): copy(a[i:], a[i+1:]); a[len(a)-1] = nil; a = a[:len(a)-1].

The reflect.DeepEqual addition avoids:

./sync_worker_test.go:73:13: invalid operation: evt != tt.next (struct containing []*payload.Task cannot be compared)
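
For illustration, here is a minimal standalone program reproducing that comparability issue; Task and SyncWorkerStatus are simplified stand-ins for the real payload.Task and status types:

package main

import (
    "fmt"
    "reflect"
)

// Task stands in for payload.Task; the fields are illustrative.
type Task struct{ Kind, Name string }

// SyncWorkerStatus mirrors the relevant shape: the []*Task field
// makes values of this struct non-comparable with == or !=.
type SyncWorkerStatus struct {
    Fraction float64
    Current  []*Task
}

func main() {
    a := SyncWorkerStatus{Fraction: 0.63, Current: []*Task{{"Role", "prometheus-k8s"}}}
    b := SyncWorkerStatus{Fraction: 0.63, Current: []*Task{{"Role", "prometheus-k8s"}}}
    // fmt.Println(a != b) // compile error: struct containing []*Task cannot be compared
    fmt.Println(reflect.DeepEqual(a, b)) // true: DeepEqual compares through slices and pointers
}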

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2019
@wking force-pushed the report-current-tasks branch from 9ceecc0 to bd00466 on March 30, 2019 07:17
@wking (Member Author) commented Mar 30, 2019

This gets long:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/152/pull-ci-openshift-cluster-version-operator-master-e2e-aws/535/artifacts/e2e-aws/installer/.openshift_install.log | grep -A1 '% complete'
time="2019-03-30T07:40:01Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 63% complete (Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, OAuthClient console, Role openshift-image-registry/prometheus-k8s, ServiceMonitor openshift-kube-controller-manager-operator/kube-controller-manager-operator, ServiceMonitor openshift-kube-apiserver-operator/kube-apiserver-operator, ServiceAccount openshift-monitoring/cluster-monitoring-operator, ServiceMonitor openshift-controller-manager-operator/openshift-controller-manager-operator, ServiceMonitor openshift-kube-scheduler-operator/kube-scheduler-operator, ServiceAccount openshift-cluster-samples-operator/cluster-samples-operator, ServiceMonitor openshift-apiserver-operator/openshift-apiserver-operator, ConfigMap openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-config, ServiceAccount openshift-cluster-node-tuning-operator/cluster-node-tuning-operator, ConfigMap openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator-config, RoleBinding openshift-cluster-storage-operator/cluster-storage-operator, ConfigMap openshift-authentication-operator/openshift-authentication-operator-config, ServiceAccount openshift-marketplace/marketplace-operator, ClusterOperator machine-config, ClusterOperator kube-apiserver, ClusterOperator kube-controller-manager, ClusterOperator kube-scheduler, ClusterOperator dns, ClusterOperator openshift-apiserver, ClusterOperator openshift-controller-manager, ClusterOperator cluster-autoscaler, ClusterOperator service-ca, ClusterOperator cloud-credential, RoleBinding openshift-ingress-operator/ingress-operator, ServiceAccount openshift-image-registry/cluster-image-registry-operator, ClusterOperator machine-api)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 2% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CustomResourceDefinition credentialsrequests.cloudcredential.openshift.io, Role openshift-kube-scheduler-operator/prometheus-k8s, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-machine-config-operator, ClusterRole openshift-ingress-operator, Namespace openshift-monitoring, Namespace openshift-cluster-node-tuning-operator, CustomResourceDefinition configs.samples.operator.openshift.io, Namespace openshift-service-catalog-apiserver-operator, Namespace openshift-service-catalog-controller-manager-operator, Namespace openshift-cluster-version, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, ClusterRole registry-monitoring, Role openshift-controller-manager-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-operators, CustomResourceDefinition kubeapiservers.operator.openshift.io, Namespace openshift-image-registry)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 6% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-image-registry, ClusterRoleBinding cluster-storage-operator, Namespace openshift-cloud-credential-operator, CustomResourceDefinition mcoconfigs.machineconfiguration.openshift.io, CustomResourceDefinition ingresscontrollers.operator.openshift.io, Namespace openshift-cluster-samples-operator, CustomResourceDefinition servicecatalogapiservers.operator.openshift.io, ClusterRole cluster-monitoring-operator, CustomResourceDefinition tuneds.tuned.openshift.io, ClusterRoleBinding registry-monitoring, RoleBinding openshift-kube-scheduler-operator/prometheus-k8s, CustomResourceDefinition clusteroperators.config.openshift.io, CustomResourceDefinition servicecatalogcontrollermanagers.operator.openshift.io, Service openshift-kube-apiserver-operator/metrics, ClusterRole system:controller:operator-lifecycle-manager)"
...
time="2019-03-30T07:44:41Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:13Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:26Z" level=debug msg="Cluster is initialized"

On the one hand, it's either debug-level, where it's ok to be chatty, or a timeout error, where we don't want to remove whatever the critical, blocking task was. But maybe we want to truncate non-errors somewhat client-side (helped by the fact that the longest in-progress tasks are first in the list)? Is there a way to provide this as structured data vs. my current long string so the web console, etc. can make a slick UX for this?
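
For example, a hypothetical client-side truncation could be as small as this (truncateTasks is an invented name, not code from this PR):

package main

import (
    "fmt"
    "strings"
)

// truncateTasks keeps the first max tasks (the longest in-progress
// ones, since they sort first) and summarizes the remainder.
func truncateTasks(tasks []string, max int) string {
    if len(tasks) <= max {
        return strings.Join(tasks, ", ")
    }
    return fmt.Sprintf("%s, and %d more", strings.Join(tasks[:max], ", "), len(tasks)-max)
}

func main() {
    tasks := []string{"ClusterOperator authentication", "ClusterOperator machine-config", "OAuthClient console"}
    fmt.Println(truncateTasks(tasks, 2))
    // ClusterOperator authentication, ClusterOperator machine-config, and 1 more
}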

I expect the less-parallel upgrades will still have long entries, but they should be more manageable than these super-parallel install entries.

@openshift-ci-robot (Contributor)

@wking: you cannot LGTM your own PR.


In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

case status.Fraction > 0:
    message = fmt.Sprintf("Working towards %s: %.0f%% complete", version, status.Fraction*100)
    tasks := make([]string, 0, len(status.Current))
    for _, task := range status.Current {
Contributor:

This is going to be too large, and I don't want to show resource names here.

Member Author:

This is going to be too large...

Can you think of a way to attach this as structured data, so the consumer can make the verbosity/rendering calls?

    for _, task := range status.Current {
        tasks = append(tasks, task.KindName())
    }
    message = fmt.Sprintf("Working towards %s: %.0f%% complete (%s)", version, status.Fraction*100, strings.Join(tasks, ", "))
Contributor:

The only thing that should ever show up in this message is 1-2 operator names, that's it.

Contributor:

I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

Member Author:

The only thing that should ever show up in this message is 1-2 operator names, that's it.

So what to show when you're, say, blocked on a RoleBinding? Collapse to the namespace?

Member Author (@wking, Apr 1, 2019):

The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

The issue here is that the installer doesn't know about sync periods, and wants to know what was blocking the CVO when the installer decided to give up on waiting. If the CVO starts right in on another sync cycle (which it does, right?), it seems very unlikely that the last message received by the installer would include blockers, unless we include them in every message.

Contributor:

I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

This or the namespace is sufficient to find the next level of debugging. We have enough operators that indicating which one the CVO is stuck on is pretty useful. In cases where we aren't currently waiting on a CO, knowing the namespace is good enough for most purposes.

@smarterclayton (Contributor)

/hold

I see people getting antsy

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2019
@abhinavdahiya (Contributor)

When the CVO reports "98% complete" and similar, that's nice, but it would be useful to know what it was still working on. For example, this would provide at least a starting point for figuring out what component(s) were blocking an upgrade or install.

Controllers in k8s do not include info for every step in their status objects. Controllers like the kubelet, the deployment controller, etc. use events to provide information about steps: the kubelet adds events like "image pulled", "started", or "failed due to waiting for volume", and the deployment controller adds "scaled new replicaset to 1", "scaling old replicaset to 0", and so on.

So including what the CVO is doing in status is not correct, IMO.
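
A sketch of that events-based approach with client-go's EventRecorder (the wiring and names here are assumptions, not CVO code):

package progress

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// reportTask is a hypothetical helper: it emits one Event per notable
// sync step instead of folding every in-progress task into status.
func reportTask(recorder record.EventRecorder, obj runtime.Object, task string) {
    recorder.Eventf(obj, corev1.EventTypeNormal, "SyncProgress", "working on %s", task)
}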


for i, tsk := range r.status.Current {
    if tsk == task {
        copy(r.status.Current[i:], r.status.Current[i+1:])
Contributor:

This is not great :(

Member Author:

This is not great :(

This is the recommended way to delete from a slice of pointers while preserving ordering, per the Go wiki's SliceTricks page. Did you have an alternative you prefer?
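
For reference, a self-contained sketch of that recipe as used here (finishTask and Task are simplified stand-ins for the PR's FinishTask and payload.Task):

package main

import "fmt"

type Task struct{ Name string }

// finishTask removes done from current while preserving order, per the
// SliceTricks "Delete" recipe; the trailing slot is nil'd so the GC can
// collect the pointer that copy duplicated into it.
func finishTask(current []*Task, done *Task) []*Task {
    for i, tsk := range current {
        if tsk == done {
            copy(current[i:], current[i+1:])
            current[len(current)-1] = nil
            return current[:len(current)-1]
        }
    }
    return current
}

func main() {
    a, b, c := &Task{"a"}, &Task{"b"}, &Task{"c"}
    for _, t := range finishTask([]*Task{a, b, c}, b) {
        fmt.Println(t.Name) // a, then c: order preserved
    }
}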

@wking (Member Author) commented Apr 1, 2019

So including what the CVO is doing in status is not correct, IMO.

So why even include percent complete? Some short string like the current percent complete is useful for raw oc get ... output and similar, but I think both the installer and web-console upgrade UI would benefit from more-detailed, structured status information. If that's a separate call to some health-check or status endpoint, that's fine with me. Or is this not possible, and we need the installer and web-console to reach around the CVO and look at ClusterOperator and similar directly when they want more details?
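
For concreteness, one hypothetical structured shape (not an agreed API) that would let consumers make those calls:

package status

// ProgressStatus is a hypothetical structured alternative to the long
// human-readable string; consumers like the installer or web console
// could truncate or render the task list themselves.
type ProgressStatus struct {
    Version  string   `json:"version"`
    Fraction float64  `json:"fraction"`          // 0.0 through 1.0
    Current  []string `json:"current,omitempty"` // e.g. "ClusterOperator authentication"
}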

@sdodson (Member) commented Apr 3, 2019

The main thing I want to avoid is the Installer / Installer team having to triage every cluster installation failure. Either we need to report which operators weren't successful or make the next debugging steps clear enough that we help the user pinpoint which component is failing themselves.

@smarterclayton (Contributor)

See #158 for an alternative that, while slightly more complex, focuses on trying to hide details at a lower level. As a bonus, it improves failure messages when multiple payloads are failing.

If the installer today isn't printing the Failing condition when the CVO is reporting it, we should also change the output. That will display a more detailed message than today.

@wking (Member Author) commented Apr 9, 2019

If the installer today isn't printing the Failing condition when the CVO is reporting it, we should also change the output.

We prefer the Failing message and fall back on Progressing when Failing is not set (code).
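
Roughly, that preference logic looks like this sketch (the local cond type is illustrative; the installer uses the openshift/api condition types):

package main

import "fmt"

// cond mirrors the shape of ClusterVersion status conditions;
// the names here are simplified stand-ins.
type cond struct{ Type, Status, Message string }

// waitMessage prefers the Failing message when that condition is set,
// and falls back on Progressing otherwise.
func waitMessage(conds []cond) string {
    for _, c := range conds {
        if c.Type == "Failing" && c.Status == "True" {
            return c.Message
        }
    }
    for _, c := range conds {
        if c.Type == "Progressing" {
            return c.Message
        }
    }
    return ""
}

func main() {
    fmt.Println(waitMessage([]cond{
        {Type: "Progressing", Status: "True", Message: "Working towards 0.0.1-2019-03-30-071955: 99% complete"},
    }))
}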

@wking (Member Author) commented Apr 11, 2019

Obsoleted by #158.

/close

@openshift-ci-robot (Contributor)

@wking: Closed this PR.


In response to this:

Obsoleted by #158.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member Author) commented Mar 13, 2020

Also in this space: rhbz#1768255.
