pkg/cvo: Report current tasks #152
Conversation
When the CVO reports "98% complete" and similar, that's nice, but it would be useful to know what it was still working on [1]. For example, this would provide at least a starting point for figuring out what component(s) were blocking an upgrade or install.

The delete-pointer-while-preserving-order lines in FinishTask() are from the "Delete" trick in [2]:

    copy(a[i:], a[i+1:])
    a[len(a)-1] = nil // or the zero value of T
    a = a[:len(a)-1]

The reflect.DeepEqual addition avoids:

    ./sync_worker_test.go:73:13: invalid operation: evt != tt.next (struct containing []*payload.Task cannot be compared)

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1691513#c17
[2]: https://github.com/golang/go/wiki/SliceTricks
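To make the test error concrete, here is a minimal, self-contained sketch of the kind of comparison reflect.DeepEqual enables; the Task and workerStatus types below are illustrative stand-ins, not the PR's actual payload.Task or worker-status definitions:

```go
package main

import (
	"fmt"
	"reflect"
)

// Task and workerStatus are illustrative stand-ins for the PR's types.
type Task struct {
	Kind string
	Name string
}

type workerStatus struct {
	Fraction float32
	Current  []*Task // a slice field makes the struct non-comparable with ==
}

func main() {
	got := workerStatus{Fraction: 0.5, Current: []*Task{{Kind: "ClusterOperator", Name: "authentication"}}}
	want := workerStatus{Fraction: 0.5, Current: []*Task{{Kind: "ClusterOperator", Name: "authentication"}}}

	// `got != want` would not compile once Current is added, so the test
	// switches to reflect.DeepEqual, which follows pointers and slices.
	if !reflect.DeepEqual(got, want) {
		fmt.Println("statuses differ")
	} else {
		fmt.Println("statuses match")
	}
}
```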
This gets long:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/152/pull-ci-openshift-cluster-version-operator-master-e2e-aws/535/artifacts/e2e-aws/installer/.openshift_install.log | grep -A1 '% complete'
time="2019-03-30T07:40:01Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 63% complete (Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, OAuthClient console, Role openshift-image-registry/prometheus-k8s, ServiceMonitor openshift-kube-controller-manager-operator/kube-controller-manager-operator, ServiceMonitor openshift-kube-apiserver-operator/kube-apiserver-operator, ServiceAccount openshift-monitoring/cluster-monitoring-operator, ServiceMonitor openshift-controller-manager-operator/openshift-controller-manager-operator, ServiceMonitor openshift-kube-scheduler-operator/kube-scheduler-operator, ServiceAccount openshift-cluster-samples-operator/cluster-samples-operator, ServiceMonitor openshift-apiserver-operator/openshift-apiserver-operator, ConfigMap openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-config, ServiceAccount openshift-cluster-node-tuning-operator/cluster-node-tuning-operator, ConfigMap openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator-config, RoleBinding openshift-cluster-storage-operator/cluster-storage-operator, ConfigMap openshift-authentication-operator/openshift-authentication-operator-config, ServiceAccount openshift-marketplace/marketplace-operator, ClusterOperator machine-config, ClusterOperator kube-apiserver, ClusterOperator kube-controller-manager, ClusterOperator kube-scheduler, ClusterOperator dns, ClusterOperator openshift-apiserver, ClusterOperator openshift-controller-manager, ClusterOperator cluster-autoscaler, ClusterOperator service-ca, ClusterOperator cloud-credential, RoleBinding openshift-ingress-operator/ingress-operator, ServiceAccount openshift-image-registry/cluster-image-registry-operator, ClusterOperator machine-api)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 2% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CustomResourceDefinition credentialsrequests.cloudcredential.openshift.io, Role openshift-kube-scheduler-operator/prometheus-k8s, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-machine-config-operator, ClusterRole openshift-ingress-operator, Namespace openshift-monitoring, Namespace openshift-cluster-node-tuning-operator, CustomResourceDefinition configs.samples.operator.openshift.io, Namespace openshift-service-catalog-apiserver-operator, Namespace openshift-service-catalog-controller-manager-operator, Namespace openshift-cluster-version, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, ClusterRole registry-monitoring, Role openshift-controller-manager-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-operators, CustomResourceDefinition kubeapiservers.operator.openshift.io, Namespace openshift-image-registry)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 6% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-image-registry, ClusterRoleBinding cluster-storage-operator, Namespace openshift-cloud-credential-operator, CustomResourceDefinition mcoconfigs.machineconfiguration.openshift.io, CustomResourceDefinition ingresscontrollers.operator.openshift.io, Namespace openshift-cluster-samples-operator, CustomResourceDefinition servicecatalogapiservers.operator.openshift.io, ClusterRole cluster-monitoring-operator, CustomResourceDefinition tuneds.tuned.openshift.io, ClusterRoleBinding registry-monitoring, RoleBinding openshift-kube-scheduler-operator/prometheus-k8s, CustomResourceDefinition clusteroperators.config.openshift.io, CustomResourceDefinition servicecatalogcontrollermanagers.operator.openshift.io, Service openshift-kube-apiserver-operator/metrics, ClusterRole system:controller:operator-lifecycle-manager)"
...
time="2019-03-30T07:44:41Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:13Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:26Z" level=debug msg="Cluster is initialized"On the one hand, it's either debug-level, where it's ok to be chatty, or a timeout error, where we don't want to remove whatever the critical, blocking task was. But maybe we want to truncate non-errors somewhat client-side (helped by the fact that the longest in-progress tasks are first in the list)? Is there a way to provide this as structured data vs. my current long string so the web console, etc. can make a slick UX for this? I expect the less-parallel upgrades will still have long entries, but they should be more manageable than these super-parallel install entries. |
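To make "truncate non-errors somewhat client-side" concrete, here is a rough sketch; the summarizeTasks helper and the maxTasks cutoff are my own illustrative assumptions, not code from this PR or the installer:

```go
package main

import (
	"fmt"
	"strings"
)

// summarizeTasks joins at most maxTasks task names into a short suffix,
// relying on the longest-running tasks being first in the list.
func summarizeTasks(tasks []string, maxTasks int) string {
	if len(tasks) == 0 {
		return ""
	}
	shown, extra := tasks, 0
	if len(tasks) > maxTasks {
		shown, extra = tasks[:maxTasks], len(tasks)-maxTasks
	}
	out := strings.Join(shown, ", ")
	if extra > 0 {
		out = fmt.Sprintf("%s, and %d more", out, extra)
	}
	return out
}

func main() {
	tasks := []string{
		"ClusterOperator machine-config",
		"ClusterOperator kube-apiserver",
		"Role openshift-image-registry/prometheus-k8s",
	}
	fmt.Println(summarizeTasks(tasks, 2))
	// ClusterOperator machine-config, ClusterOperator kube-apiserver, and 1 more
}
```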
@wking: you cannot LGTM your own PR.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
    case status.Fraction > 0:
        message = fmt.Sprintf("Working towards %s: %.0f%% complete", version, status.Fraction*100)
        tasks := make([]string, 0, len(status.Current))
        for _, task := range status.Current {
This is going to be too large, and I don't want to show resource names here.
This is going to be too large...
Can you think of a way to attach this as structured data, so the consumer can make the verbosity/rendering calls?
        for _, task := range status.Current {
            tasks = append(tasks, task.KindName())
        }
        message = fmt.Sprintf("Working towards %s: %.0f%% complete (%s)", version, status.Fraction*100, strings.Join(tasks, ", "))
The only thing that should ever show up in this message is 1-2 operator names, that's it.
I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period)
The only thing that should ever show up in this message is 1-2 operator names, that's it.
So what to show when you're, say, blocked on a RoleBinding? Collapse to the namespace?
The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period)
The issue here is that the installer doesn't know about sync periods, and wants to know what was blocking the CVO when the installer decided to give up on waiting. If the CVO starts right in on another sync cycle (which it does, right?), it seems very unlikely that the last message received by the installer would include blockers, unless we include them in every message.
I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period)
This or the namespace is sufficient to find the next level of debugging. We have enough operators that indicating which one the CVO is stuck on is pretty useful. In cases where we aren't currently waiting on a CO, knowing the namespace is good enough for most purposes.
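Purely as an illustration of "collapse to the namespace", here is one shape such a reduction could take; the blockerLabel helper and its rules are assumptions for the example, not something this PR implements:

```go
package main

import (
	"fmt"
	"strings"
)

// blockerLabel collapses a task identifier like
// "RoleBinding openshift-ingress-operator/ingress-operator" to either the
// ClusterOperator name or the namespace, which is usually enough to pick
// the next debugging step.
func blockerLabel(kindName string) string {
	parts := strings.SplitN(kindName, " ", 2)
	if len(parts) != 2 {
		return kindName
	}
	kind, name := parts[0], parts[1]
	if kind == "ClusterOperator" {
		return name // e.g. "authentication"
	}
	if i := strings.Index(name, "/"); i > 0 {
		return name[:i] // namespaced object: report just the namespace
	}
	return kindName // cluster-scoped, non-operator object: keep as-is
}

func main() {
	fmt.Println(blockerLabel("ClusterOperator authentication"))                     // authentication
	fmt.Println(blockerLabel("RoleBinding openshift-ingress-operator/ingress-operator")) // openshift-ingress-operator
	fmt.Println(blockerLabel("ClusterRole openshift-dns-operator"))                 // unchanged
}
```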
/hold

I see people getting antsy
Controllers in k8s do not include info for all steps in their status objects. Controllers like the kubelet, the deployment controller, etc. use events to provide information about steps: the kubelet adds events like image pulled, started, or failed due to waiting for a volume; the deployment controller adds events like scaled new ReplicaSet to 1, scaling old ReplicaSet to 0, and so on. So including what the CVO is doing in status is not correct, IMO.
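For reference, event-based step reporting with client-go typically looks roughly like the sketch below; the component name, reason string, and the choice to attach events to the ClusterVersion object are assumptions for illustration, not anything the CVO currently does:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires an event recorder that writes Events through the given
// client, the same mechanism the kubelet and deployment controller use to
// report per-step progress.
func newRecorder(client kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "cluster-version-operator"})
}

// reportTask emits a Normal event for the given task, attached to a
// well-known object; an ObjectReference is enough for the recorder to fill
// in the Event's involvedObject.
func reportTask(recorder record.EventRecorder, task string) {
	ref := &corev1.ObjectReference{
		APIVersion: "config.openshift.io/v1",
		Kind:       "ClusterVersion",
		Name:       "version",
	}
	recorder.Eventf(ref, corev1.EventTypeNormal, "ApplyingManifest", "Working on %s", task)
}

func main() {
	// A fake clientset keeps the sketch runnable without a cluster; events
	// are recorded asynchronously by the broadcaster.
	client := fake.NewSimpleClientset()
	recorder := newRecorder(client)
	reportTask(recorder, "ClusterOperator authentication")
}
```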
    for i, tsk := range r.status.Current {
        if tsk == task {
            copy(r.status.Current[i:], r.status.Current[i+1:])
This is not great :(
This is not great :(
This is the recommended way to delete from a slice of pointers while preserving ordering, per this. Did you have an alternative you prefer?
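To show the trick in context, here is a self-contained sketch of an order-preserving delete like the one in FinishTask(); the workerStatus and task types are simplified stand-ins for the PR's actual types:

```go
package main

import "fmt"

type task struct{ kind, name string }

type workerStatus struct {
	Current []*task
}

// finishTask removes the first matching pointer from Current while keeping
// the remaining tasks in order, and nils the vacated slot so the pointer can
// be garbage collected (the SliceTricks "Delete" pattern).
func (s *workerStatus) finishTask(t *task) {
	for i, tsk := range s.Current {
		if tsk == t {
			copy(s.Current[i:], s.Current[i+1:])
			s.Current[len(s.Current)-1] = nil
			s.Current = s.Current[:len(s.Current)-1]
			return
		}
	}
}

func main() {
	a := &task{"ClusterOperator", "authentication"}
	b := &task{"ClusterOperator", "machine-config"}
	c := &task{"Namespace", "openshift-monitoring"}
	s := &workerStatus{Current: []*task{a, b, c}}
	s.finishTask(b)
	for _, t := range s.Current {
		fmt.Println(t.kind, t.name) // order of a and c is preserved
	}
}
```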
So why even include percent complete? Some short string like the current percent complete is useful for raw …
The main thing I want to avoid is the Installer / Installer team having to triage every cluster installation failure. Either we need to report which operators weren't successful, or make the next debugging steps clear enough that we help the user pinpoint which component is failing themselves.
See #158 for an alternative that, while slightly more complex, focuses on trying to hide details at a lower level. As a bonus, it improves failure messages when multiple payloads are failing. If the installer today isn't printing the Failing condition when the CVO is reporting it, we should also change the output. That will display a more detailed message than today.
We prefer the Failing message and fall back on Progressing when Failing is not set (code).
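As a sketch of that preference (the condition-type strings and helper shape are assumptions drawn from the sentence above, not the installer's actual code):

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// waitMessage prefers the Failing condition's message and falls back on
// Progressing when Failing is not set. The condition-type names here are the
// pre-4.1 "Failing"/"Progressing" strings.
func waitMessage(conditions []configv1.ClusterOperatorStatusCondition) string {
	var progressing string
	for _, c := range conditions {
		if c.Status != configv1.ConditionTrue {
			continue
		}
		switch c.Type {
		case "Failing":
			if c.Message != "" {
				return c.Message
			}
		case "Progressing":
			progressing = c.Message
		}
	}
	return progressing
}

func main() {
	conditions := []configv1.ClusterOperatorStatusCondition{
		{Type: "Progressing", Status: configv1.ConditionTrue, Message: "Working towards 0.0.1: 99% complete"},
		{Type: "Failing", Status: configv1.ConditionTrue, Message: "Cluster operator authentication is still updating"},
	}
	fmt.Println(waitMessage(conditions)) // prints the Failing message
}
```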
Obsoleted by #158.

/close
@wking: Closed this PR.
Also in this space: rhbz#1768255.