
Conversation

@wking (Member) commented Mar 30, 2019

When the CVO reports "98% complete" and similar, that's nice, but it would be useful to know what it was still working on (see https://bugzilla.redhat.com/show_bug.cgi?id=1691513#c17). For example, this would provide at least a starting point for figuring out what component(s) were blocking an upgrade or install.

The delete-pointer-while-preserving-order lines in FinishTask() are the Delete recipe from the Go wiki's SliceTricks page (https://github.com/golang/go/wiki/SliceTricks): copy(a[i:], a[i+1:]); a[len(a)-1] = nil; a = a[:len(a)-1].

The reflect.DeepEqual addition avoids:

./sync_worker_test.go:73:13: invalid operation: evt != tt.next (struct containing []*payload.Task cannot be compared)
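
For illustration, here is a minimal standalone program reproducing that comparability issue; Task and SyncWorkerStatus are simplified stand-ins for the real payload.Task and status types:

package main

import (
    "fmt"
    "reflect"
)

// Task stands in for payload.Task; the fields are illustrative.
type Task struct{ Kind, Name string }

// SyncWorkerStatus mirrors the relevant shape: the []*Task field
// makes values of this struct non-comparable with == or !=.
type SyncWorkerStatus struct {
    Fraction float64
    Current  []*Task
}

func main() {
    a := SyncWorkerStatus{Fraction: 0.63, Current: []*Task{{"Role", "prometheus-k8s"}}}
    b := SyncWorkerStatus{Fraction: 0.63, Current: []*Task{{"Role", "prometheus-k8s"}}}
    // fmt.Println(a != b) // compile error: struct containing []*Task cannot be compared
    fmt.Println(reflect.DeepEqual(a, b)) // true: DeepEqual compares through slices and pointers
}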

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2019
@wking force-pushed the report-current-tasks branch from 9ceecc0 to bd00466 on March 30, 2019 07:17
@wking (Member Author) commented Mar 30, 2019

This gets long:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/152/pull-ci-openshift-cluster-version-operator-master-e2e-aws/535/artifacts/e2e-aws/installer/.openshift_install.log | grep -A1 '% complete'
time="2019-03-30T07:40:01Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 63% complete (Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, OAuthClient console, Role openshift-image-registry/prometheus-k8s, ServiceMonitor openshift-kube-controller-manager-operator/kube-controller-manager-operator, ServiceMonitor openshift-kube-apiserver-operator/kube-apiserver-operator, ServiceAccount openshift-monitoring/cluster-monitoring-operator, ServiceMonitor openshift-controller-manager-operator/openshift-controller-manager-operator, ServiceMonitor openshift-kube-scheduler-operator/kube-scheduler-operator, ServiceAccount openshift-cluster-samples-operator/cluster-samples-operator, ServiceMonitor openshift-apiserver-operator/openshift-apiserver-operator, ConfigMap openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-config, ServiceAccount openshift-cluster-node-tuning-operator/cluster-node-tuning-operator, ConfigMap openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator-config, RoleBinding openshift-cluster-storage-operator/cluster-storage-operator, ConfigMap openshift-authentication-operator/openshift-authentication-operator-config, ServiceAccount openshift-marketplace/marketplace-operator, ClusterOperator machine-config, ClusterOperator kube-apiserver, ClusterOperator kube-controller-manager, ClusterOperator kube-scheduler, ClusterOperator dns, ClusterOperator openshift-apiserver, ClusterOperator openshift-controller-manager, ClusterOperator cluster-autoscaler, ClusterOperator service-ca, ClusterOperator cloud-credential, RoleBinding openshift-ingress-operator/ingress-operator, ServiceAccount openshift-image-registry/cluster-image-registry-operator, ClusterOperator machine-api)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 2% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CustomResourceDefinition credentialsrequests.cloudcredential.openshift.io, Role openshift-kube-scheduler-operator/prometheus-k8s, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-machine-config-operator, ClusterRole openshift-ingress-operator, Namespace openshift-monitoring, Namespace openshift-cluster-node-tuning-operator, CustomResourceDefinition configs.samples.operator.openshift.io, Namespace openshift-service-catalog-apiserver-operator, Namespace openshift-service-catalog-controller-manager-operator, Namespace openshift-cluster-version, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, ClusterRole registry-monitoring, Role openshift-controller-manager-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-operators, CustomResourceDefinition kubeapiservers.operator.openshift.io, Namespace openshift-image-registry)"
time="2019-03-30T07:41:11Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 6% complete (Namespace openshift-kube-scheduler-operator, Namespace openshift-controller-manager-operator, Role openshift-kube-controller-manager-operator/prometheus-k8s, Role openshift-kube-apiserver-operator/prometheus-k8s, Service openshift-cluster-version/cluster-version-operator, Role openshift-service-catalog-controller-manager-operator/prometheus-k8s, Role openshift-apiserver-operator/prometheus-k8s, CustomResourceDefinition apiservers.config.openshift.io, Namespace openshift-network-operator, Namespace openshift-cluster-machine-approver, Deployment openshift-cluster-machine-approver/machine-approver, Namespace openshift-machine-api, ClusterRole system:openshift:controller:machine-approver, CustomResourceDefinition consoles.operator.openshift.io, ClusterRoleBinding system:openshift:operator:service-ca-operator, ClusterRoleBinding system:openshift:operator:authentication, CredentialsRequest openshift-cloud-credential-operator/openshift-machine-api, Namespace openshift-marketplace, Role openshift-service-catalog-apiserver-operator/prometheus-k8s, Namespace openshift-kube-controller-manager-operator, ClusterRoleBinding system:openshift:controller:machine-approver, Namespace openshift-apiserver-operator, ClusterRole openshift-dns-operator, Namespace openshift-image-registry, ClusterRoleBinding cluster-storage-operator, Namespace openshift-cloud-credential-operator, CustomResourceDefinition mcoconfigs.machineconfiguration.openshift.io, CustomResourceDefinition ingresscontrollers.operator.openshift.io, Namespace openshift-cluster-samples-operator, CustomResourceDefinition servicecatalogapiservers.operator.openshift.io, ClusterRole cluster-monitoring-operator, CustomResourceDefinition tuneds.tuned.openshift.io, ClusterRoleBinding registry-monitoring, RoleBinding openshift-kube-scheduler-operator/prometheus-k8s, CustomResourceDefinition clusteroperators.config.openshift.io, CustomResourceDefinition servicecatalogcontrollermanagers.operator.openshift.io, Service openshift-kube-apiserver-operator/metrics, ClusterRole system:controller:operator-lifecycle-manager)"
...
time="2019-03-30T07:44:41Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:13Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 0.0.1-2019-03-30-071955: 99% complete (ClusterOperator authentication)"
time="2019-03-30T07:46:26Z" level=debug msg="Cluster is initialized"

On the one hand, it's either debug-level, where it's ok to be chatty, or a timeout error, where we don't want to remove whatever the critical, blocking task was. But maybe we want to truncate non-errors somewhat client-side (helped by the fact that the longest in-progress tasks are first in the list)? Is there a way to provide this as structured data vs. my current long string so the web console, etc. can make a slick UX for this?
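
For example, a hypothetical client-side truncation could be as small as this (truncateTasks is an invented name, not code from this PR):

package main

import (
    "fmt"
    "strings"
)

// truncateTasks keeps the first max tasks (the longest in-progress
// ones, since they sort first) and summarizes the remainder.
func truncateTasks(tasks []string, max int) string {
    if len(tasks) <= max {
        return strings.Join(tasks, ", ")
    }
    return fmt.Sprintf("%s, and %d more", strings.Join(tasks[:max], ", "), len(tasks)-max)
}

func main() {
    tasks := []string{"ClusterOperator authentication", "ClusterOperator machine-config", "OAuthClient console"}
    fmt.Println(truncateTasks(tasks, 2))
    // ClusterOperator authentication, ClusterOperator machine-config, and 1 more
}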

I expect the less-parallel upgrades will still have long entries, but they should be more manageable than these super-parallel install entries.

@openshift-ci-robot (Contributor)

@wking: you cannot LGTM your own PR.


In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

case status.Fraction > 0:
    message = fmt.Sprintf("Working towards %s: %.0f%% complete", version, status.Fraction*100)
    tasks := make([]string, 0, len(status.Current))
    for _, task := range status.Current {
Contributor:

This is going to be too large, and I don't want to show resource names here.

Member Author:

This is going to be too large...

Can you think of a way to attach this as structured data, so the consumer can make the verbosity/rendering calls?

    for _, task := range status.Current {
        tasks = append(tasks, task.KindName())
    }
    message = fmt.Sprintf("Working towards %s: %.0f%% complete (%s)", version, status.Fraction*100, strings.Join(tasks, ", "))
Contributor:

The only thing that should ever show up in this message is 1-2 operator names, that's it.

Contributor:

I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

Member Author:

The only thing that should ever show up in this message is 1-2 operator names, that's it.

So what to show when you're, say, blocked on a RoleBinding? Collapse to the namespace?

Member Author (@wking, Apr 1, 2019):

The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

The issue here is that the installer doesn't know about sync periods, and wants to know what was blocking the CVO when the installer decided to give up on waiting. If the CVO starts right in on another sync cycle (which it does, right?), it seems very unlikely that the last message received by the installer would include blockers, unless we include them in every message.

Contributor:

I don't want to keep showing operator names either. The only time we should have anything other than a % is when a specific operator has taken longer than a certain amount of time (has completed a sync period).

This or the namespace is sufficient to find the next level of debugging. We have enough operators that indicating which one the CVO is stuck on is pretty useful. In cases where we aren't currently waiting on a CO, knowing the namespace is good enough for most purposes.

@smarterclayton (Contributor)

/hold

I see people getting antsy

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2019
@abhinavdahiya (Contributor)

When the CVO reports "98% complete" and similar, that's nice, but it would be useful to know what it was still working on. For example, this would provide at least a starting point for figuring out what component(s) were blocking an upgrade or install.

Controllers in k8s do not include info for every step in their status objects. Controllers like the kubelet, the deployment controller, etc. use events to provide information about steps: the kubelet adds events like "image pulled", "started", or "failed due to waiting for volume", and the deployment controller adds "scaled new replicaset to 1", "scaling old replicaset to 0", and so on.

So including what the CVO is doing in status is not correct, IMO.
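
A sketch of that events-based approach with client-go's EventRecorder (the wiring and names here are assumptions, not CVO code):

package progress

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// reportTask is a hypothetical helper: it emits one Event per notable
// sync step instead of folding every in-progress task into status.
func reportTask(recorder record.EventRecorder, obj runtime.Object, task string) {
    recorder.Eventf(obj, corev1.EventTypeNormal, "SyncProgress", "working on %s", task)
}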


for i, tsk := range r.status.Current {
    if tsk == task {
        copy(r.status.Current[i:], r.status.Current[i+1:])
Contributor:

This is not great :(

Member Author:

This is not great :(

This is the recommended way to delete from a slice of pointers while preserving ordering, per the Go wiki's SliceTricks page. Did you have an alternative you prefer?
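
For reference, a self-contained sketch of that recipe as used here (finishTask and Task are simplified stand-ins for the PR's FinishTask and payload.Task):

package main

import "fmt"

type Task struct{ Name string }

// finishTask removes done from current while preserving order, per the
// SliceTricks "Delete" recipe; the trailing slot is nil'd so the GC can
// collect the pointer that copy duplicated into it.
func finishTask(current []*Task, done *Task) []*Task {
    for i, tsk := range current {
        if tsk == done {
            copy(current[i:], current[i+1:])
            current[len(current)-1] = nil
            return current[:len(current)-1]
        }
    }
    return current
}

func main() {
    a, b, c := &Task{"a"}, &Task{"b"}, &Task{"c"}
    for _, t := range finishTask([]*Task{a, b, c}, b) {
        fmt.Println(t.Name) // a, then c: order preserved
    }
}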

@wking (Member Author) commented Apr 1, 2019

So including what the CVO is doing in status is not correct, IMO.

So why even include percent complete? Some short string like the current percent complete is useful for raw oc get ... output and similar, but I think both the installer and web-console upgrade UI would benefit from more-detailed, structured status information. If that's a separate call to some health-check or status endpoint, that's fine with me. Or is this not possible, and we need the installer and web-console to reach around the CVO and look at ClusterOperator and similar directly when they want more details?
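
For concreteness, one hypothetical structured shape (not an agreed API) that would let consumers make those calls:

package status

// ProgressStatus is a hypothetical structured alternative to the long
// human-readable string; consumers like the installer or web console
// could truncate or render the task list themselves.
type ProgressStatus struct {
    Version  string   `json:"version"`
    Fraction float64  `json:"fraction"`          // 0.0 through 1.0
    Current  []string `json:"current,omitempty"` // e.g. "ClusterOperator authentication"
}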

@sdodson (Member) commented Apr 3, 2019

The main thing I want to avoid is the Installer / Installer team having to triage every cluster installation failure. Either we need to report which operators weren't successful or make the next debugging steps clear enough that we help the user pinpoint which component is failing themselves.

@smarterclayton (Contributor)

See #158 for an alternative that, while slightly more complex, focuses on trying to hide details at a lower level. As a bonus, it improves failure messages when multiple payloads are failing.

If the installer today isn't printing the Failing condition when the CVO is reporting it, we should also change the output. That will display a more detailed message than today.

@wking (Member Author) commented Apr 9, 2019

If the installer today isn't printing the Failing condition when the CVO is reporting it, we should also change the output.

We prefer the Failing message and fall back on Progressing when Failing is not set (code).
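
Roughly, that preference logic looks like this sketch (the local cond type is illustrative; the installer uses the openshift/api condition types):

package main

import "fmt"

// cond mirrors the shape of ClusterVersion status conditions;
// the names here are simplified stand-ins.
type cond struct{ Type, Status, Message string }

// waitMessage prefers the Failing message when that condition is set,
// and falls back on Progressing otherwise.
func waitMessage(conds []cond) string {
    for _, c := range conds {
        if c.Type == "Failing" && c.Status == "True" {
            return c.Message
        }
    }
    for _, c := range conds {
        if c.Type == "Progressing" {
            return c.Message
        }
    }
    return ""
}

func main() {
    fmt.Println(waitMessage([]cond{
        {Type: "Progressing", Status: "True", Message: "Working towards 0.0.1-2019-03-30-071955: 99% complete"},
    }))
}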

@wking (Member Author) commented Apr 11, 2019

Obsoleted by #158.

/close

@openshift-ci-robot (Contributor)

@wking: Closed this PR.


In response to this:

Obsoleted by #158.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member Author) commented Mar 13, 2020

Also in this space: rhbz#1768255.
