
Conversation

@wking (Member) commented Aug 27, 2020

Picking #349 and #424 back to 4.5 so we release the leader lock when we exit, which in turn allows the incoming CVO pod to pick up the leader lock and update the ClusterVersion status in a timely manner.

A number of mostly-minor manual tweaks to adjust to 4.5, which is missing #358's TLS metrics and #406 and #410's Context changes.

wking and others added 18 commits June 3, 2020 13:15
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

[1]: https://golang.org/pkg/context/#Context
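
As a minimal sketch of the standard-library behavior this relies on (not
the CVO's actual helper; the step counts are illustrative):

  package main

  import (
      "context"
      "fmt"
      "time"
  )

  // contextError mirrors the behavior the new messages rely on: ctx.Err()
  // is nil while the context is live, context.Canceled after cancel(), and
  // context.DeadlineExceeded once a WithTimeout deadline passes.
  func contextError(ctx context.Context, step, total int) error {
      if err := ctx.Err(); err != nil {
          return fmt.Errorf("update %v at %d of %d", err, step, total)
      }
      return nil
  }

  func main() {
      ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
      defer cancel()
      <-ctx.Done()
      fmt.Println(contextError(ctx, 1, 2)) // update context deadline exceeded at 1 of 2
  }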
… as success

For [1].  Before this commit, you could have a flow like:

1. SyncWorker.Start()
2. External code calls Update(), e.g. after noticing a ClusterVersion
   spec.desiredUpdate change.
3. Update sets w.work to point at the desired payload.
4. Start's Until loop is triggered via w.notify.
5. Start calls calculateNext, which notices the change and sets the
   state to UpdatingPayload.
6. Start calculates a syncTimeout and calls syncOnce.
7. syncOnce notices the new payload and loads it.
8. For whatever reason, payload retrieval takes a while.  Blackholed
   signature-fetch attempts in a restricted network [2] are one
   example of something that could be slow here.  Eventually the
   syncTimeout kicks in and signature fetching or other verification
   is canceled (which counts as failed verification).
9. Force is true, so syncOnce logs "Forcing past precondition
   failures..." but carries on to call apply.
10. apply computes the manifest graph, runs the ClusterOperator
    precreation (whose handlers return ContextError() right after
    spawning, because the context is already expired), and runs the
    main RunGraph (whose handlers do the same).
11. The main RunGraph returns a slice with a handful of context
    errors and nothing else.  apply passes this on to
    consistentReporter.Errors.
12. consistentReporter.Errors calls summarizeTaskGraphErrors, which
    logs "All errors were context errors..." and returns nil to avoid
    alarming consistentReporter.Errors (we don't want to put this in
    our ClusterVersion status and alarm users).
13. apply returns the summarized nil to syncOnce.
14. syncOnce returns the summarized nil to Start.
15. Start logs "Sync succeeded..." and flops into ReconcilingPayload
    for the next round.
16. Start comes into the next round in reconciling mode despite never
    having attempted to apply any manifests in its Updating mode.  The
    manifest graph gets flattened and shuffled and all kinds of
    terrible things could happen like the machine-config trying to
    roll out the newer machine-os-content and its 4.4 hyperkube binary
    before rolling out prerequisites like the 4.4 kube-apiserver
    operator.

With this commit, the process is the same through 12, but ends with:

13. apply returns the first context error to syncOnce.
14. syncOnce returns that error to Start.
15. Start backs off and comes in again with a second attempt at
    UpdatingPayload.
16. Manifests get pushed in the intended order, and nothing explodes.

The race fixed in this commit could also have come up without timing
out the payload pull/verification, e.g. via a perfectly slow
ClusterOperator precondition.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1838497#c23
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1840343
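
A heavily simplified sketch of the behavior change (a hypothetical helper,
not the real summarizeTaskGraphErrors):

  // summarizeErrors is a hypothetical, stripped-down stand-in for the real
  // summarization logic (imports: "context", "errors", "fmt").  Non-context
  // errors are still wrapped and surfaced; a slice containing nothing but
  // context errors no longer collapses to nil, so Start retries
  // UpdatingPayload instead of flopping into ReconcilingPayload.
  func summarizeErrors(errs []error) error {
      if len(errs) == 0 {
          return nil
      }
      for _, err := range errs {
          if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
              return fmt.Errorf("%d errors occurred during the sync, the first was: %w", len(errs), errs[0])
          }
      }
      // All errors were context errors: keep them out of ClusterVersion
      // status, but fail the sync rather than reporting success.
      return errs[0]
  }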
…ry-pick-372-to-release-4.5

Bug 1843526: pkg/cvo/sync_worker: Do not treat "All errors were context errors..." as success
Before this change, only the initialDelaySeconds field of probes could be
updated. This patch expands the set of supported fields to include all the
other int32 fields of probes so that CVO will roll out such changes.
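
As a rough sketch of the idea (a hypothetical helper, not the actual CVO
merge code), the fields in question are the int32 fields of corev1.Probe:

  import corev1 "k8s.io/api/core/v1"

  // mergeProbeFields copies the numeric probe fields from the required
  // (manifest) probe onto the existing in-cluster probe, so that changes
  // to any of them, not only InitialDelaySeconds, get rolled out.
  func mergeProbeFields(existing, required *corev1.Probe) {
      existing.InitialDelaySeconds = required.InitialDelaySeconds
      existing.TimeoutSeconds = required.TimeoutSeconds
      existing.PeriodSeconds = required.PeriodSeconds
      existing.SuccessThreshold = required.SuccessThreshold
      existing.FailureThreshold = required.FailureThreshold
  }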
…ry-pick-383-to-release-4.5

[release-4.5] Bug 1848729: Expand supported set of probe field mutations
Using replace directive to force the 0.3.3 version

Signed-off-by: Lalatendu Mohanty <[email protected]>
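
The go.mod replace directive takes roughly this form (illustrative; the
rest of the module file is omitted):

  replace golang.org/x/text => golang.org/x/text v0.3.3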
Because every Failing=True condition should have a reason.

Also wordsmith the user-facing docs to replace "synchronize" with
"reconcile", because our merge logic is more nuanced than the complete
match that "synchronize" implies.

The ClusterOperatorNotAvailable special casing landed with
convertErrorToProgressing in c2ac20f (status: Report the operators
that have not yet deployed, 2019-04-09, openshift#158).
Reduce false-positives when operators take a while to level (like the
machine-config operator, which has to roll the control plane
machines).  We may want to raise this further in the future, but baby
steps ;).

The previous 10-minute value is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158), which doesn't
make a case for that specific value.  So the bump is unlikely to break
anything unexpected.
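
A hypothetical sketch of the special casing (names and structure are
illustrative, not the real convertErrorToProgressing):

  // While an update is in flight, ClusterOperatorNotAvailable errors are
  // reported as Progressing rather than Failing until the grace period
  // (now 20 minutes, previously 10) has elapsed.
  const clusterOperatorLevelingGracePeriod = 20 * time.Minute

  func progressingInsteadOfFailing(err error, updateStarted time.Time) bool {
      return isClusterOperatorNotAvailable(err) && // hypothetical predicate
          time.Since(updateStarted) < clusterOperatorLevelingGracePeriod
  }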
…imeout-4.5

Bug 1866480: pkg/cvo/status: Raise Operator leveling grace-period to 20 minutes
…ry-pick-408-to-release-4.5

Bug 1857682: [release-4.5] Updating the golang.org/x/text version to v0.3.3
We've had it since 2b81f47 (cvo: Release our leader lease when we
are gracefully terminated, 2019-01-16, openshift#87), but it's redundant
vs. "ListenAddr is not an empty string".

I'm also switching to:

  o.ListenAddr != ""

instead of:

  len(o.ListenAddr) > 0

because it seems slightly easier to understand, but obviously either
will work.

Cherry-picked from 07e5809 (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.
Somewhat like the example in [1].  This pushes the server management
down into a new RunMetrics method, which we then run in its own
goroutine.  This is initial groundwork; I expect we will port more of
our child goroutines to this framework in follow-up work.

Cherry-picked from b30aa0e (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.

[1]: https://golang.org/pkg/net/http/#Server.Shutdown
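
A minimal sketch of that pattern, assuming only the standard library (names
illustrative, not the CVO's actual RunMetrics; the shutdown window is
arbitrary):

  // Serve in a child goroutine; once ctx is done, give in-flight requests
  // a bounded window to finish via Server.Shutdown.
  // (imports: "context", "net/http", "time")
  func runMetricsServer(ctx context.Context, addr string, handler http.Handler) error {
      server := &http.Server{Addr: addr, Handler: handler}
      errCh := make(chan error, 1)
      go func() {
          errCh <- server.ListenAndServe() // always returns a non-nil error
      }()
      select {
      case err := <-errCh:
          return err
      case <-ctx.Done():
          shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
          defer cancel()
          return server.Shutdown(shutdownCtx)
      }
  }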
Pulling this up out of cvo.New() while working to decouple metrics
handling from the core CVO goroutine.
These are not just for available updates, they're also for downloading
signatures.  Placing them in a separate file makes it easier to focus
on the code that is specific to available updates.
So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].  Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, openshift#179), which
bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.
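
A rough sketch of that two-TERM handling (illustrative, not the exact CVO
code; runCancel is the internal cancel discussed below):

  // (imports: "os", "os/signal", "syscall")
  ch := make(chan os.Signal, 2)
  signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
  go func() {
      <-ch
      runCancel() // first signal: begin the graceful, bounded shutdown
      <-ch
      os.Exit(1) // second signal: exit immediately for the impatient
  }()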

Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown.  And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine:

* Informers are now launched before the lease controller, so they're
  up and running by the time we acquire the lease.  They remain
  running until the main operator CVO.Run() exits, after which we shut
  them down.  Having informers running before we have a lease is
  somewhat expensive in terms of API traffic, but we should rarely
  have two CVO pods competing for leadership since we transitioned to
  the Recreate Deployment strategy in 078686d
  (install/0000_00_cluster-version-operator_03_deployment: Set
  'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
  (install/0000_00_cluster-version-operator_03_deployment: Fix
  Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
  their internal goroutine's completion, but maybe informers will grow
  an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits,
  where previously we began gracefully shutting it down at the same
  time we started shutting down CVO.Run().  This ensures we are around
  and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits.  We
  don't want to release the lease while we're still controlling
  things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the
  passed-in context is canceled or we call runCancel internally
  (because of a TERM, error from a goroutine, or loss of leadership).
  These are the only two goroutines that are actually writing to the
  API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector.  I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called.  And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).

Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs.  Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry picked from cc1921d (openshift#424), around conflicts due to the lack
of TLS metrics and the Context pivots in 4.5.
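
For reference, the client-go ReleaseOnCancel mechanism described above
looks roughly like the following; the durations and callbacks here are
illustrative, not the CVO's actual configuration:

  // (package: k8s.io/client-go/tools/leaderelection)
  // With ReleaseOnCancel, cancelling ctx makes the leader-election
  // machinery empty the lock's holderIdentity on the way out, so the next
  // pod does not have to wait for the full lease to expire.
  leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
      Lock:            lock, // e.g. a resourcelock.ConfigMapLock
      ReleaseOnCancel: true,
      LeaseDuration:   90 * time.Second,
      RenewDeadline:   45 * time.Second,
      RetryPeriod:     30 * time.Second,
      Callbacks: leaderelection.LeaderCallbacks{
          OnStartedLeading: func(ctx context.Context) { /* run the operator */ },
          OnStoppedLeading: func() { /* trigger shutdown */ },
      },
  })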
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
    // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
    // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
    // attempt to acquire leases with empty identities and will wait for the full lease
    // interval to expire before attempting to reacquire. This value is set to empty when
    // a client voluntarily steps down.
    HolderIdentity       string      `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap
deletion was born with the test in 2b81f47 (cvo: Release our leader
lease when we are gracefully terminated, 2019-01-16, openshift#87).

Cherry picked from dd09c3f (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.
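
An illustrative check of that behavior (not the actual
TestIntegrationCVO_gracefulStepDown code; the client handle is an
assumption, while the openshift-cluster-version/version ConfigMap name
comes from the lease logs above):

  // After a graceful step-down the ConfigMap should still exist, with an
  // empty holderIdentity in the leader-election record annotation.
  // (imports: "encoding/json", metav1, resourcelock)
  cm, err := client.CoreV1().ConfigMaps("openshift-cluster-version").Get(ctx, "version", metav1.GetOptions{})
  if err != nil {
      t.Fatal(err)
  }
  var record resourcelock.LeaderElectionRecord
  if err := json.Unmarshal([]byte(cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]), &record); err != nil {
      t.Fatal(err)
  }
  if record.HolderIdentity != "" {
      t.Fatalf("lease still held by %q after graceful shutdown", record.HolderIdentity)
  }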
Clayton wants these in each goroutine we launch [1].  Obviously
there's no way to reach inside the informer Start()s and add it there.
I'm also adding this to the FIXME comment for rerolling the
auto-update worker goroutines; we'll get those straightened out in
future work.

Cherry picked from 9c42a92 (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.

[1]: openshift#424
Lala wanted the version included in the outgoing log line [1].  I'm
not sure why you'd be wondering which version of the CVO code was
running for that particular line, and not for other lines in the log,
but including the version there is easy enough.

While we're thinking about logging the CVO version, also remove the
useless %s formatting from the opening log line, because we don't need
to manipulate version.String at all.

[1]: openshift#424 (comment)
openshift-ci-robot added the bugzilla/severity-low and bugzilla/invalid-bug labels on Aug 27, 2020.
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1872906, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.5.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1872906:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@wking: PR needs rebase.



openshift-ci-robot added the needs-rebase label on Aug 27, 2020.
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label on Aug 27, 2020.
@wking (Member, Author) commented Aug 27, 2020

Oops, wrong target branch.

/close

@openshift-ci-robot (Contributor)

@wking: The following test failed, say /retest to rerun all failed tests:

Test name:     ci/prow/integration
Commit:        a42bfb7
Rerun command: /test integration




@openshift-ci-robot (Contributor)

@wking: Closed this PR.


In response to this:

Oops, wrong target branch.

/close


@wking changed the title from "Bug 1872906:" to "Bug 1872906: pkg/start: Release leader lease on graceful shutdown" on Aug 27, 2020.
@wking (Member, Author) commented Aug 27, 2020

Filed #446 against release-4.5.
