@wking
Member

@wking wking commented May 19, 2022

Extending #1358 with more details and a test case that confirms we will get timestamps and URIs when an HTTP request times out or is canceled. Bunch of fiddly pivots here, so it's a number of small commits. See the individual commit messages for a description of why I'm making the pivot that I'm making in that commit.

wking added 5 commits May 19, 2022 01:41
This function calls the callback for each signature; it doesn't return
a list.  Fixes a godoc from the original implementation in c43c2cb
(Create a resuable verify package for image release verification,
2020-07-16, openshift#837).
Extending 1b9753d (pkg/verify: Expose underlying signature errors,
2022-04-21, openshift#1358) with timestamps, so it's easy to see what's slow in
situations like [1] where some portion of signature verification is
surprisingly slow, and it's currently not clear what aspect is causing
the slowdown.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
Extending 1b9753d (pkg/verify: Expose underlying signature errors,
2022-04-21, openshift#1358) with information about when store retrieval is
exhausted, so it's easy to see what's slow in situations like [1]
where some portion of signature verification is surprisingly slow, and
it's currently not clear what aspect is causing the slowdown.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
We can hit this case if lookup fails (e.g. because it takes too long
to retrieve over HTTP, and the Context times out).  Instead of failing
with just "context deadline exceeded", timestamp this error, roll it
in with the others, and fall through to consider any remaining
verifiers who have not yet been satisfied.
…n context cancel

From the spec [1]:

  If one or more of the communications can proceed, a single one that
  can proceed is chosen via a uniform pseudo-random
  selection. Otherwise, if there is a default case, that case is
  chosen. If there is no default case, the "select" statement blocks
  until at least one of the communications can proceed.

So in this callback, we have been called by Signatures, and have a
signature/errIn we'd like to pass back through responses.  But maybe
responses is full.  In that case, we try to block on the write, in
case we get some space to write later on.  But we don't want to block
forever.  Previously, the parallel ctx.Done() would bail us out of a
too-long wait (good), but in cases where it won the select
random-choice we would discard the response data we'd been trying to
send (bad).  With this commit, the ctx.Done() case gets a single,
non-blocking write attempt, so we can pass along the signature/errIn
information if there happens to be space.  If not, there will be
plenty of earlier responses in that channel that we'll be able to
process later.

While I'm touching things here, pivot from 'true, nil' to 'false,
ctx.Err()' to show that we're a bit grumpy about being canceled.  The
errorChannel collector is going to silently discard the Canceled error
when it's aggregating errorChannel content, but the presence of the
error will make it less likely that later logic pivots misinterpret
this as "successful validation".

[1]: https://go.dev/ref/spec#Select_statements
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 19, 2022

@wking: This pull request references Bugzilla bug 2071998, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jiajliu

In response to this:

Bug 2071998: pkg/verify: Expose store error details, especially slow access

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from deads2k, jiajliu and sttts May 19, 2022 10:00
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2022
@wking wking force-pushed the verify-store-error-details branch from 78e4b2d to 550d7f6 Compare May 19, 2022 10:09
}
}
}
if done, err := fn(ctx, nil, fmt.Errorf("prefix %s in config map %s: %w", prefix, cm.ObjectMeta.Name, store.ErrNotFound)); err != nil || done {
Contributor

@jottofar jottofar May 19, 2022

This is confusing to me. It appears that all it's going to do is log the error prefix... but why? At this point it's not clear to me that there's been an error, but I can see from the other changes in this commit that new error-related stuff has been added. Not clear how it all hangs together.

Member Author

b594c76 pivots Signatures from:

Not finding any acceptable signatures is not an error; it is up to the caller to handle that case.

to:

Not finding additional signatures should result in a callback call with an error wrapping ErrNotFound, to allow the caller to figure out when and why the store was unable to find a signature. When a store has several lookup mechanisms, this may result in several callback calls with different ErrNotFound. Signatures itself should return nil in this case, because eventually running out of signatures is an expected part of any invocation where the callback calls never return done=true.

So here I'm saying:

I've exhausted this particular ConfigMap's signatures, and you still aren't happy. ErrNotFound (wrapped up with some context about who I am), to let you know that I'm done. Maybe I'll have another ConfigMap with more signatures in a bit. Or maybe not. But the ErrNotFound will make it easy for you to track/timestamp my progress through these ConfigMaps.

Ensure that we have access to this data, so it's easy to see what's
slow in situations like [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
@wking wking force-pushed the verify-store-error-details branch from 550d7f6 to 4f76483 Compare May 19, 2022 19:17
@openshift-ci
Contributor

openshift-ci bot commented May 19, 2022

@wking: all tests passed!


@jottofar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 5bcfed8 into openshift:master May 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented May 23, 2022

@wking: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 2071998 has not been moved to the MODIFIED state.

In response to this:

Bug 2071998: pkg/verify: Expose store error details, especially slow access


@wking wking deleted the verify-store-error-details branch May 23, 2022 17:51
wking added a commit to wking/cluster-version-operator that referenced this pull request May 23, 2022
Picking up openshift/library-go@1b9753d298 (Bug 2071998: pkg/verify:
Expose underlying signature errors, 2022-04-21,
openshift/library-go#1358) and openshift/library-go#1371.  Generated
with:

  $ go get -u github.com/openshift/library-go
  $ go mod tidy
  $ go mod vendor
  $ git add -A go.* vendor

using:

  $ go version
  go version go1.17.3 linux/amd64