-
Notifications
You must be signed in to change notification settings - Fork 250
Bug 2071998: pkg/verify: Expose store error details, especially slow access #1371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 2071998: pkg/verify: Expose store error details, especially slow access #1371
Conversation
This function calls the callback for each signature, it doesn't return a list. Fixes a godoc from the original implementation in c43c2cb (Create a resuable verify package for image release verification, 2020-07-16, openshift#837).
Extending 1b9753d (pkg/verify: Expose underlying signature errors, 2022-04-21, openshift#1358) with timestamps, so it's easy to see what's slow in situations like [1] where some portion of signature verification is surprisingly slow, and it's currently not clear what aspect is causing the slowdown. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
Extending 1b9753d (pkg/verify: Expose underlying signature errors, 2022-04-21, openshift#1358) with information about when store retrieval is exhausted, so it's easy to see what's slow in situations like [1] where some portion of signature verification is surprisingly slow, and it's currently not clear what aspect is causing the slowdown. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
We can hit this case if lookup fails (e.g. because it takes too long to retrieve over HTTP, and the Context times out). Instead of failing with just "context deadline exceeded", timestamp this error, roll it in with the others, and fall through to consider any remaining verifiers who have not yet been satisfied.
…n context cancel From the spec [1]: If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. Otherwise, if there is a default case, that case is chosen. If there is no default case, the "select" statement blocks until at least one of the communications can proceed. So in this callback, we have been called by Signatures, and have a signature/errIn we'd like to pass back through responses. But maybe responses is full. In that case, we try to block on the write, in case we get some space to write later on. But we don't want to block forever. Previously, the parallel ctx.Done() would bail us out of a too-long wait (good), but in cases where it won the select random-choice we would discard the response data we'd been trying to send (bad). With this commit, the ctx.Done() case gets a single, non-blocking write attempt, so we can pass along the signature/errIn information if there happens to be space. If not, there will be plenty of earlier responses in that channel that we'll be able to process later. While I'm touching things here, pivot from 'true, nil' to 'false, ctx.Err()' to show that we're a bit grumpy about being canceled. The errorChannel collector is going to silently discard the Canceled error when it's aggregating errorChannel content, but the presence of the error will make it less likely that later logic pivots misinterpret this as "successful validation". [1]: https://go.dev/ref/spec#Select_statements
|
@wking: This pull request references Bugzilla bug 2071998, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
78e4b2d to
550d7f6
Compare
| } | ||
| } | ||
| } | ||
| if done, err := fn(ctx, nil, fmt.Errorf("prefix %s in config map %s: %w", prefix, cm.ObjectMeta.Name, store.ErrNotFound)); err != nil || done { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing to me. It appears that all it's going to do is log the error prefix... but why? At this point it's not clear to me that there's been an error but I can see from the other changes in this commit new error related stuff has been added. Not clear how it all hangs together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b594c76 pivots Signatures from:
Not finding any acceptable signatures is not an error; it is up to the caller to handle that case.
to:
Not finding additional signatures should result in a callback call with an error wrapping ErrNotFound, to allow the caller to figure out when and why the store was unable to find a signature. When a store has several lookup mechanisms, this may result in several callback calls with different ErrNotFound. Signatures itself should return nil in this case, because eventually running out of signatures is an expected part of any invocation where the callback calls never return done=true.
So here I'm saying:
I've exhausted this particular ConfigMap's signatures, and you still aren't happy.
ErrNotFound(wrapped up with some context about who I am), to let you know that I'm done. Maybe I'll have another ConfigMap with more signatures in a bit. Or maybe not. But theErrNotFoundwill make it easy for your to track/timestamp my progress through these ConfigMaps.
Ensure that we have access to this data, so it's easy to see what's slow in situations like [1]. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
550d7f6 to
4f76483
Compare
|
@wking: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jottofar, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@wking: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with Bugzilla bug 2071998 has not been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Picking up openshift/library-go@1b9753d298 (Bug 2071998: pkg/verify: Expose underlying signature errors, 2022-04-21, openshift/library-go#1358) and openshift/library-go#1371. Generated with: $ go get -u github.com/openshift/library-go $ go mod tidy $ go mod vendor $ git add -A go.* vendor using: $ go version go version go1.17.3 linux/amd64
Extending #1358 with more details and a test-case that confirms we will get timestamps and URIs when a HTTP request times out or is canceled. Bunch of fiddly pivots here, so it's a number of small commits. See the individual commit messages for a description of why I'm making the pivot that I'm making in that commit.