-
Notifications
You must be signed in to change notification settings - Fork 88
Description
The Problem
We are finding that sometimes when we scan a manifest that has already been scanned we will end up with an empty IndexReport stored for that manifest:
{
"manifest_hash": "sha256:a2f196cc1f224ba665cde842a5be1d3c7ded333c279525242eba123677152172",
"state": "",
"packages": {},
"distributions": {},
"repository": {},
"environments": {},
"success": false,
"err": ""
}Once in this state any further request to rescan it will not do anything to change the IndexReport. The logs from the call to scan the previously scanned manifest show:
info index request start
debug locking attempt
debug locking OK
info starting scan
info index request done
The logs on any future attempts show an almost identical output except they include the manifest already scanned line:
info index request start
debug locking attempt
debug locking OK
info starting scan
info manifest already scanned
info index request done
Suspected Code Path
Looking at the code I think this happens when an attempt to get the IndexReport in checkManifest timeouts. The flow I believe is:
- The
Controllerenters the run method and callsCheckManifestas the first state. At this point theIndexReporton the controller is empty checkManifestdiscovers that the manifest is already scannedcheckManifestattempts to load theIndexReportbut the call tos.Store.IndexReporttimes out (5s timeout) and an err is returned.checkManifestdoes not log anything in this case andIndexReporton the controller is not set because it was not loaded.- Back in the calling controller it goes through the checks as to whether an error occurred. As this is a timeout
errors.Is(err, context.DeadlineExceeded)will be true and it will go into that case - As the timeout case does not exit the
forloop early it just sets theretryvariable to true theControllerthen proceeds to store theIndexReportinto the database. This stores an emptyIndexReportinto the database. - The
Controllerthen continues by going into the retry logic which pauses for a bit and then setserrto nil. - When
checkManifestreturned it returned the next state asTerminalsoControllernow sees this and exits
I think this is consistent with what we see in the logs where we get no logging at all indicating an error has occurred. Nowhere in the code path above does a log get emitted with the error from checkManifest. Future calls to scan on this manifest follow a very similar path except the call to IndexReport does not timeout but all it loads is an empty IndexReport, it then emits the manifest already scanned log line and does not return an error but we are left with the empty IndexReport which check manifests thinks is OK.
The solution
I think there are a couple of things to be addressed here. I'm happy to work on a PR but would like advice first as I think some change to behaviour is required.
Retry logic in controller
I do not think that the retry is doing anything. Every state that I can see that returns an error also sets the next state to Terminal. This means if any of these errors are timeouts then they will go through the retry logic path of having a small pause before seeing that the next state is Terminal and exiting.
Personally I would delete the errors.Is(err, context.DeadlineExceeded) case and retry variables rather than attempting to fix them (the fix would be to change that case to have what is currently inside the if retryblock and put acontinueon the end so it goes back around the loop without changing thenextstate and without persisting a potentially invalidIndexReport`).
Checking the validity of manifests in checkManifest
Fixing the retry logic maybe enough to fix this but care would need to be taken to not persist a bad IndexReport when the call to s.Store.IndexReport fails. Even if the Controller's standard error handling code was used for one of these errors then it would result in an IndexReport being persisted with state IndexError which again isn't what is wanted. I wonder if a check should be added to checkManifest to actually make sure the loaded IndexReport is valid, something like:
if sr.State != IndexFinished {
zlog.Info(ctx).Msg("IndexReport not valid so manifest to be scanned")
// Potentially do the filtering that happens in the case where a Manifest has not been scanned, but in this case there will be no scanners left as we know if we get this far `ManifestScanned` returned `ok`
s.Vscnrs = indexer.VersionedScanners{}
return FetchLayers, nil
}