-
Notifications
You must be signed in to change notification settings - Fork 3.4k
fix: workflow controller to detect stale workflows #15090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Eduardo Rodrigues <[email protected]>
Signed-off-by: Eduardo Rodrigues <[email protected]>
|
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a mechanism to detect and skip processing of stale workflow versions in the workflow controller, addressing multiple issues where the controller processes outdated versions of workflows. The implementation uses a combination of a workflow annotation and an in-memory map to track the last processed resource version for each workflow.
Key Changes:
- Added
last-seen-versionannotation and in-memory tracking to identify stale workflow events - Integrated stale detection check (
isOutdated) in the workflow processing pipeline - Cleanup of tracking data when workflows complete or are deleted
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
workflow/common/common.go |
Defines the new AnnotationKeyLastSeenVersion constant for storing the last seen resource version |
workflow/controller/controller.go |
Adds lastSeenVersions struct and tracking logic, implements isOutdated check in processing pipeline, and cleanup on workflow completion/deletion |
workflow/controller/operator.go |
Updates persistUpdates and persistWorkflowSizeLimitErr to set the annotation and update in-memory tracking after successful workflow updates |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
Signed-off-by: Eduardo Rodrigues <[email protected]>
…sistWorkflowSizeLimitErr Signed-off-by: Eduardo Rodrigues <[email protected]>
Signed-off-by: Eduardo Rodrigues <[email protected]>
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughThis pull request adds last-seen resource version tracking to the workflow controller. A new annotation constant is introduced, the controller caches last-observed resource versions per workflow to detect outdated events, and the operator updates both the annotation and cache during workflow updates to maintain state consistency. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
workflow/common/common.go (1)
60-61: Clarify what “version” refers to in the new annotation commentTo reduce ambiguity, consider explicitly stating that this stores the workflow
metadata.resourceVersionlast successfully processed by the controller and is used with the in‑memory cache to skip stale informer events. The implementation itself looks correct.workflow/controller/operator.go (1)
764-766: Last‑seen version bookkeeping around persist paths looks consistent; worth hardening with testsUsing
oldRV := woc.wf.ResourceVersionas the value for bothupdateLastSeenVersionAnnotationandupdateLastSeenVersionensures that a workflow is only considered “up‑to‑date” when the informer cache has observed the controller’s last successful update (annotation matches cached value). The conflict (reapplyUpdate) and size‑limit error paths also keep annotation and cache in sync only after a successfulUpdate, which is the right behavior.Given how central this is to skipping stale reconciliations, I’d strongly recommend adding focused unit tests that exercise at least:
- Successful
persistUpdates(no conflict),- Conflict + successful
reapplyUpdate,- Request‑entity‑too‑large path via
persistWorkflowSizeLimitErr,isOutdatedon objects whose annotations are (a) behind the cache, (b) equal to the cache, and (c) missing.That would make the intended semantics much easier to maintain.
Also applies to: 789-789, 866-873
🧹 Nitpick comments (2)
workflow/controller/controller.go (2)
85-88:lastSeenVersionscache design is solid; consider behavior when users edit the annotationThe UID‑keyed
lastSeenVersionswith an RWMutex looks correct, andisOutdated’s “process only when annotation == cached value, or cache miss” rule matches the PR description and avoids acting on informer state that hasn’t observed the controller’s last successful update.One subtle edge case: if someone manually edits or removes
workflow.argoproj.io/last-seen-versionon a running Workflow,isOutdatedwill keep returning true (annotation ≠ cached value) and the controller will never reconcile that workflow again, because the cache entry is only updated from withinpersistUpdates/persistWorkflowSizeLimitErr, which are gated behindisOutdated.If that is an acceptable “don’t touch controller‑owned annotation” contract, it might be worth:
- Documenting this annotation as controller‑owned, and/or
- Logging at warn level when we detect a persistent mismatch, or resetting the cache entry when the annotation is missing.
Otherwise, we may want a small escape hatch so an accidentally edited annotation doesn’t permanently wedge a workflow.
Also applies to: 162-163, 215-218, 1369-1379
738-742: Outdated‑workflow requeue behavior is correct; consider adding observabilityRe‑queuing with
AddRateLimitedwhenisOutdatedis true is a reasonable way to wait for the informer cache to catch up before reconciling.Given this path indicates potential cache staleness, you might consider incrementing a metric or counter when we skip an outdated event so operators can spot clusters where this happens frequently and diagnose underlying watch/cache issues.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
workflow/common/common.go(1 hunks)workflow/controller/controller.go(7 hunks)workflow/controller/operator.go(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
workflow/controller/controller.go (1)
workflow/common/common.go (1)
AnnotationKeyLastSeenVersion(61-61)
workflow/controller/operator.go (2)
util/logging/logging.go (1)
Warn(59-59)workflow/common/common.go (1)
AnnotationKeyLastSeenVersion(61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Unit Tests
- GitHub Check: Windows Unit Tests
- GitHub Check: Lint
- GitHub Check: argo-images (argoexec)
- GitHub Check: argo-images (argocli)
🔇 Additional comments (2)
workflow/controller/operator.go (1)
4405-4419: Thread‑safety and scoping of last‑seen helpers look goodBoth
updateLastSeenVersionAnnotationandupdateLastSeenVersionare nicely localized towfOperationCtxand correctly guard shared state (lastSeenVersions.versions) with a mutex. Keying bywoc.controller.getLastSeenVersionKey(woc.wf)(UID) ensures stability across renames/resubmissions of the same workflow.workflow/controller/controller.go (1)
966-967: CleaninglastSeenVersionson completion and delete avoids cache leaksCalling
deleteLastSeenVersionKeyboth when reconciliation is no longer needed (completed workflows with no GC finalizer) and on delete events properly cleans up the in‑memory cache and ensures old UIDs don’t accumulate or interfere with resubmitted workflows.Also applies to: 1024-1025
Motivation
Multiple issues have been created because of unexpected workflow behavior:
#13986
#14833
#12352
#14780
It appears that many of these issues occur because the controller is processing an outdated version of the workflow. The exact cause of these stale reads is still unknown, but there is some suspicion that it may be related to the informer write-back mechanism, which is being disabled by default in #15079.
This PR ensures that stale workflow versions are not reconciled by keeping track of the last processed resource version for each workflow in a last-seen-version annotation. A workflow is only processed when its annotation matches the expected version; otherwise, it is re-queued. The annotation stores the workflow’s resource version, though any unique value would work. I just thought using the RV was enough.
Modifications
last-seen-versionannotation, updated with the current resource version on everyUpdate()event.Deleteevent is received or when the workflow completes.Verification
Executed workflows with success.
Documentation
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.