Skip to content

Conversation

@eduardodbr
Copy link
Member

@eduardodbr eduardodbr commented Nov 28, 2025

Motivation

Multiple issues have been created because of unexpected workflow behavior:

#13986
#14833
#12352
#14780

It appears that many of these issues occur because the controller is processing an outdated version of the workflow. The exact cause of these stale reads is still unknown, but there is some suspicion that it may be related to the informer write-back mechanism, which is being disabled by default in #15079.
This PR ensures that stale workflow versions are not reconciled by keeping track of the last processed resource version for each workflow in a last-seen-version annotation. A workflow is only processed when its annotation matches the expected version; otherwise, it is re-queued. The annotation stores the workflow’s resource version, though any unique value would work. I just thought using the RV was enough.

Modifications

  • Introduce a new last-seen-version annotation, updated with the current resource version on every Update() event.
  • Store the last-seen-version of each workflow in memory. When a workflow is processed, it proceeds only if the annotation matches the stored version.
  • If no stored version exists (e.g., after a controller restart), the workflow is always processed to allow normal recovery.
  • The in-memory entry is removed as soon as a Delete event is received or when the workflow completes.

Verification

Executed workflows with success.

Documentation

Summary by CodeRabbit

  • New Features
    • Workflow version tracking now maintains the last observed resource version per workflow
    • System skips processing outdated workflows to optimize reconciliation performance
    • Added caching layer for efficient version state management across workflow events

✏️ Tip: You can customize this high-level summary in your review settings.

@eduardodbr
Copy link
Member Author

/retest

@eduardodbr eduardodbr marked this pull request as ready for review November 30, 2025 19:46
@Joibel Joibel requested a review from Copilot December 1, 2025 11:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a mechanism to detect and skip processing of stale workflow versions in the workflow controller, addressing multiple issues where the controller processes outdated versions of workflows. The implementation uses a combination of a workflow annotation and an in-memory map to track the last processed resource version for each workflow.

Key Changes:

  • Added last-seen-version annotation and in-memory tracking to identify stale workflow events
  • Integrated stale detection check (isOutdated) in the workflow processing pipeline
  • Cleanup of tracking data when workflows complete or are deleted

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
workflow/common/common.go Defines the new AnnotationKeyLastSeenVersion constant for storing the last seen resource version
workflow/controller/controller.go Adds lastSeenVersions struct and tracking logic, implements isOutdated check in processing pipeline, and cleanup on workflow completion/deletion
workflow/controller/operator.go Updates persistUpdates and persistWorkflowSizeLimitErr to set the annotation and update in-memory tracking after successful workflow updates

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

…sistWorkflowSizeLimitErr

Signed-off-by: Eduardo Rodrigues <[email protected]>
Signed-off-by: Eduardo Rodrigues <[email protected]>
@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request adds last-seen resource version tracking to the workflow controller. A new annotation constant is introduced, the controller caches last-observed resource versions per workflow to detect outdated events, and the operator updates both the annotation and cache during workflow updates to maintain state consistency.

Changes

Cohort / File(s) Summary
Annotation constant
workflow/common/common.go
Adds exported constant AnnotationKeyLastSeenVersion to define the annotation key for storing the last seen workflow version.
Controller state tracking and validation
workflow/controller/controller.go
Adds lastSeenVersions map field to cache last-observed resource versions per workflow UID, protected by mutex. Initializes cache in NewWorkflowController. Adds isOutdated() method to check if a workflow's stored version differs from its annotation, getLastSeenVersionKey() to derive cache keys, and deleteLastSeenVersionKey() to purge entries. Modifies processNextItem() to skip outdated workflows. Updates informer handlers (AddFunc, UpdateFunc, DeleteFunc) to manage cache lifecycle when workflows are completed or deleted.
Operator annotation and cache updates
workflow/controller/operator.go
Adds updateLastSeenVersionAnnotation() and updateLastSeenVersion() helper methods to wfOperationCtx. Integrates version tracking into persistUpdates() and persistWorkflowSizeLimitErr() by capturing old resource version, writing it to annotations, and syncing to the in-memory cache after successful updates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Thread safety: Review mutex usage and lock placement in lastSeenVersions access patterns across AddFunc, UpdateFunc, and DeleteFunc handlers
  • State consistency: Verify cache entries are properly purged when workflows are deleted or completed, and that annotation/cache synchronization in operator methods maintains correctness under concurrent updates
  • Control flow: Confirm isOutdated() logic and skip behavior in processNextItem() correctly prevent processing of stale events without losing legitimate updates

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description covers motivation, modifications, and verification, but lacks issue reference in Fixes field and documentation updates are incomplete. Add 'Fixes #' with at least one referenced issue number and clarify documentation status or impact.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding stale workflow detection to the workflow controller.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@eduardodbr
Copy link
Member Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/common/common.go (1)

60-61: Clarify what “version” refers to in the new annotation comment

To reduce ambiguity, consider explicitly stating that this stores the workflow metadata.resourceVersion last successfully processed by the controller and is used with the in‑memory cache to skip stale informer events. The implementation itself looks correct.

workflow/controller/operator.go (1)

764-766: Last‑seen version bookkeeping around persist paths looks consistent; worth hardening with tests

Using oldRV := woc.wf.ResourceVersion as the value for both updateLastSeenVersionAnnotation and updateLastSeenVersion ensures that a workflow is only considered “up‑to‑date” when the informer cache has observed the controller’s last successful update (annotation matches cached value). The conflict (reapplyUpdate) and size‑limit error paths also keep annotation and cache in sync only after a successful Update, which is the right behavior.

Given how central this is to skipping stale reconciliations, I’d strongly recommend adding focused unit tests that exercise at least:

  • Successful persistUpdates (no conflict),
  • Conflict + successful reapplyUpdate,
  • Request‑entity‑too‑large path via persistWorkflowSizeLimitErr,
  • isOutdated on objects whose annotations are (a) behind the cache, (b) equal to the cache, and (c) missing.

That would make the intended semantics much easier to maintain.

Also applies to: 789-789, 866-873

🧹 Nitpick comments (2)
workflow/controller/controller.go (2)

85-88: lastSeenVersions cache design is solid; consider behavior when users edit the annotation

The UID‑keyed lastSeenVersions with an RWMutex looks correct, and isOutdated’s “process only when annotation == cached value, or cache miss” rule matches the PR description and avoids acting on informer state that hasn’t observed the controller’s last successful update.

One subtle edge case: if someone manually edits or removes workflow.argoproj.io/last-seen-version on a running Workflow, isOutdated will keep returning true (annotation ≠ cached value) and the controller will never reconcile that workflow again, because the cache entry is only updated from within persistUpdates / persistWorkflowSizeLimitErr, which are gated behind isOutdated.

If that is an acceptable “don’t touch controller‑owned annotation” contract, it might be worth:

  • Documenting this annotation as controller‑owned, and/or
  • Logging at warn level when we detect a persistent mismatch, or resetting the cache entry when the annotation is missing.

Otherwise, we may want a small escape hatch so an accidentally edited annotation doesn’t permanently wedge a workflow.

Also applies to: 162-163, 215-218, 1369-1379


738-742: Outdated‑workflow requeue behavior is correct; consider adding observability

Re‑queuing with AddRateLimited when isOutdated is true is a reasonable way to wait for the informer cache to catch up before reconciling.

Given this path indicates potential cache staleness, you might consider incrementing a metric or counter when we skip an outdated event so operators can spot clusters where this happens frequently and diagnose underlying watch/cache issues.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9242c6f and ec0397e.

📒 Files selected for processing (3)
  • workflow/common/common.go (1 hunks)
  • workflow/controller/controller.go (7 hunks)
  • workflow/controller/operator.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
workflow/controller/controller.go (1)
workflow/common/common.go (1)
  • AnnotationKeyLastSeenVersion (61-61)
workflow/controller/operator.go (2)
util/logging/logging.go (1)
  • Warn (59-59)
workflow/common/common.go (1)
  • AnnotationKeyLastSeenVersion (61-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Unit Tests
  • GitHub Check: Windows Unit Tests
  • GitHub Check: Lint
  • GitHub Check: argo-images (argoexec)
  • GitHub Check: argo-images (argocli)
🔇 Additional comments (2)
workflow/controller/operator.go (1)

4405-4419: Thread‑safety and scoping of last‑seen helpers look good

Both updateLastSeenVersionAnnotation and updateLastSeenVersion are nicely localized to wfOperationCtx and correctly guard shared state (lastSeenVersions.versions) with a mutex. Keying by woc.controller.getLastSeenVersionKey(woc.wf) (UID) ensures stability across renames/resubmissions of the same workflow.

workflow/controller/controller.go (1)

966-967: Cleaning lastSeenVersions on completion and delete avoids cache leaks

Calling deleteLastSeenVersionKey both when reconciliation is no longer needed (completed workflows with no GC finalizer) and on delete events properly cleans up the in‑memory cache and ensures old UIDs don’t accumulate or interfere with resubmitted workflows.

Also applies to: 1024-1025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant