
Conversation

@cyyangnb commented Jun 3, 2025

Fixes #13986

Motivation

We continuously encountered a duplicate-pod issue because the workflow-controller can create duplicate pods due to informer delays.

Modifications

Add a cache that stores the latest workflow versions that have recently been used but are now outdated. When the informer is delayed, the controller can use this cache to determine whether the workflow status from the informer is outdated and, if it is, skip the action for now.
See the comment for more background:
#13986 (comment)
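
For illustration, the idea can be sketched roughly as follows. The names below (outdatedVersions, MarkSuperseded, IsOutdated) are illustrative only and are not the identifiers used in the actual change:

package controller

import (
	"sync"
	"time"
)

// outdatedVersions remembers which versions of each workflow the controller
// has already superseded by a successful update. Illustrative sketch only.
type outdatedVersions struct {
	mu  sync.Mutex
	ttl time.Duration                   // e.g. PROCESSED_WORKFLOW_VERSIONS_TTL
	m   map[string]map[string]time.Time // workflow key -> superseded resourceVersion -> time recorded
}

// MarkSuperseded is called after the controller successfully writes a new
// version of a workflow; rv is the resourceVersion of the copy it replaced.
func (c *outdatedVersions) MarkSuperseded(key, rv string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.m == nil {
		c.m = map[string]map[string]time.Time{}
	}
	if c.m[key] == nil {
		c.m[key] = map[string]time.Time{}
	}
	c.m[key][rv] = time.Now()
}

// IsOutdated reports whether the informer is still serving a version the
// controller has already superseded. resourceVersion is only compared for
// equality, never numerically, per the Kubernetes API conventions.
func (c *outdatedVersions) IsOutdated(key, rv string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	recordedAt, ok := c.m[key][rv]
	if !ok {
		return false
	}
	if time.Since(recordedAt) > c.ttl {
		delete(c.m[key], rv) // entry expired; fall back to normal processing
		return false
	}
	return true
}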

Verification

For the current version, I only did a happy-path test with a hello-world workflow:

$ make clean
$ make start
$ kubectl create -f examples/hello-world.yaml
$ kubectl get workflow
NAME                STATUS      AGE     MESSAGE
hello-world-q7gz6   Succeeded   3m43s

However, I did a full test based on v3.6.2, which is the version we use in our production environment.
I created a WorkflowTemplate massive-40 with 40 sequential steps (massive-40.txt). Each step runs a number of parallel nodes defined by the workflow.
Each node inserts a record into a Postgres DB. The last step then counts the records to check that the total matches the expectation. For example, the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  ...
spec:
  arguments:
    parameters:
    - name: nodes
      value: "10"
    - name: sleep
      value: 3s
    - name: batch-id
      value: stress-test-hwmfc
    - name: workflow-id
      value: stress-test-hwmfc-0
  workflowTemplateRef:
    name: massive-40

The last step expects to see 40 (sequential steps) * 10 (nodes per step) = 400 records; otherwise the workflow fails.
I submitted 50 such workflows in my cluster with PROCESSED_WORKFLOW_VERSIONS_TTL = 10m. Before the change (v3.6.2), all workflows failed. After the change, all workflows succeeded.

Here's the metrics comparison (the left one is “before fix”, the right one is “after fix”):

  • No failed workflow after fix (screenshot)
  • No 409 error from workflow-controller (screenshot)

Documentation

Added a description for PROCESSED_WORKFLOW_VERSIONS_TTL.

@cyyangnb (Author) commented Oct 2, 2025

Hi @isubasinghe, would you have time to review my changes?

@isubasinghe (Member):

We continuously encountered a duplicate-pod issue because the workflow-controller can create duplicate pods due to informer delays.

You mean multiple pods were created for the same node?

@cyyangnb (Author) commented Oct 3, 2025

We continuously encountered a duplicate-pod issue because the workflow-controller can create duplicate pods due to informer delays.

You mean multiple pods were created for the same node?

Yes. More specifically, a workflow step may be executed multiple times, which can result in its pod being created multiple times. From the timeline of #13986 (comment):

A: the step where the error occurred
B: the step after A

09:35:39.062 -> A init Pending 
09:35:39.073 -> A create pod
09:35:42.109 -> A update to phase Pending (PodInitializing)
09:35:48:943 -> A update to phase Running
09:36:46:040 -> A update to phase Succeeded
09:36:49:260 -> A update to phase Succeeded
09:36:49.279 -> B init Pending
09:36:49.493 -> B create pod
09:36:52.382 -> B update phase to pending (PodInitializing)
09:36:59:301 -> A Pending (PodInitializing)
09:37:02:301 -> A update to phase Running
09:37:05:430 -> A update to phase Running
09:37:08.673 -> A update to phase Failed
09:38:15.793 -> A update to phase Failed
09:38:15.793 -> B update phase to Succeeded

Step A was executed twice, once before step B and once after, even though step B should only be triggered after step A has succeeded.

@isubasinghe (Member):

Ah okay, thanks. I think task results aren't actually the solution here; sure, we might be able to stop pods from being created, but we still need to address the issue of the workflow going backwards.

@cyyangnb (Author) commented Oct 3, 2025

The change isn’t related to task results. I may have misunderstood your point — could you clarify what you mean? From our investigation, we’ve already identified that the issue is caused by informer delays. When a delay occurs, an outdated workflow state may be re-processed, which leads to the problem.
For more details, please see comment #13986 (comment). The fix in this PR is to skip processing a workflow if its status is outdated and wait until the up-to-date version is received.

@isubasinghe (Member):

Questions we need to answer:

  1. Will the informer prevent updates where the generation number decreases?
  2. Is this caused by our writeback mechanism?

On the Update handler, can we not just ignore these old generation numbers? Isn't this the cleanest fix?
This way no work would be actioned on them, and we could also then re-invalidate the cache here.
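
For illustration, such a guard might look roughly like the sketch below (not code from this PR; it assumes the informer delivers typed *wfv1.Workflow objects, which may not match how the controller actually stores them):

package controller

import (
	"k8s.io/client-go/tools/cache"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// addGenerationGuard registers an update handler that drops events whose
// generation went backwards, so no work is actioned on them. Caveat discussed
// later in this thread: for CRDs with the status subresource enabled,
// metadata.generation only changes on spec updates, so this does not order
// status-only writes.
func addGenerationGuard(informer cache.SharedIndexInformer, enqueue func(*wfv1.Workflow)) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldWf, ok1 := oldObj.(*wfv1.Workflow)
			newWf, ok2 := newObj.(*wfv1.Workflow)
			if !ok1 || !ok2 {
				return
			}
			if newWf.Generation < oldWf.Generation {
				return // stale object delivered after a newer one; ignore it
			}
			enqueue(newWf)
		},
	})
}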

@isubasinghe (Member):

The change isn’t related to task results. I may have misunderstood your point — could you clarify what you mean? From our investigation, we’ve already identified that the issue is caused by informer delays. When a delay occurs, an outdated workflow state may be re-processed, which leads to the problem.
For more details, please see comment #13986 (comment). The fix in this PR is to skip processing a workflow if its status is outdated and wait until the up-to-date version is received.

Yes, apologies, I initially thought this could be resolved with task results; it sort of can, but it doesn't solve the whole problem.

@cyyangnb (Author) commented Oct 3, 2025

Questions we need to answer:

  1. Will the informer prevent updates where the generation number decreases?
  2. Is this caused by our writeback mechanism?

On the Update handler, can we not just ignore these old generation numbers? Isn't this the cleanest fix? This way no work would be actioned on them, and we could also then re-invalidate the cache here.

  1. I believe the answer is no. From the evidence, when the issue occurred, the informer had already been updated with the new version by Argo’s writeback mechanism, but was later updated with an older version. I also believe it’s not straightforward to prevent this in the informer, especially since the Kubernetes documentation explicitly warns against comparing resourceVersion values numerically.
  2. No, this isn’t caused by the writeback mechanism (although it may be related). Even without writeback, the issue can still be reproduced, particularly when using a small DEFAULT_REQUEUE_TIME (e.g., 3s in my earlier tests). The problem could occur whenever a workflow is processed based on an outdated version.

Regarding the idea of having the update handler ignore old generation numbers, I had a similar thought earlier. However, as the diagram shows, the handler is invoked after the data has already been stored, so it doesn’t help in this case. In addition, if the informer delay is caused by a busy API server (which is what we’re seeing), forcing a re-invalidation could actually make things worse.

@cyyangnb (Author) commented Oct 7, 2025

Hi @isubasinghe, let me know if I misunderstood anything or if you need more info to keep the review moving. Thanks!

@isubasinghe (Member) commented Oct 7, 2025

I generally think that adding a cache in front of a cache (indirectly) seems like a bad idea.

Are we certain that we cannot use the update handler or filter func to reject/override out-of-date workflows that would replace a newer one?

I haven't had time to explore a solution yet, but I really don't want to add a cache in front of a cache.

@eduardodbr (Member):

You could use the update/filter func to reject processing of outdated workflows based on resource version. However, there are a few issues:

  1. The Kubernetes docs recommend not doing it.
  2. The informer cache is updated before the event handler triggers, which means the informer cache will be wrong. This is probably an eventual-consistency issue, because eventually the most up-to-date workflow will be processed. I don't know if there are other issues that might persist even with that change.

Issue #13986 states that the workflow controller throws multiple 409 errors:

Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io "my-another-workflow0vcmln": the object has been modified; please apply your changes to the latest version and try again Conflict

this log message means the controller should run reapplyUpdate():

wf, err := wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
if err != nil {
  woc.log.WithField("error", err).WithField("reason", apierr.ReasonForError(err)).Warn(ctx, "Error updating workflow")
  if argokubeerr.IsRequestEntityTooLargeErr(err) {
	  woc.persistWorkflowSizeLimitErr(ctx, wfClient, err)
	  return
  }
  if !apierr.IsConflict(err) {
	  return
  }
  woc.log.Info(ctx, "Re-applying updates on latest version and retrying update")
  wf, err := woc.reapplyUpdate(ctx, wfClient, nodes)
  if err != nil {
	  woc.log.WithError(err).Info(ctx, "Failed to re-apply update")
	  return
  }
  woc.wf = wf
} 

and reapplyUpdate() gets the workflow from the API server directly, so it should have the most up-to-date data. The update should not change a finished node back to running:

currWf, err := wfClient.Get(ctx, woc.wf.Name, metav1.GetOptions{})
if err != nil {
	return nil, err
}
if currWf.Status.Fulfilled() {
	return nil, fmt.Errorf("must never update completed workflows")
}
err = woc.controller.hydrator.Hydrate(ctx, currWf)
if err != nil {
	return nil, err
}
for id, node := range woc.wf.Status.Nodes {
	currNode, err := currWf.Status.Nodes.Get(id)
	if (err == nil) && currNode.Fulfilled() && node.Phase != currNode.Phase {
		return nil, fmt.Errorf("must never update completed node %s", id)
	}
}

Maybe currNode.Fulfilled() is not returning true although it should? Is it possible that TaskResultSynced is false?

func (n NodeStatus) Fulfilled() bool {
	synced := true
	if n.TaskResultSynced != nil {
		synced = *n.TaskResultSynced
	}
	return n.Phase.Fulfilled(n.TaskResultSynced) && synced || n.IsDaemoned() && n.Phase != NodePending
}

@cyyangnb can you reproduce the issue and provide the full controller log for one workflow?

@cyyangnb (Author) commented Oct 7, 2025

I generally think that adding a cache in front of a cache (indirectly) seems like a bad idea.

Are we certain that we cannot use the update handler or filter func to reject/override out-of-date workflows that would replace a newer one?

I haven't had time to explore a solution yet, but I really don't want to add a cache in front of a cache.

Thanks for the feedback. To avoid treating resourceVersion numerically, I think we would need an additional cache to track it. I assume you might prefer keeping this logic closer to the informer itself. The update handler is triggered after the cache has already been updated, so another possible approach is to use the informer’s transform function, although this wouldn’t fully satisfy the “idempotent” recommendation.

That said, binding the fix to the informer means users must enable INFORMER_WRITE_BACK; otherwise, the controller could still process outdated versions. However, I actually prefer not to turn on the INFORMER_WRITE_BACK after the fix. In my testing, when the API server is busy, enabling INFORMER_WRITE_BACK actually worsens the delay. I also think that if informer delays are considered normal, the controller should instead wait for the informer to catch up, rather than writing back to the informer—which introduces two possible sources of updates to the cache and may lead to race conditions or inconsistent state. (The last point might be more of a personal preference.)

@cyyangnb (Author) commented Oct 7, 2025

(Quoting @eduardodbr's comment above in full, ending with: "@cyyangnb can you reproduce the issue and provide the full controller log for one workflow?")

Thanks for the reply. Although a 409 conflict prevents the state from changing a finished node back to running (so the Kubernetes status remains correct), the action for the outdated version can still be executed before the 409 error is thrown. I didn’t keep the detailed logs, except for the ones attached in this issue. After applying the change in this PR for several months, we haven’t encountered this issue again.

@isubasinghe (Member):

@cyyangnb
You can use the generationNumber instead of the resourceVersion, which is strictly increasing.

But yes, the real fix I think is for the controller to read its own writes.

@cyyangnb (Author) commented Oct 9, 2025

You can use the generationNumber instead of the resourceVersion, which is strictly increasing.

Ah, I see. I previously thought that updating the status field would not affect the generation number. However, according to this reference, this behavior actually depends on whether spec.versions[*].subresources.status is defined in the CRD.

Another thought is that using resourceVersion might still be safer, since changes to the CRD definition won’t alter its semantics. I’m open to either option though.

Thanks for the feedback. Let me know if you expect me to do any change to move this PR forward.

@cyyangnb (Author):

@isubasinghe - Following up on our earlier discussion — any update on the review?

@isubasinghe (Member) commented Oct 23, 2025

@cyyangnb I've had a chat with @Joibel and @eduardodbr and we've decided we don't want to proceed with this path until we understand how and why this issue occurs in the first place.

After looking at the reflector, I think the informer writeback can almost certainly cause this.
The reflector maintains the last resource version, but when we do a writeback we do not update the reflector (we cannot).
This allows older workflows to sneak into the cache/informer.

What I don't understand is how a small requeue time can cause this.

@cyyangnb (Author) commented Oct 24, 2025

Hi @isubasinghe, @Joibel, @eduardodbr

I think the root cause is the same reason you originally added the informer writeback mechanism: there’s always a delay between when the workflow-controller updates the status in Kubernetes and when the informer cache gets updated.

Normally (when informer writeback is off), a workflow goes through something like this:

|------------t1------------|------------t2------------|------------t3------------|
v1 generated     v1 cached through reflector     v2 generated     v2 cached through reflector

If the controller processes the workflow between the version being generated and cached (during t1 or t3), it ends up operating on a stale version — causing the issue. A shorter requeue interval increases the likelihood that the workflow will be processed before the informer cache catches up.

The informer writeback may shorten t1 and t3, but it doesn’t eliminate them completely. As you’ve pointed out, since the cached version can still be reverted through the reflector path, this can’t fully solve the problem.

This PR aims to ensure the workflow isn’t processed during t1 or t3.
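
As a rough illustration (continuing the illustrative sketch from the Modifications section at the top; none of these names are the real identifiers), the worker-side check could look like:

package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// controllerSketch is a stand-in for the real workflow controller and reuses
// the illustrative outdatedVersions type sketched earlier.
type controllerSketch struct {
	outdated     *outdatedVersions
	queue        workqueue.RateLimitingInterface
	requeueTime  time.Duration
	informerCopy func(key string) (*wfv1.Workflow, bool)
	operate      func(*wfv1.Workflow)
}

// processWorkflow skips reconciliation while the informer is still serving a
// superseded version (the t1/t3 windows above) and requeues instead of acting
// on stale state.
func (c *controllerSketch) processWorkflow(key string) {
	wf, ok := c.informerCopy(key)
	if !ok {
		return
	}
	if c.outdated.IsOutdated(key, wf.ResourceVersion) {
		c.queue.AddAfter(key, c.requeueTime)
		return
	}
	c.operate(wf)
}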

Two implicit ideas behind this change are:

  1. I think it’s better to turn off the informer writeback, since it introduces two sources of updates to the informer cache, which can lead to inconsistent or chaotic states.
  2. In our case, the informer delay mainly comes from a busy Kubernetes API server. In such situations, letting the workflow-controller keep the latest workflow status locally after an update—rather than waiting for the informer to catch up—could actually make things worse.

I noticed that PR #14949 attempts to remove the informer writeback entirely. However, our approach differs regarding the second point.

@cyyangnb (Author) commented Nov 6, 2025

Hi @isubasinghe,
Does the comment above answer your question about how a small requeue time can cause this?
I’d also like to know whether there is any plan to fix this issue. I’m totally fine if the final decision is to go with the fix in #14949 or another solution — the important thing for me is that this real problem gets addressed in some way.

@shuangkun (Member):

(Quoting @cyyangnb's Oct 24 comment above in full.)

I agree with your point of view. This issue is very important, especially in large-scale scenarios. Our scenario involves more than 20,000 workflows running in parallel.
(screenshot)
We have encountered many problems because of this caching issue, such as a large number of pods being created twice.
Our Kubernetes control plane maintainers also advised me against writing back to the informer, as this would create two sources of updates. A better approach is to have a separate cache and local informer. Each time, select the resource with the larger resource version from these two caches for processing.
Our method has been online for several months now, and we haven't encountered any other problems.
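
For illustration, the selection step described above could look roughly like this sketch (not our actual code; it assumes resourceVersion can be parsed and ordered numerically, which, as noted earlier in this thread, the Kubernetes API conventions advise against relying on):

package controller

import (
	"strconv"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// pickFresher returns whichever copy of the workflow appears newer: the one
// from the controller's own local cache (written after its last update) or
// the one from the informer.
func pickFresher(local, fromInformer *wfv1.Workflow) *wfv1.Workflow {
	if local == nil {
		return fromInformer
	}
	if fromInformer == nil {
		return local
	}
	lv, err1 := strconv.ParseUint(local.ResourceVersion, 10, 64)
	iv, err2 := strconv.ParseUint(fromInformer.ResourceVersion, 10, 64)
	if err1 != nil || err2 != nil {
		// If either version cannot be parsed, fall back to the informer copy.
		return fromInformer
	}
	if lv > iv {
		return local
	}
	return fromInformer
}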

@carolkao:

Hi @shuangkun

Thank you for acknowledging that this issue is important.
I noticed in the screenshot you shared that there were around 20,000 workflows running simultaneously. May I ask what value you are using for DEFAULT_REQUEUE_TIME in your environment? Have you increased it from the default 10 seconds?

The reason I’m asking is that, in our environment, even with around 1,000 workflows and the default 10-second requeue interval, we’ve observed significant pressure on the Kubernetes API server. This has resulted in noticeable delays and performance degradation.

I would like to understand whether your cluster experiences similar behavior at higher workflow volumes, or if your environment is able to handle this scale without issues. This information will help us determine whether our cluster might have an unexpected bottleneck that we need to investigate and resolve.

If possible, could you also share some details about your cluster environment and its tier/size? Thank you!

@shuangkun (Member) commented Dec 11, 2025

Hi @shuangkun

Thank you for acknowledging that this issue is important. I noticed in the screenshot you shared that there were around 20,000 workflows running simultaneously. May I ask what value you are using for DEFAULT_REQUEUE_TIME in your environment? Have you increased it from the default 10 seconds?

The reason I’m asking is that, in our environment, even with around 1,000 workflows and the default 10-second requeue interval, we’ve observed significant pressure on the Kubernetes API server. This has resulted in noticeable delays and performance degradation.

I would like to understand whether your cluster experiences similar behavior at higher workflow volumes, or if your environment is able to handle this scale without issues. This information will help us determine whether our cluster might have an unexpected bottleneck that we need to investigate and resolve.

If possible, could you also share some details about your cluster environment and its tier/size? Thank you!

I did not modify the DEFAULT_REQUEUE_TIME parameter.
My scenario is about 20,000 concurrent workflows, but the number of pods in my individual workflows is small.
I increased the number of concurrent goroutines for workflow processing to 1024.
My optimization directions are:
1) Reduce event processing time and avoid OOM. (#14876, #14875)
2) Cache workflows, reduce conflicts, just like we are doing now.
3) Reduce etcd pressure. (#15115)

@carolkao:

Hi @shuangkun
Thanks for sharing this information. In our environment, we’ve observed that the bottleneck appears to come from the Kubernetes API server. When the number of concurrent workflows reaches around 2,000, we start to notice delay events (affecting things like pod scheduling, task result updates, and so on).

In our setup, each individual workflow has roughly 200 nodes (with around 40 pods), but compared to your scenario, it feels like there may still be areas in our environment that could be further optimized. Otherwise, the gap between our results and yours seems unexpectedly large.

We'll keep working on the investigation, and we'd really appreciate any more information you can share!


Development

Successfully merging this pull request may close these issues.

Workflow continues running after error occurred
