fix: workflow-controller to identify outdated workflow status. Fixes #13986 #14532
Conversation
|
Hi @isubasinghe, would you have time to review my changes? |
You mean multiple pods were created for the same node? |
Yes. More specifically, a workflow step may be executed multiple times, which can result in its pod being created multiple times. From the timeline of #13986 (comment): step A was executed twice — once before step B and once after, even though step B should only be triggered after step A has succeeded. |
|
Ah okay, thanks. I think task results aren't the solution here actually; sure, we might be able to stop pods from being created, but we still need to address the issue of the workflow going backwards. |
|
The change isn’t related to task results. I may have misunderstood your point — could you clarify what you mean? From our investigation, we’ve already identified that the issue is caused by informer delays. When a delay occurs, an outdated workflow state may be re-processed, which leads to the problem. |
|
Questions we need to answer:
On the Update handler, can we not just ignore these old generation numbers? Isn't this the cleanest fix? |
Yes, apologies, I initially thought this could be resolved with task results; it sort of can, but it doesn't solve the whole problem. |
Regarding the idea of having the update handler ignore old generation numbers, I had a similar thought earlier. However, from the diagram, since it is invoked after the data has already been stored, it doesn’t help in this case. In addition, if the informer delay is caused by a busy API server (which is what we’re seeing), forcing a re-invalidation could actually make things worse. |
|
Hi @isubasinghe, let me know if I misunderstood anything or if you need more info to keep the review moving. Thanks! |
|
I generally think that adding a cache in front of a cache (indirectly) seems like a bad idea. Are we certain that we cannot use the update handler or filter func to reject/override workflows that are out of date and would replace a newer one? I haven't had time to explore a solution yet, but I really don't want to add a cache in front of a cache. |
|
You could use the update/filter func to reject processing of outdated workflows based on resource version. However, there are a few issues:
Issue #13986 states that the workflow controller throws multiple 409 errors. This log message means the controller should run `reapplyUpdate()`:

```go
wf, err := wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
if err != nil {
	woc.log.WithField("error", err).WithField("reason", apierr.ReasonForError(err)).Warn(ctx, "Error updating workflow")
	if argokubeerr.IsRequestEntityTooLargeErr(err) {
		woc.persistWorkflowSizeLimitErr(ctx, wfClient, err)
		return
	}
	if !apierr.IsConflict(err) {
		return
	}
	woc.log.Info(ctx, "Re-applying updates on latest version and retrying update")
	wf, err := woc.reapplyUpdate(ctx, wfClient, nodes)
	if err != nil {
		woc.log.WithError(err).Info(ctx, "Failed to re-apply update")
		return
	}
	woc.wf = wf
}
```

`reapplyUpdate()` gets the workflow from the API server directly, so it should have the most up-to-date data. The `Update` should not change a finished node back to running:

```go
currWf, err := wfClient.Get(ctx, woc.wf.Name, metav1.GetOptions{})
if err != nil {
	return nil, err
}
if currWf.Status.Fulfilled() {
	return nil, fmt.Errorf("must never update completed workflows")
}
err = woc.controller.hydrator.Hydrate(ctx, currWf)
if err != nil {
	return nil, err
}
for id, node := range woc.wf.Status.Nodes {
	currNode, err := currWf.Status.Nodes.Get(id)
	if (err == nil) && currNode.Fulfilled() && node.Phase != currNode.Phase {
		return nil, fmt.Errorf("must never update completed node %s", id)
	}
}
```

Maybe:

```go
func (n NodeStatus) Fulfilled() bool {
	synced := true
	if n.TaskResultSynced != nil {
		synced = *n.TaskResultSynced
	}
	return n.Phase.Fulfilled(n.TaskResultSynced) && synced || n.IsDaemoned() && n.Phase != NodePending
}
```

@cyyangnb can you reproduce the issue and provide the full controller log for one workflow? |
Thanks for the feedback. To avoid treating resourceVersion numerically, I think we would need an additional cache to track it. I assume you might prefer keeping this logic closer to the informer itself. The update handler is triggered after the cache has already been updated, so another possible approach is to use the informer’s transform function. That said, binding the fix to the informer means users must enable INFORMER_WRITE_BACK; otherwise, the controller could still process outdated versions. However, I actually prefer not to turn on INFORMER_WRITE_BACK after the fix. In my testing, when the API server is busy, enabling INFORMER_WRITE_BACK actually worsens the delay. I also think that if informer delays are considered normal, the controller should instead wait for the informer to catch up, rather than writing back to the informer—which introduces two possible sources of updates to the cache and may lead to race conditions or inconsistent state. (The last point might be more of a personal preference.) |
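To make the transform-function idea above concrete, here is a rough, self-contained sketch; it is not code from this PR or from Argo Workflows. The `lastWritten` map is a hypothetical record of the controller's own writes, and the sketch only illustrates why resourceVersion cannot be ordered and why extra bookkeeping is needed.

```go
package sketch

import (
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
)

// lastWritten is a hypothetical record of the resourceVersion the controller
// most recently wrote for each workflow; it is not part of this PR.
var lastWritten = map[string]string{}

// newTransform returns a cache.TransformFunc that runs before the informer
// stores an object. resourceVersion is opaque, so the only meaningful
// comparison is equality against a version the controller itself produced,
// which is why a separate record of our own writes is needed.
func newTransform() cache.TransformFunc {
	return func(obj interface{}) (interface{}, error) {
		un, ok := obj.(*unstructured.Unstructured)
		if !ok {
			return obj, nil
		}
		key := un.GetNamespace() + "/" + un.GetName()
		if want, found := lastWritten[key]; found && un.GetResourceVersion() != want {
			// The incoming copy is not the version we last wrote, so it may be
			// stale. A transform cannot drop the event, so the final decision
			// still has to be made where the workflow is actually processed.
			log.Printf("workflow %s: informer copy %s differs from last written %s",
				key, un.GetResourceVersion(), want)
		}
		return un, nil
	}
}
```

Such a function could be registered on the informer via SetTransform, but, as noted above, it runs on the informer's schedule and cannot by itself stop the controller from acting on a stale cached copy.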
Thanks for the reply. Although a 409 conflict prevents the state from changing a finished node back to running (so the Kubernetes status remains correct), the action for the outdated version can still be executed before the 409 error is thrown. I didn’t keep the detailed logs, except for the ones attached in this issue. After running with the change in this PR for several months, we haven’t encountered this issue again. |
|
@cyyangnb But yes, the real fix I think is for the controller to read its own writes. |
Ah, I see. I previously thought that updating the […] Another thought is that using […] Thanks for the feedback. Let me know if you'd like me to make any changes to move this PR forward. |
|
@isubasinghe - Following up on our earlier discussion — any update on the review? |
|
@cyyangnb I've had a chat with @Joibel and @eduardodbr and we've decided we don't want to proceed with this path until we understand how and why this issue occurs in the first place. After looking at the reflector, the informer writeback can almost certainly cause this. What I don't understand is how a small requeue time can cause this. |
|
Hi @isubasinghe, @Joibel, @eduardodbr — I think the root cause is the same reason you originally added the informer writeback mechanism: there’s always a delay between when the workflow-controller updates the status in Kubernetes and when the informer cache gets updated. Normally (when informer writeback is off), a workflow goes through something like this: […] If the controller processes the workflow between the version being generated and cached (during […]), it ends up acting on an outdated state. The informer writeback may shorten […] This PR aims to ensure the workflow isn’t processed during that window. Two implicit ideas behind this change are: […]
I noticed that PR #14949 attempts to remove the informer writeback entirely. However, our approach differs regarding the second point. |
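As a rough illustration of "skip processing until the informer catches up" (not the PR's actual code), the sketch below uses a client-go work queue; isStale is a hypothetical stand-in for whatever check decides that the informer's copy is older than the controller's last write.

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// wfQueue stands in for the controller's workflow work queue.
var wfQueue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

// isStale is a hypothetical check: is the informer still serving a version
// older than the one the controller last wrote for this workflow?
func isStale(key, informerResourceVersion string) bool {
	// ... would consult a record of the controller's own recent writes ...
	return false
}

// reconcile sketches the intended behaviour: if the cached copy is stale,
// skip the work for now and retry after a short delay, by which time the
// informer should have caught up.
func reconcile(key, informerResourceVersion string) {
	if isStale(key, informerResourceVersion) {
		wfQueue.AddAfter(key, 2*time.Second)
		return
	}
	// ... normal processing of the up-to-date workflow would go here ...
}
```

The 2-second delay is arbitrary and only for illustration.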
|
Hi @isubasinghe, |
I agree with your point of view. This issue is very important, especially in large-scale scenarios. Our scenario involves more than 20,000 workflows running in parallel. |
|
Hi @shuangkun, thank you for acknowledging that this issue is important. The reason I’m asking is that, in our environment, even with around 1,000 workflows and the default 10-second requeue interval, we’ve observed significant pressure on the Kubernetes API server. This has resulted in noticeable delays and performance degradation. I would like to understand whether your cluster experiences similar behavior at higher workflow volumes, or if your environment is able to handle this scale without issues. This information will help us determine whether our cluster might have an unexpected bottleneck that we need to investigate and resolve. If possible, could you also share some details about your cluster environment and its tier/size? Thank you! |
I did not modify the DEFAULT_REQUEUE_TIME parameter. |
|
Hi @shuangkun, in our setup each individual workflow has roughly 200 nodes (with around 40 pods), but compared to your scenario it feels like there may still be areas in our environment that could be further optimized. Otherwise, the gap between our results and yours seems unexpectedly large. We'll keep working on the investigation, and would really appreciate any additional information from you! |

Fixes #13986
Motivation
We have continuously encountered duplicate pods because the workflow-controller can create the same pod multiple times due to informer delays.
Modifications
Add a cache that stores the latest versions of recently processed workflows. When the informer is delayed, the controller can use this cache to determine whether the workflow status served by the informer is outdated and, if so, skip the action for now (a rough sketch of the idea follows below).
See the comment for more background:
#13986 (comment)
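A minimal sketch of the cache described above, assuming it maps a workflow key to the resourceVersion the controller last wrote; the names (processedVersions, Record, IsOutdated) are illustrative, not the PR's actual identifiers.

```go
package sketch

import (
	"sync"
	"time"
)

type versionEntry struct {
	resourceVersion string
	expiresAt       time.Time
}

// processedVersions remembers, for a limited time, the latest workflow
// version produced by the controller itself.
type processedVersions struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]versionEntry // "namespace/name" -> last version written
}

func newProcessedVersions(ttl time.Duration) *processedVersions {
	return &processedVersions{ttl: ttl, m: map[string]versionEntry{}}
}

// Record is called after the controller successfully updates a workflow.
func (p *processedVersions) Record(key, resourceVersion string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.m[key] = versionEntry{resourceVersion: resourceVersion, expiresAt: time.Now().Add(p.ttl)}
}

// IsOutdated reports whether the copy served by the informer is older than
// what the controller last wrote. resourceVersion is opaque, so only an
// equality check against our own write is used.
func (p *processedVersions) IsOutdated(key, informerVersion string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	e, ok := p.m[key]
	if !ok || time.Now().After(e.expiresAt) {
		delete(p.m, key)
		return false // nothing recorded (or expired): trust the informer copy
	}
	return informerVersion != e.resourceVersion
}
```

When IsOutdated returns true, the controller would skip the item and requeue it rather than acting on the stale copy; in this sketch, the TTL (configured by PROCESSED_WORKFLOW_VERSIONS_TTL) limits how long an entry can keep deferring processing if, say, another client legitimately updates the workflow in the meantime.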
Verification
For the current version, I have only done a happy-path test with a hello-world workflow:
However, I did a full test based on v3.6.2, which is the version we use in our production environment.
I created a workflow template massive-40 with 40 sequential steps (massive-40.txt). Each step has a number of parallel nodes defined by the workflow.
Each node inserts a record into a Postgres DB, and the last step then counts the records to check whether the total matches the expectation. For example, for the following workflow:
The last step expects to see 40 (sequential steps) * 10 (nodes per step) = 400 records; otherwise the workflow fails.
I submitted 50 such workflows in my cluster with PROCESSED_WORKFLOW_VERSIONS_TTL = 10m. Before the change (v3.6.2), all workflows failed. After the change, all workflows succeeded.
Here is the metrics comparison (left: before the fix, right: after the fix):
Documentation
Added description for PROCESSED_WORKFLOW_VERSIONS_TTL.