
Conversation

@guanguxiansheng
Contributor

Fixes #14791

Motivation

This tries to fix #14791. We ran into performance issues in production: controller processing degrades significantly when running workflows at scale, and adding a throttling queue does not meaningfully improve speed.

Modifications

Modify DeleteFunc in controller.go to use the pod informer's index data instead of querying the API server directly, which speeds up processing.

DeleteFunc: func(obj interface{}) {
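					// Old approach (kept commented out for reference): list the workflow's pods
					// straight from the API server, which is slow and blocks the event handler at scale.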
					//pods := wfc.kubeclientset.CoreV1().Pods(wfc.GetManagedNamespace())
					//podList, err := pods.List(ctx, metav1.ListOptions{
					//	LabelSelector: fmt.Sprintf("%s=%s", common.LabelKeyWorkflow, obj.(*unstructured.Unstructured).GetName()),
					//})
					//if err != nil {
					//	logger.WithError(err).Error(ctx, "Failed to list pods")
					//}
					//for _, p := range podList.Items {
					//	if slices.Contains(p.Finalizers, common.FinalizerPodStatus) {
					//		wfc.PodController.RemoveFinalizer(ctx, p.Namespace, p.Name)
					//	}
					//}

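					// New approach: resolve the workflow's pods from the pod informer's
					// workflow index, avoiding the API-server round trip entirely.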
					wf := obj.(*unstructured.Unstructured)
					podObjs, err := wfc.PodController.GetPodsByIndex(indexes.WorkflowIndex, indexes.WorkflowIndexValue(wf.GetNamespace(), wf.GetName()))
					if err != nil {
						logger.WithError(err).Error(ctx, "Failed to get pods by index")
					}
					for _, podObj := range podObjs {
						p, ok := podObj.(*apiv1.Pod)
						if !ok {
							logger.WithError(err).Error(ctx, "expected \"*apiv1.Pod\", got "+reflect.TypeOf(podObj).String())
							continue
						}
						if slices.Contains(p.Finalizers, common.FinalizerPodStatus) {
							wfc.PodController.RemoveFinalizer(ctx, p.Namespace, p.Name)
						}
					}
					key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
					if err == nil {
						wfc.releaseAllWorkflowLocks(ctx, obj)
						wfc.recordCompletedWorkflow(key)
						wfc.throttler.Remove(key)
					}
				}

Documentation

In shared_informer.go, a single goroutine drains nextCh and invokes AddFunc, UpdateFunc, and DeleteFunc. If the pods.List request inside DeleteFunc is slow, the next element on nextCh is delayed, which in turn blocks AddFunc and UpdateFunc.

func (p *processorListener) run() {
	sleepAfterCrash := false
	for next := range p.nextCh {
		if sleepAfterCrash {
			time.Sleep(time.Second)
		}
		func() {
			sleepAfterCrash = true
			defer utilruntime.HandleCrashWithLogger(p.logger)

			switch notification := next.(type) {
			case updateNotification:
				p.handler.OnUpdate(notification.oldObj, notification.newObj)
			case addNotification:
				p.handler.OnAdd(notification.newObj, notification.isInInitialList)
				if notification.isInInitialList {
					p.syncTracker.Finished()
				}
			case deleteNotification:
				p.handler.OnDelete(notification.oldObj)
			default:
				utilruntime.HandleErrorWithLogger(p.logger, nil, "unrecognized notification", "notificationType", fmt.Sprintf("%T", next))
			}
			sleepAfterCrash = false
		}()
	}
}

@siwet
Contributor

siwet commented Sep 21, 2025

@guanguxiansheng We encountered similar performance issues when migrating from version 3.5 to 3.6, and applying a change along these lines resolved the problem for us as well. It looks like this deletion operation introduces significant overhead and slows down the event handler loop.
cc @shuangkun Could you please help review this PR?

@shuangkun
Member

Indeed, this is a good discovery. In the informer's design, time-consuming operations should not live in the event handler but in the workqueue.
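
A minimal sketch of that pattern, assuming a client-go rate-limited workqueue: the DeleteFunc only derives a key and enqueues it, and a separate worker goroutine performs the slow cleanup. The names onWorkflowDelete, runCleanupWorker, and removeWorkflowPodFinalizers are illustrative, not the controller's actual functions.

package main

import (
	"context"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// cleanupQueue decouples the slow cleanup from the informer's event-dispatch goroutine.
var cleanupQueue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

// onWorkflowDelete runs inside the informer's single dispatch goroutine, so it
// must return quickly: it only computes a key and enqueues it.
func onWorkflowDelete(obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		return
	}
	cleanupQueue.Add(key)
}

// runCleanupWorker drains the queue in its own goroutine; the expensive work
// (listing pods, removing finalizers) happens here without delaying other events.
func runCleanupWorker(ctx context.Context) {
	for {
		item, shutdown := cleanupQueue.Get()
		if shutdown {
			return
		}
		removeWorkflowPodFinalizers(ctx, item.(string)) // stub for the slow cleanup
		cleanupQueue.Done(item)
	}
}

// removeWorkflowPodFinalizers stands in for the per-pod finalizer removal.
func removeWorkflowPodFinalizers(ctx context.Context, workflowKey string) {}

func main() {
	go runCleanupWorker(context.Background())
	// In real code, onWorkflowDelete is registered via
	// cache.ResourceEventHandlerFuncs{DeleteFunc: onWorkflowDelete}.
	select {}
}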

@shuangkun
Member

Getting pods from the informer does speed things up, but there's a problem: under high load, the local informer may lag behind the API server. This operation cannot tolerate that inconsistency (unlike reconciliation, where a late pod informer update is acceptable), so it could leak pods whose finalizers are never removed.
I think the best approach is to separate this operation, for example by defining a RemoveBatchPodsFinalizer action and handling it in the pod cleanup queue. What do you think? @Joibel @isubasinghe
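
For illustration only, a rough sketch of what such a batched action could look like, with a hypothetical removeBatchPodsFinalizer action and a namespace/workflowName/action key format; none of these names are the controller's existing pod cleanup API.

package main

import (
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/util/workqueue"
)

// removeBatchPodsFinalizer is a hypothetical cleanup action: one queue item per
// deleted workflow, expanded into per-pod finalizer removals by the worker.
const removeBatchPodsFinalizer = "removeBatchPodsFinalizer"

// enqueueBatchFinalizerRemoval is what the workflow DeleteFunc would do: a single
// cheap Add instead of N finalizer removals inline.
func enqueueBatchFinalizerRemoval(queue workqueue.RateLimitingInterface, namespace, wfName string) {
	queue.Add(fmt.Sprintf("%s/%s/%s", namespace, wfName, removeBatchPodsFinalizer))
}

// handleCleanupItem runs in the pod cleanup worker, where slow work (index lookup
// or API list, then one RemoveFinalizer call per pod) is acceptable.
func handleCleanupItem(ctx context.Context, key string) {
	parts := strings.SplitN(key, "/", 3)
	if len(parts) == 3 && parts[2] == removeBatchPodsFinalizer {
		removeFinalizersForWorkflowPods(ctx, parts[0], parts[1]) // stub for the expensive part
	}
}

func removeFinalizersForWorkflowPods(ctx context.Context, namespace, wfName string) {}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	enqueueBatchFinalizerRemoval(q, "argo", "my-workflow")
	item, _ := q.Get()
	handleCleanupItem(context.Background(), item.(string))
	q.Done(item)
}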

Comment on lines 984 to 989
wf := obj.(*unstructured.Unstructured)
podObjs, err := wfc.PodController.GetPodsByIndex(indexes.WorkflowIndex, indexes.WorkflowIndexValue(wf.GetNamespace(), wf.GetName()))
if err != nil {
-	logger.WithError(err).Error(ctx, "Failed to list pods")
+	logger.WithError(err).Error(ctx, "Failed to get pods by index")
}
-for _, p := range podList.Items {
+for _, podObj := range podObjs {
Member

This is fine. Should we maybe also add a fallback in case the pods aren't in the index?
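
One possible shape for such a fallback, sketched under the assumption that the caller has both the index lookup and a Kubernetes clientset available. getPodsByIndex is a stand-in for the controller's own lookup, and the label literal corresponds to common.LabelKeyWorkflow in the snippet above; this is a sketch, not the merged implementation.

package controller

import (
	"context"
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsForWorkflow tries the informer index first and only falls back to a direct
// API-server List (the old, slower path) when the index yields nothing.
func podsForWorkflow(ctx context.Context, kubeclient kubernetes.Interface,
	getPodsByIndex func(namespace, wfName string) ([]*apiv1.Pod, error),
	namespace, wfName string) ([]*apiv1.Pod, error) {
	// Fast path: informer index, no API-server round trip.
	pods, err := getPodsByIndex(namespace, wfName)
	if err == nil && len(pods) > 0 {
		return pods, nil
	}
	// Fallback: list from the API server by workflow label, as the old code did.
	podList, listErr := kubeclient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("workflows.argoproj.io/workflow=%s", wfName),
	})
	if listErr != nil {
		return nil, listErr
	}
	out := make([]*apiv1.Pod, 0, len(podList.Items))
	for i := range podList.Items {
		out = append(out, &podList.Items[i])
	}
	return out, nil
}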

@guanguxiansheng
Contributor Author

guanguxiansheng commented Sep 26, 2025

To solve the problem of incomplete index data, there are two ideas: 1. in the Workflow EventHandler DeleteFunc, extract the pod names from the workflow status and then call PodController.RemoveFinalizer; 2. call PodController.RemoveFinalizer directly in the Pod EventHandler DeleteFunc. What do you think? @shuangkun @Joibel @isubasinghe @siwet

@shuangkun
Member

To solve the problem of incomplete index data, there are two ideas: 1. in the Workflow EventHandler DeleteFunc, extract the pod names from the workflow status and then call PodController.RemoveFinalizer; 2. call PodController.RemoveFinalizer directly in the Pod EventHandler DeleteFunc. What do you think? @shuangkun @Joibel @isubasinghe @siwet

I'd prefer to move this loop out of the event handler and into the workqueue. However, I think this approach is also acceptable, since clusters generally don't run both very large workflows and very many workflows at the same time.

@guanguxiansheng
Contributor Author

@shuangkun @isubasinghe Hi, I chose to get the pod information from wf.status.nodes, and the finalizers are then removed asynchronously via the queue. I noticed that the write-back feature is disabled by default (#15079); when the informer receives the deletion event, the latest workflow status still contains information about all pods. We've enabled finalizers in our production environment, and 20,000 workflows are running normally in real time.
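
For context, a sketch of that flow, assuming pod names can be recovered from the Pod-type node entries in status.nodes. podNameForNode and enqueueFinalizerRemoval are illustrative stand-ins, and real code has to handle the v1/v2 pod-name formats, which is elided here.

package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// podFinalizerCleanupFromStatus reads pod nodes out of the deleted workflow's
// status, so no API-server or informer lookup is needed in the event handler;
// the actual finalizer removal is handed to a queue via enqueueFinalizerRemoval.
func podFinalizerCleanupFromStatus(ctx context.Context, un *unstructured.Unstructured, enqueueFinalizerRemoval func(namespace, podName string)) {
	nodes, found, err := unstructured.NestedMap(un.Object, "status", "nodes")
	if err != nil || !found {
		return
	}
	for _, n := range nodes {
		node, ok := n.(map[string]interface{})
		if !ok {
			continue
		}
		// Only Pod-type nodes correspond to real pods carrying the status finalizer.
		if nodeType, _, _ := unstructured.NestedString(node, "type"); nodeType != "Pod" {
			continue
		}
		podName := podNameForNode(un.GetName(), node) // hypothetical: version-dependent derivation
		if podName == "" {
			continue
		}
		enqueueFinalizerRemoval(un.GetNamespace(), podName)
	}
}

// podNameForNode is a stub; real code must account for POD_NAMES v1/v2 formats.
func podNameForNode(workflowName string, node map[string]interface{}) string {
	name, _, _ := unstructured.NestedString(node, "id")
	return name
}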
