
[Bug] RayCluster leaks in RayJob when the cluster terminated exceptionally #3860

@anxietymonger

Description


Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Using RayJob with KubeRay v1.3.2, when the head pod of the RayCluster terminates exceptionally (e.g. OOMKilled or evicted), the cluster gets restarted during reconciliation (losing the submitted Ray jobs, I presume). During this process, the submitter pod somehow ends up in the Complete state, leading to a resource leak: the RayCluster and RayJob stay "Running" indefinitely even though there is no Ray job at all in the cluster.

RayJob parameters:
backoffLimit = 0, submitterConfig.backoffLimit = 0, restartPolicy = Never
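For concreteness, this is roughly how those parameters map onto the rayv1 Go API. This is only a sketch based on my reading of github.com/ray-project/kuberay/ray-operator/apis/ray/v1; field names may differ slightly across KubeRay versions.

// Sketch only: the RayJob spec fields I set for these jobs. Field names are assumed
// from my reading of the rayv1 API and may not match every KubeRay version exactly.
package rayjobexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func exampleRayJobSpec() rayv1.RayJobSpec {
	return rayv1.RayJobSpec{
		// No RayJob-level retries.
		BackoffLimit: ptr.To[int32](0),
		// No retries for the submitter Kubernetes Job either.
		SubmitterConfig: &rayv1.SubmitterConfig{
			BackoffLimit: ptr.To[int32](0),
		},
		// The submitter pod is never restarted by the kubelet.
		SubmitterPodTemplate: &corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
			},
		},
		// Entrypoint, RayClusterSpec, etc. omitted here.
	}
}

This is the leaked state, roughly 19 hours after the head pods were killed: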

# kubectl get rayjob
NAME                                   JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                        START TIME             END TIME   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23   RUNNING      Running             2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh   2025-07-24T08:23:57Z              19h
78ff5104-413f-4208-b4d6-7d894a6f17f8   RUNNING      Running             78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4   2025-07-24T08:24:15Z              19h
a2e91677-229a-49f2-8e2b-0b59d513d694   RUNNING      Running             a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2   2025-07-24T08:24:20Z              19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271   RUNNING      Running             a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4   2025-07-24T08:24:10Z              19h
c3156a39-625a-4201-9b04-5030b84c6d8d   RUNNING      Running             c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z   2025-07-24T08:24:17Z              19h
d0571c80-7162-40e4-a0c5-617cd2d11b54   RUNNING      Running             d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9   2025-07-24T08:24:12Z              19h
# kubectl get raycluster
NAME                                                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh                                         55     500Gi    0      ready    19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4                                         55     500Gi    0      ready    19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2                                         55     500Gi    0      ready    19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4                                         55     500Gi    0      ready    19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z                                         55     500Gi    0      ready    19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9                                         55     500Gi    0      ready    19h
# kubectl get pod
NAME                                                            READY   STATUS      RESTARTS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-gszzs                      0/1     Completed   0          19h
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8h-head-44hhb   1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg-head-6kf6z   1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-xb9k9                      0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-k49pd                      0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q-head-llhvf   1/1     Running     0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-27n5b                      0/1     Completed   0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vf-head-7rltl   1/1     Running     0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-2l5mb                      0/1     Completed   0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f6-head-dd6lv   1/1     Running     0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-4vsqv                      0/1     Completed   0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-td-head-54925   1/1     Running     0          19h

Reproduction script

NA

Anything else

Is this because the reconciler only checks the JobFailed condition? Shouldn't it also check the other conditions to prevent the leak?

func checkK8sJobAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, job *batchv1.Job) bool {
	logger := ctrl.LoggerFrom(ctx)
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			logger.Info("The submitter Kubernetes Job has failed. Attempting to transition the status to `Failed`.", "Submitter K8s Job", job.Name, "Reason", cond.Reason, "Message", cond.Message)
			rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
			// The submitter Job needs to wait for the user code to finish and retrieve its logs.
			// Therefore, a failed Submitter Job indicates that the submission itself has failed or the user code has thrown an error.
			// If the failure is due to user code, the JobStatus and Job message will be updated accordingly from the previous reconciliation.
			if rayJob.Status.JobStatus == rayv1.JobStatusFailed {
				rayJob.Status.Reason = rayv1.AppFailed
			} else {
				rayJob.Status.Reason = rayv1.SubmissionFailed
				rayJob.Status.Message = fmt.Sprintf("Job submission has failed. Reason: %s. Message: %s", cond.Reason, cond.Message)
			}
			return true
		}
	}
	return false
}
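For illustration only (not proposing this as the actual fix; the helper below is my own and not KubeRay code, and it reuses the rayv1/batchv1/corev1 imports from the excerpt above): a minimal sketch of also reacting to the Complete condition when the Ray job itself never reached a terminal state, which is exactly the leaked state shown above.

// Hypothetical helper, sketching the idea above: the submitter K8s Job reports
// Complete, yet the Ray job was never observed to reach a terminal JobStatus.
// In that case the RayJob should arguably not stay Running forever.
func submitterCompletedWithoutTerminalRayJob(rayJob *rayv1.RayJob, job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
			terminal := rayJob.Status.JobStatus == rayv1.JobStatusSucceeded ||
				rayJob.Status.JobStatus == rayv1.JobStatusFailed
			return !terminal
		}
	}
	return false
}

If something like this returned true, the reconciler could transition JobDeploymentStatus to a terminal state instead of leaving the RayJob and its RayCluster running.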

Or should it update the RayJob status when it fails to retrieve the Ray job?

jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
	// If the Ray job was not found, GetJobInfo returns a BadRequest error.
	if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode && errors.IsBadRequest(err) {
		logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
		if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
			logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
		}
		return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
	}
	logger.Error(err, "Failed to get job info", "JobId", rayJobInstance.Status.JobId)
	return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
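Again purely as an illustration (the helper and the failure-count threshold below are made up, not existing KubeRay behavior): if GetJobInfo keeps failing with something other than BadRequest, e.g. because the head pod was recreated and the job is simply gone, the reconciler could eventually give up and surface that on the RayJob status instead of requeueing forever.

// Hypothetical sketch: after `failures` consecutive GetJobInfo errors (threshold is an
// assumed knob, not an existing KubeRay option), mark the RayJob as failed so that the
// RayCluster can be cleaned up instead of leaking.
func markFailedAfterRepeatedGetJobInfoErrors(rayJob *rayv1.RayJob, failures, threshold int, lastErr error) bool {
	if failures < threshold {
		return false
	}
	rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJob.Status.Reason = rayv1.SubmissionFailed
	rayJob.Status.Message = "Repeatedly failed to retrieve the Ray job info from the dashboard: " + lastErr.Error()
	return true
}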

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


Labels

1.5.0, P0 (Critical issue that should be fixed ASAP), bug (Something isn't working), rayjob
