Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Using RayJob with KubeRay v1.3.2: when the head pod of the RayCluster terminates abnormally (e.g. OOMKilled, evicted), the cluster gets restarted during reconciliation (presumably losing the submitted Ray jobs). During this process the submitter pod somehow ends up in the Completed state, which leads to a resource leak: the RayCluster and the RayJob stay "Running" indefinitely even though there is no Ray job in the cluster at all.
RayJob parameters:
backoffLimit = 0, submitterConfig.backoffLimit = 0, restartPolicy = Never
# kubectl get rayjob
NAME                                   JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                        START TIME             END TIME   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23   RUNNING      Running             2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh   2025-07-24T08:23:57Z              19h
78ff5104-413f-4208-b4d6-7d894a6f17f8   RUNNING      Running             78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4   2025-07-24T08:24:15Z              19h
a2e91677-229a-49f2-8e2b-0b59d513d694   RUNNING      Running             a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2   2025-07-24T08:24:20Z              19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271   RUNNING      Running             a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4   2025-07-24T08:24:10Z              19h
c3156a39-625a-4201-9b04-5030b84c6d8d   RUNNING      Running             c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z   2025-07-24T08:24:17Z              19h
d0571c80-7162-40e4-a0c5-617cd2d11b54   RUNNING      Running             d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9   2025-07-24T08:24:12Z              19h
# kubectl get raycluster
NAME                                                     DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh                                          55     500Gi    0      ready    19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4                                          55     500Gi    0      ready    19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2                                          55     500Gi    0      ready    19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4                                          55     500Gi    0      ready    19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z                                          55     500Gi    0      ready    19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9                                          55     500Gi    0      ready    19h
# kubectl get pod
NAME                                                             READY   STATUS      RESTARTS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-gszzs                       0/1     Completed   0          19h
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8h-head-44hhb    1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg-head-6kf6z    1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-xb9k9                       0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-k49pd                       0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q-head-llhvf    1/1     Running     0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-27n5b                       0/1     Completed   0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vf-head-7rltl    1/1     Running     0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-2l5mb                       0/1     Completed   0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f6-head-dd6lv    1/1     Running     0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-4vsqv                       0/1     Completed   0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-td-head-54925    1/1     Running     0          19h
Reproduction script
NA
Anything else
Is this because the reconciler only checks the JobFailed condition? Shouldn't it also check other conditions (e.g. JobComplete) to prevent the leak? A rough sketch of what I mean follows the snippet below.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 916 to 935 in ea0b9c5
func checkK8sJobAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, job *batchv1.Job) bool {
    logger := ctrl.LoggerFrom(ctx)
    for _, cond := range job.Status.Conditions {
        if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
            logger.Info("The submitter Kubernetes Job has failed. Attempting to transition the status to `Failed`.", "Submitter K8s Job", job.Name, "Reason", cond.Reason, "Message", cond.Message)
            rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
            // The submitter Job needs to wait for the user code to finish and retrieve its logs.
            // Therefore, a failed Submitter Job indicates that the submission itself has failed or the user code has thrown an error.
            // If the failure is due to user code, the JobStatus and Job message will be updated accordingly from the previous reconciliation.
            if rayJob.Status.JobStatus == rayv1.JobStatusFailed {
                rayJob.Status.Reason = rayv1.AppFailed
            } else {
                rayJob.Status.Reason = rayv1.SubmissionFailed
                rayJob.Status.Message = fmt.Sprintf("Job submission has failed. Reason: %s. Message: %s", cond.Reason, cond.Message)
            }
            return true
        }
    }
    return false
}
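To make the idea concrete, here is a rough sketch (not existing KubeRay code; the helper name `checkSubmitterJobCompleteLeak` and the message text are made up, only the types and constants come from the snippet above): if the submitter Job reports a JobComplete condition while the Ray job itself never reached a terminal status, transition the RayJob to Failed instead of leaving it "Running" forever.

```go
package ray

import (
    "context"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// checkSubmitterJobCompleteLeak is a hypothetical helper (not existing KubeRay code).
// It treats a Complete submitter Job as a failure whenever the Ray job itself never
// reached a terminal status, so the RayJob does not stay "Running" forever.
func checkSubmitterJobCompleteLeak(ctx context.Context, rayJob *rayv1.RayJob, job *batchv1.Job) bool {
    logger := ctrl.LoggerFrom(ctx)
    for _, cond := range job.Status.Conditions {
        if cond.Type != batchv1.JobComplete || cond.Status != corev1.ConditionTrue {
            continue
        }
        // The submitter Job is only expected to complete after the Ray job finishes,
        // so a terminal JobStatus means everything is fine.
        terminal := rayJob.Status.JobStatus == rayv1.JobStatusSucceeded ||
            rayJob.Status.JobStatus == rayv1.JobStatusFailed ||
            rayJob.Status.JobStatus == rayv1.JobStatusStopped
        if terminal {
            return false
        }
        // The submitter finished but the Ray job never reached a terminal state
        // (e.g. the head pod was OOMKilled and the cluster restarted): fail the RayJob.
        logger.Info("The submitter Kubernetes Job completed but the Ray job is not terminal.",
            "Submitter K8s Job", job.Name, "RayJobStatus", rayJob.Status.JobStatus)
        rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
        rayJob.Status.Reason = rayv1.SubmissionFailed
        rayJob.Status.Message = "The submitter Job completed before the Ray job reached a terminal state."
        return true
    }
    return false
}
```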
Or should the RayJob status also be updated when the operator fails to retrieve the Ray job? A similar sketch follows the snippet below.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 271 to 284 in ea0b9c5
jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
    // If the Ray job was not found, GetJobInfo returns a BadRequest error.
    if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode && errors.IsBadRequest(err) {
        logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
        if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
            logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
            return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
        }
        return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
    }
    logger.Error(err, "Failed to get job info", "JobId", rayJobInstance.Status.JobId)
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
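Along the same lines, here is a hedged sketch of what updating the RayJob status on that error path could look like (again hypothetical; the helper `markRayJobFailedIfVanished` is made up): in K8sJobMode the operator never re-submits, so a "job not found" response could fail the RayJob instead of being requeued indefinitely while the rebuilt cluster sits idle.

```go
package ray

import (
    "context"
    "fmt"

    "k8s.io/apimachinery/pkg/api/errors"
    ctrl "sigs.k8s.io/controller-runtime"

    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// markRayJobFailedIfVanished is a hypothetical helper (not existing KubeRay code).
// It could be called from the error branch above: when GetJobInfo reports that the
// job does not exist and the submission mode is K8sJobMode (so the operator will
// never re-submit), the RayJob is transitioned to Failed instead of being requeued
// forever.
func markRayJobFailedIfVanished(ctx context.Context, rayJob *rayv1.RayJob, getJobInfoErr error) bool {
    logger := ctrl.LoggerFrom(ctx)
    // GetJobInfo returns a BadRequest error when the Ray job is not found.
    if rayJob.Spec.SubmissionMode != rayv1.K8sJobMode || !errors.IsBadRequest(getJobInfoErr) {
        return false
    }
    logger.Info("The Ray job was not found and will not be re-submitted in K8sJobMode.",
        "JobId", rayJob.Status.JobId)
    rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
    rayJob.Status.Reason = rayv1.SubmissionFailed
    rayJob.Status.Message = fmt.Sprintf("Ray job %s was not found on the cluster; the head pod may have restarted.", rayJob.Status.JobId)
    return true
}
```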
Are you willing to submit a PR?
- Yes I am willing to submit a PR!