
[Bug] RayCluster leaks in RayJob when the cluster terminated exceptionally #3860

@anxietymonger

Description


Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Using RayJob with KubeRay v1.3.2, when the head pod of the RayCluster terminates exceptionally (e.g. OOMKilled or evicted), the cluster gets restarted during reconciliation (losing the submitted Ray jobs, I presume). During this process, the submitter pod somehow ends up in the Complete state, leading to a resource leak: the RayCluster and RayJob stay "Running" indefinitely even though there is no Ray job at all in the cluster.

RayJob parameters:
backoffLimit = 0, submitterConfig.backoffLimit = 0, restartPolicy = Never
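For concreteness, this is roughly how those parameters map onto the rayv1 Go API. This is only a sketch based on my reading of github.com/ray-project/kuberay/ray-operator/apis/ray/v1; field names may differ slightly across KubeRay versions.

// Sketch only: the RayJob spec fields I set for these jobs. Field names are assumed
// from my reading of the rayv1 API and may not match every KubeRay version exactly.
package rayjobexample

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func exampleRayJobSpec() rayv1.RayJobSpec {
	return rayv1.RayJobSpec{
		// No RayJob-level retries.
		BackoffLimit: ptr.To[int32](0),
		// No retries for the submitter Kubernetes Job either.
		SubmitterConfig: &rayv1.SubmitterConfig{
			BackoffLimit: ptr.To[int32](0),
		},
		// The submitter pod is never restarted by the kubelet.
		SubmitterPodTemplate: &corev1.PodTemplateSpec{
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
			},
		},
		// Entrypoint, RayClusterSpec, etc. omitted here.
	}
}

This is the leaked state, roughly 19 hours after the head pods were killed: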

# kubectl get rayjob
NAME                                   JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                                        START TIME             END TIME   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23   RUNNING      Running             2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh   2025-07-24T08:23:57Z              19h
78ff5104-413f-4208-b4d6-7d894a6f17f8   RUNNING      Running             78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4   2025-07-24T08:24:15Z              19h
a2e91677-229a-49f2-8e2b-0b59d513d694   RUNNING      Running             a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2   2025-07-24T08:24:20Z              19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271   RUNNING      Running             a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4   2025-07-24T08:24:10Z              19h
c3156a39-625a-4201-9b04-5030b84c6d8d   RUNNING      Running             c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z   2025-07-24T08:24:17Z              19h
d0571c80-7162-40e4-a0c5-617cd2d11b54   RUNNING      Running             d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9   2025-07-24T08:24:12Z              19h
# kubectl get raycluster
NAME                                                    DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8htlh                                         55     500Gi    0      ready    19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg8r4                                         55     500Gi    0      ready    19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q7c2                                         55     500Gi    0      ready    19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vfmr4                                         55     500Gi    0      ready    19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f695z                                         55     500Gi    0      ready    19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-tdcq9                                         55     500Gi    0      ready    19h
# kubectl get pod
NAME                                                            READY   STATUS      RESTARTS   AGE
2ff7dc07-d00c-490e-a64e-a463f442fe23-gszzs                      0/1     Completed   0          19h
2ff7dc07-d00c-490e-a64e-a463f442fe23-raycluster-8h-head-44hhb   1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-raycluster-tg-head-6kf6z   1/1     Running     0          19h
78ff5104-413f-4208-b4d6-7d894a6f17f8-xb9k9                      0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-k49pd                      0/1     Completed   0          19h
a2e91677-229a-49f2-8e2b-0b59d513d694-raycluster-6q-head-llhvf   1/1     Running     0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-27n5b                      0/1     Completed   0          19h
a9d6f1cf-afe6-469c-942c-a4fdf94f5271-raycluster-vf-head-7rltl   1/1     Running     0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-2l5mb                      0/1     Completed   0          19h
c3156a39-625a-4201-9b04-5030b84c6d8d-raycluster-f6-head-dd6lv   1/1     Running     0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-4vsqv                      0/1     Completed   0          19h
d0571c80-7162-40e4-a0c5-617cd2d11b54-raycluster-td-head-54925   1/1     Running     0          19h

Reproduction script

NA

Anything else

Is this because the reconciler only checks the JobFailed condition? Shouldn't it also check the other conditions to prevent the leak?

func checkK8sJobAndUpdateStatusIfNeeded(ctx context.Context, rayJob *rayv1.RayJob, job *batchv1.Job) bool {
	logger := ctrl.LoggerFrom(ctx)
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
			logger.Info("The submitter Kubernetes Job has failed. Attempting to transition the status to `Failed`.", "Submitter K8s Job", job.Name, "Reason", cond.Reason, "Message", cond.Message)
			rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
			// The submitter Job needs to wait for the user code to finish and retrieve its logs.
			// Therefore, a failed Submitter Job indicates that the submission itself has failed or the user code has thrown an error.
			// If the failure is due to user code, the JobStatus and Job message will be updated accordingly from the previous reconciliation.
			if rayJob.Status.JobStatus == rayv1.JobStatusFailed {
				rayJob.Status.Reason = rayv1.AppFailed
			} else {
				rayJob.Status.Reason = rayv1.SubmissionFailed
				rayJob.Status.Message = fmt.Sprintf("Job submission has failed. Reason: %s. Message: %s", cond.Reason, cond.Message)
			}
			return true
		}
	}
	return false
}
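For illustration only (not proposing this as the actual fix; the helper below is my own and not KubeRay code, and it reuses the rayv1/batchv1/corev1 imports from the excerpt above): a minimal sketch of also reacting to the Complete condition when the Ray job itself never reached a terminal state, which is exactly the leaked state shown above.

// Hypothetical helper, sketching the idea above: the submitter K8s Job reports
// Complete, yet the Ray job was never observed to reach a terminal JobStatus.
// In that case the RayJob should arguably not stay Running forever.
func submitterCompletedWithoutTerminalRayJob(rayJob *rayv1.RayJob, job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
			terminal := rayJob.Status.JobStatus == rayv1.JobStatusSucceeded ||
				rayJob.Status.JobStatus == rayv1.JobStatusFailed
			return !terminal
		}
	}
	return false
}

If something like this returned true, the reconciler could transition JobDeploymentStatus to a terminal state instead of leaving the RayJob and its RayCluster running.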

Or should it update the RayJob status when it fails to retrieve the Ray job?

jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId)
if err != nil {
	// If the Ray job was not found, GetJobInfo returns a BadRequest error.
	if rayJobInstance.Spec.SubmissionMode == rayv1.HTTPMode && errors.IsBadRequest(err) {
		logger.Info("The Ray job was not found. Submit a Ray job via an HTTP request.", "JobId", rayJobInstance.Status.JobId)
		if _, err := rayDashboardClient.SubmitJob(ctx, rayJobInstance); err != nil {
			logger.Error(err, "Failed to submit the Ray job", "JobId", rayJobInstance.Status.JobId)
			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
		}
		return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
	}
	logger.Error(err, "Failed to get job info", "JobId", rayJobInstance.Status.JobId)
	return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
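Again purely as an illustration (the helper and the failure-count threshold below are made up, not existing KubeRay behavior): if GetJobInfo keeps failing with something other than BadRequest, e.g. because the head pod was recreated and the job is simply gone, the reconciler could eventually give up and surface that on the RayJob status instead of requeueing forever.

// Hypothetical sketch: after `failures` consecutive GetJobInfo errors (threshold is an
// assumed knob, not an existing KubeRay option), mark the RayJob as failed so that the
// RayCluster can be cleaned up instead of leaking.
func markFailedAfterRepeatedGetJobInfoErrors(rayJob *rayv1.RayJob, failures, threshold int, lastErr error) bool {
	if failures < threshold {
		return false
	}
	rayJob.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
	rayJob.Status.Reason = rayv1.SubmissionFailed
	rayJob.Status.Message = "Repeatedly failed to retrieve the Ray job info from the dashboard: " + lastErr.Error()
	return true
}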

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


Labels

1.5.0, P0 (Critical issue that should be fixed ASAP), bug (Something isn't working), rayjob
