
Fail job for non-retryable exit codes #2071

Merged

google-oss-prow[bot] merged 8 commits into kubeflow:master from kellyaa:master on Apr 26, 2024

Conversation

@kellyaa (Contributor) commented Apr 18, 2024

Fixes #2044

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@coveralls commented Apr 18, 2024

Pull Request Test Coverage Report for Build 8850191138

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 21 (0.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.003%) to 35.303%

Changes missing coverage:
  • pkg/controller.v1/common/pod.go — 0 of 21 changed/added lines covered (0.0%)
Files with coverage reduction:
  • pkg/controller.v1/common/pod.go — 1 new missed line (2.77%)
Totals:
  • Change from base Build 8839866996: -0.003%
  • Covered Lines: 4380
  • Relevant Lines: 12407

💛 - Coveralls

@tenzen-y (Member) left a comment

Comment on lines +366 to +372
} else if pod.Status.Phase == v1.PodFailed &&
(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
logger.Infof("Pod has a non-retryable exit code. Failing job. %v %v", pod.Namespace, pod.Name)
msg := fmt.Sprintf("job %s is failing because %s replica(s) failed.",
metaObject.GetName(), rType)
jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)

Could you improve debuggability?

Suggested change
} else if pod.Status.Phase == v1.PodFailed &&
(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
logger.Infof("Pod has a non-retryable exit code. Failing job. %v %v", pod.Namespace, pod.Name)
msg := fmt.Sprintf("job %s is failing because %s replica(s) failed.",
metaObject.GetName(), rType)
jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
} else if pod.Status.Phase == v1.PodFailed &&
(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
logger.Infof("Pod %q has a non-retryable exit code. Failing job.", klog.KObj(pod))
msg := fmt.Sprintf("job %q is failing because %q replica(s) failed.",
metaObject.GetName(), rType)
jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)

Additionally, could you add another level nest?

		if pod.Status.Phase == v1.PodFailed {
			failedPodsCount.Inc()
			if spec.RestartPolicy == apiv1.RestartPolicyExitCode && trainutil.IsRetryableExitCode(exitCode) ||
					spec.RestartPolicy == apiv1.RestartPolicyOnFailure ||
					spec.RestartPolicy == apiv1.RestartPolicyAlways {
				// Existing Codes
			} else if spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode) {
				// New Codes
			}
		}

@andreyvelich (Member) commented

@kellyaa Were you able to find time to implement integration test for your change ?
We want to cut the first RC.0 soon, and it would be nice to include your change there.

@kellyaa (Contributor, Author) commented Apr 23, 2024

@andreyvelich Working on it as we speak and hope to get it in today. Please let me know if this is holding up the train.

@johnugeorge (Member) commented

Thanks @kellyaa

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/XS labels Apr 23, 2024
@kellyaa (Contributor, Author) commented Apr 23, 2024

Test case added and updated with improved logging statements. Ready for re-review!

… testing

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@tenzen-y (Member) commented Apr 24, 2024

@kellyaa The new test case is not passing:


• [FAILED] [30.003 seconds]
TFJob controller Test Unretryable Exit Code [It] should set the job status to Failed
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/pod_test.go:278

https://github.com/kubeflow/training-operator/actions/runs/8807089592/job/24174443567?pr=2071#step:4:609

So, could you fix that problem?

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@kellyaa (Contributor, Author) commented Apr 24, 2024

Resubmitted! Thanks for your patience as I ramp up on this code!

@tedhtchang commented

/test all

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@tedhtchang commented

/test all

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
@kellyaa (Contributor, Author) commented Apr 25, 2024

/test all

@andreyvelich (Member) commented

@kellyaa Please rebase your PR to get the latest fix for CI

@kellyaa (Contributor, Author) commented Apr 26, 2024

/test all

@tenzen-y (Member) left a comment

@kellyaa Thank you for this update!
Overall lgtm

Comment on lines +306 to +318
Eventually(func() bool {
updatedJob := &kubeflowv1.TFJob{}
err := testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)
if err != nil {
return false
}
for _, condition := range updatedJob.Status.Conditions {
if condition.Type == kubeflowv1.JobFailed && condition.Status == corev1.ConditionTrue {
return true
}
}
return false
}, testutil.Timeout, testutil.Interval).Should(BeTrue(), "TFJob should be in Failed state")

Suggested change
Eventually(func() bool {
updatedJob := &kubeflowv1.TFJob{}
err := testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)
if err != nil {
return false
}
for _, condition := range updatedJob.Status.Conditions {
if condition.Type == kubeflowv1.JobFailed && condition.Status == corev1.ConditionTrue {
return true
}
}
return false
}, testutil.Timeout, testutil.Interval).Should(BeTrue(), "TFJob should be in Failed state")
Eventually(func(g Gomega) {
updatedJob := &kubeflowv1.TFJob{}
g.Expect(testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)).Should(Succeed())
g.Expect(updatedJob.Status.Conditions).Should(ContainElements(
BeComparableTo(metav1.Condition{
Type: kubeflowv1.JobFailed,
Status: corev1.ConditionTrue,
}, cmpopts.IgnoreFields(metav1.Condition{}, "LastTransitionTime", "Reason", "Message", "ObservedGeneration"), "TFJob should be in Failed state"),
))
}, testutil.Timeout, testutil.Interval).Should(Succeed())

Could you improve debuggability like this?

@kellyaa (Contributor, Author):

Done!

},
},
})
Expect(testK8sClient.Status().Update(ctx, created))
@tenzen-y (Member):

Suggested change
Expect(testK8sClient.Status().Update(ctx, created))
Expect(testK8sClient.Status().Update(ctx, created)).Should(Succeed())

@kellyaa (Contributor, Author):

Corrected!

})
Expect(testK8sClient.Status().Update(ctx, created))

_ = reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)
@tenzen-y (Member):

Suggested change
_ = reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)
Expect(reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)).Should(Succeed())

@kellyaa (Contributor, Author):

I've noticed that if I use Expect here when testing locally, it fails because the reconciler tries to recreate the pod that exists. If I take this line out completely, it still succeeds on my laptop. However, it fails in the 1.27.1 GH Action tests.

@tenzen-y (Member):

Oh, I see. Thank you for sharing that.
Indeed, our tests have some unintended and hard-to-understand issues...

So, let's keep using your implementations for now.

@tedhtchang left a comment

/LGTM Verified with the following:

git clone --single-branch --branch master https://github.com/kellyaa/training-operator.git
cd training-operator
export INGRESS_NGINX_VERSION=controller-v1.9.6

# Setup KinD
echo "Creating KinD cluster"
kind delete cluster -n training-operator-cluster
cat <<EOF | kind create cluster --name training-operator-cluster --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.25.3@sha256:f52781bc0d7a19fb6c405c2af83abfeb311f130707a0e219175677e366cc45d1
    extraPortMappings:
    - containerPort: 80
      hostPort: 80
      protocol: TCP
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
EOF
echo "Deploying Ingress controller into KinD cluster"
curl https://raw.githubusercontent.com/kubernetes/ingress-nginx/"${INGRESS_NGINX_VERSION}"/deploy/static/provider/kind/deploy.yaml | sed "s/--publish-status-address=localhost/--report-node-internal-ip-address\\n        - --status-update-interval=10/g" | kubectl apply -f -
kubectl annotate ingressclass nginx "ingressclass.kubernetes.io/is-default-class=true"
kubectl -n ingress-nginx wait --timeout=300s --for=condition=Available deployments --all


cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: kubeflow
  name: kubeflow
spec:
  finalizers:
  - kubernetes
---
apiVersion: v1
kind: Secret
metadata:
  name: training-operator-webhook-cert
  namespace: kubeflow
type: Opaque
EOF
# Run k8s api and controller locally. Make sure the controller log has no errors
make install && make run

Open another session:

cat <<EOF | oc create -f -
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kfto-sft
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          restartPolicy: ExitCode
          containers:
            - name: pytorch
              image: quay.io/tedchang/alpine:latest
              imagePullPolicy: Always
              command: [/bin/sh, -c, 'sleep 30 && exit 2']
              resources:
                requests:
                  cpu: 1
                  memory: "200Mi"
    Worker:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          restartPolicy: ExitCode
          containers:
            - name: pytorch
              image: quay.io/tedchang/alpine:latest
              imagePullPolicy: Always
              command: [/bin/sh, -c, 'sleep 15 && exit 2']
              resources:
                requests:
                  cpu: 1
                  memory: "200Mi"
EOF
kubectl -n default wait --timeout=300s --for=condition=Created pytorchjobs kfto-sft
echo "pytorchjobs created"
kubectl -n default wait --timeout=300s --for=condition=Running pytorchjobs kfto-sft
echo "pytorchjobs running"
kubectl -n default wait --timeout=300s --for=condition=Failed pytorchjobs kfto-sft
echo "Job failed as expected"

@tedhtchang commented

@kellyaa you may need to sign the Corporate CLA using your IBM email

@andreyvelich (Member) commented

Thank you for the update!
@kellyaa Please can you address @tenzen-y's comments so we can merge it?

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
. "github.com/onsi/gomega/gstruct"
@tenzen-y (Member):

Suggested change
. "github.com/onsi/gomega/gstruct"

Could you remove this dependency?

@tenzen-y (Member):

Oh, you introduced the IgnoreExtras.
NVM

@kellyaa (Contributor, Author):

I needed it in order to use MatchFields, IgnoreExtras, and Fields. For some reason, ContainElements(BeComparableTo(...)) was not complaining that the fields did not match, even though they appeared to be matching. Open to any other suggestions here.

@tenzen-y (Member):

Hmm, that sounds curious.
Let me try to investigate it.
Anyway, I think that we can merge this for now.

@kellyaa (Contributor, Author):

Thank you!

@tenzen-y (Member) left a comment

@kellyaa Thank you for fixing this!
/lgtm
/approve
/hold for CI

@terrytangyuan (Member) left a comment

Thank you!

/lgtm
/approve

@google-oss-prow commented
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kellyaa, tedhtchang, tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:
  • OWNERS [tenzen-y,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y (Member) commented

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 732fc12 into kubeflow:master Apr 26, 2024
ckyuto pushed a commit to ckyuto/training-operator that referenced this pull request Apr 26, 2024
* Fail job for non-retryable exit codes

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Add test for non-retryable exit code

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Add if nesting, remove manual creation of node in non-retry exit code testing

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Fix broken no-retry exit code test

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Unbreak test for 1.27

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Remove pod creation for noretry exit code test

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

* Make nonretry exit code test more debugable

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

---------

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
ckyuto pushed a commit to ckyuto/training-operator that referenced this pull request Apr 26, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
Successfully merging this pull request may close these issues.

PytorchJob restartPolicy: ExitCode leaves job in a "Restarting" state after a non-retryable error

7 participants