Fail job for non-retryable exit codes #2071
google-oss-prow[bot] merged 8 commits into kubeflow:master
Conversation
Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
Pull Request Test Coverage Report for Build 8850191138

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
tenzen-y
left a comment
Thank you for creating this PR!
Could you add another test case here: https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/pkg/controller.v1/tensorflow/pod_test.go?
I think this is a good example to follow when implementing another case: https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/pkg/controller.v1/tensorflow/pod_test.go#L212-L275
pkg/controller.v1/common/pod.go
```go
} else if pod.Status.Phase == v1.PodFailed &&
	(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
	logger.Infof("Pod has a non-retryable exit code. Failing job. %v %v", pod.Namespace, pod.Name)
	msg := fmt.Sprintf("job %s is failing because %s replica(s) failed.",
		metaObject.GetName(), rType)
	jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
	commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
```
Could you improve debuggability?
```diff
-} else if pod.Status.Phase == v1.PodFailed &&
-	(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
-	logger.Infof("Pod has a non-retryable exit code. Failing job. %v %v", pod.Namespace, pod.Name)
-	msg := fmt.Sprintf("job %s is failing because %s replica(s) failed.",
-		metaObject.GetName(), rType)
-	jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
-	commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
+} else if pod.Status.Phase == v1.PodFailed &&
+	(spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode)) {
+	logger.Infof("Pod %q has a non-retryable exit code. Failing job.", klog.KObj(pod))
+	msg := fmt.Sprintf("job %q is failing because %q replica(s) failed.",
+		metaObject.GetName(), rType)
+	jc.Recorder.Event(runtimeObject, v1.EventTypeWarning, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
+	commonutil.UpdateJobConditions(jobStatus, apiv1.JobFailed, v1.ConditionTrue, commonutil.NewReason(jobKind, commonutil.JobFailedReason), msg)
```
Additionally, could you add another level of nesting?

```go
if pod.Status.Phase == v1.PodFailed {
	failedPodsCount.Inc()
	if spec.RestartPolicy == apiv1.RestartPolicyExitCode && trainutil.IsRetryableExitCode(exitCode) ||
		spec.RestartPolicy == apiv1.RestartPolicyOnFailure ||
		spec.RestartPolicy == apiv1.RestartPolicyAlways {
		// Existing code
	} else if spec.RestartPolicy == apiv1.RestartPolicyExitCode && !trainutil.IsRetryableExitCode(exitCode) {
		// New code
	}
}
```
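For context on the `trainutil.IsRetryableExitCode` check that this branching relies on: the common convention is that exit codes above 128 mean the process was terminated by a signal (e.g. 137 = 128 + SIGKILL) and are therefore treated as transient, while codes 1-127 are application errors treated as permanent. The following is a minimal stand-alone sketch of that kind of classification; the helper name and the 128+ threshold here are illustrative assumptions, not necessarily the training-operator's exact implementation:

```go
package main

import "fmt"

// isRetryableExitCode is an illustrative stand-in for trainutil.IsRetryableExitCode.
// Assumption: exit codes > 128 indicate the container was killed by a signal
// (transient, worth retrying), while 1-128 are application errors (permanent).
func isRetryableExitCode(exitCode int32) bool {
	return exitCode > 128
}

func main() {
	for _, code := range []int32{1, 2, 127, 137, 143} {
		fmt.Printf("exit code %d retryable: %v\n", code, isRetryableExitCode(code))
	}
}
```

Under RestartPolicy `ExitCode`, a non-retryable code fails the job immediately instead of restarting the pod, which is the behavior this PR adds.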
@kellyaa Were you able to find time to implement the integration test for your change?

@andreyvelich Working on it as we speak and hope to get it in today. Please let me know if this is holding up the train.

Thanks @kellyaa
Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
Test case added and updated with improved logging statements. Ready for re-review!

… testing Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

@kellyaa The new test case does not behave as expected. So, could you fix that problem?
Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
Resubmitted! Thanks for your patience as I ramp up on this code!

/test all
Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
/test all

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>

/test all

@kellyaa Please rebase your PR to get the latest fix for CI

/test all
```go
Eventually(func() bool {
	updatedJob := &kubeflowv1.TFJob{}
	err := testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)
	if err != nil {
		return false
	}
	for _, condition := range updatedJob.Status.Conditions {
		if condition.Type == kubeflowv1.JobFailed && condition.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}, testutil.Timeout, testutil.Interval).Should(BeTrue(), "TFJob should be in Failed state")
```
```diff
-Eventually(func() bool {
-	updatedJob := &kubeflowv1.TFJob{}
-	err := testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)
-	if err != nil {
-		return false
-	}
-	for _, condition := range updatedJob.Status.Conditions {
-		if condition.Type == kubeflowv1.JobFailed && condition.Status == corev1.ConditionTrue {
-			return true
-		}
-	}
-	return false
-}, testutil.Timeout, testutil.Interval).Should(BeTrue(), "TFJob should be in Failed state")
+Eventually(func(g Gomega) {
+	updatedJob := &kubeflowv1.TFJob{}
+	g.Expect(testK8sClient.Get(ctx, types.NamespacedName{Name: tfJob.GetName(), Namespace: metav1.NamespaceDefault}, updatedJob)).Should(Succeed())
+	g.Expect(updatedJob.Status.Conditions).Should(ContainElements(
+		BeComparableTo(metav1.Condition{
+			Type:   kubeflowv1.JobFailed,
+			Status: corev1.ConditionTrue,
+		}, cmpopts.IgnoreFields(metav1.Condition{}, "LastTransitionTime", "Reason", "Message", "ObservedGeneration"), "TFJob should be in Failed state"),
+	))
+}, testutil.Timeout, testutil.Interval).Should(Succeed())
```
Could you improve debuggability like this?
```go
		},
	},
})
Expect(testK8sClient.Status().Update(ctx, created))
```
```diff
-Expect(testK8sClient.Status().Update(ctx, created))
+Expect(testK8sClient.Status().Update(ctx, created)).Should(Succeed())
```
```go
_ = reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)
```
```diff
-_ = reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)
+Expect(reconciler.ReconcileJobs(tfJob, tfJob.Spec.TFReplicaSpecs, tfJob.Status, &tfJob.Spec.RunPolicy)).Should(Succeed())
```
I've noticed that if I use `Expect` here when testing locally, the test fails because the reconciler tries to recreate a pod that already exists. If I take this line out completely, it still succeeds on my laptop; however, it fails in the 1.27.1 GH Action tests.
Oh, I see. Thank you for sharing that.
Indeed, our tests have some unintended and hard-to-understand issues...
So, let's keep your implementation for now.
/lgtm

Verified with the following:

```shell
git clone --single-branch --branch master https://github.com/kellyaa/training-operator.git
cd training-operator
export INGRESS_NGINX_VERSION=controller-v1.9.6

# Set up KinD
echo "Creating KinD cluster"
kind delete cluster -n training-operator-cluster
cat <<EOF | kind create cluster --name training-operator-cluster --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.25.3@sha256:f52781bc0d7a19fb6c405c2af83abfeb311f130707a0e219175677e366cc45d1
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
EOF

echo "Deploying Ingress controller into KinD cluster"
curl https://raw.githubusercontent.com/kubernetes/ingress-nginx/"${INGRESS_NGINX_VERSION}"/deploy/static/provider/kind/deploy.yaml | sed "s/--publish-status-address=localhost/--report-node-internal-ip-address\\n - --status-update-interval=10/g" | kubectl apply -f -
kubectl annotate ingressclass nginx "ingressclass.kubernetes.io/is-default-class=true"
kubectl -n ingress-nginx wait --timeout=300s --for=condition=Available deployments --all

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: kubeflow
  name: kubeflow
spec:
  finalizers:
  - kubernetes
---
apiVersion: v1
kind: Secret
metadata:
  name: training-operator-webhook-cert
  namespace: kubeflow
type: Opaque
EOF

# Run the k8s API server and controller locally. Make sure the controller log has no errors.
make install && make run
```
Open another session:
```shell
cat <<EOF | oc create -f -
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kfto-sft
  namespace: default
spec:
  pytorchReplicaSpecs:
    Master:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          restartPolicy: ExitCode
          containers:
          - name: pytorch
            image: quay.io/tedchang/alpine:latest
            imagePullPolicy: Always
            command: [/bin/sh, -c, 'sleep 30 && exit 2']
            resources:
              requests:
                cpu: 1
                memory: "200Mi"
    Worker:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          restartPolicy: ExitCode
          containers:
          - name: pytorch
            image: quay.io/tedchang/alpine:latest
            imagePullPolicy: Always
            command: [/bin/sh, -c, 'sleep 15 && exit 2']
            resources:
              requests:
                cpu: 1
                memory: "200Mi"
EOF

kubectl -n default wait --timeout=300s --for=condition=Created pytorchjobs kfto-sft
echo "pytorchjobs created"
kubectl -n default wait --timeout=300s --for=condition=Running pytorchjobs kfto-sft
echo "pytorchjobs running"
kubectl -n default wait --timeout=300s --for=condition=Failed pytorchjobs kfto-sft
echo "Job failed as expected"
```
Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
```go
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
. "github.com/onsi/gomega/gstruct"
```
```diff
-. "github.com/onsi/gomega/gstruct"
```
Could you remove this dependency?
Oh, you introduced the IgnoreExtras.
NVM
I needed it in order to use `MatchFields`, `IgnoreExtras`, and `Fields`. For some reason `ContainElements(BeComparableTo(...))` was not complaining that the fields did not match, even though they appeared to be matching. Open to any other suggestions here.
Hmm, that sounds curious.
Let me try to investigate it.
Anyway, I think that we can merge this for now.
terrytangyuan
left a comment
Thank you!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: kellyaa, tedhtchang, tenzen-y, terrytangyuan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
* Fail job for non-retryable exit codes
* Add test for non-retryable exit code
* Add if nesting, remove manual creation of node in non-retry exit code testing
* Fix broken no-retry exit code test
* Unbreak test for 1.27
* Remove pod creation for non-retry exit code test
* Make non-retry exit code test more debuggable

Signed-off-by: Kelly A <kellyaa@users.noreply.github.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Fixes #2044