Closed
Hi, everyone.
I want to test the fault tolerance of PyTorchJob.
I started a PyTorchJob with 1 master and 3 workers.
$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   10.10.10.1   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   10.10.10.2   11.71.1.161
mnist-ddp-worker-1   1/1     Running   0          2m55s   10.10.10.3   11.71.1.161
mnist-ddp-worker-2   1/1     Running   0          2m55s   10.10.10.4   11.71.1.162

It trains fine.
Then I deleted a worker.
$ kubectl delete pod mnist-ddp-worker-1

Since I set restartPolicy: OnFailure, the pod restarts quickly with the same name, mnist-ddp-worker-1.
But sadly, I don't see the restarted worker rejoin the DDP training.
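
For context, here is a minimal sketch of how a pod in a job like this typically sets up DDP. The training script itself is not shown in this post, so everything below is an assumption: the helper names `Rendezvous`, `rendezvous_from_env`, and `init_ddp` are hypothetical, and the sketch assumes the script uses the `env://` init method with `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` provided in the pod environment. If the process group is formed once at startup like this, that may be related to why a restarted pod does not simply slot back in, but I have not confirmed it.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Rendezvous:
    """Values the env:// init method reads (hypothetical helper type)."""
    master_addr: str
    master_port: int
    rank: int
    world_size: int


def rendezvous_from_env(env: Mapping[str, str] = os.environ) -> Rendezvous:
    # Assumption: these four variables are present in the pod environment.
    return Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )


def init_ddp(backend: str = "gloo") -> Rendezvous:
    # Imported lazily so the env parsing above can be checked without PyTorch.
    import torch.distributed as dist

    rdzv = rendezvous_from_env()
    # With env://, rank and world size are taken from the environment.
    # The call waits for the other processes before returning.
    dist.init_process_group(backend=backend, init_method="env://")
    return rdzv
```

Printing the result of `rendezvous_from_env()` at startup in each pod would at least show whether the restarted mnist-ddp-worker-1 sees the same rank and master address as before it was deleted.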
Thanks.