PyTorchJob DDP training stops if I delete a worker pod #1478

Hi, everyone.

I want to test the fault tolerance of PyTorchJob.

I started a PyTorchJob with 1 master and 3 workers:
```shell
$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   10.10.10.1   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   10.10.10.2   11.71.1.161
mnist-ddp-worker-1   1/1     Running   0          2m55s   10.10.10.3   11.71.1.161
mnist-ddp-worker-2   1/1     Running   0          2m55s   10.10.10.4   11.71.1.162
```
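
A manifest for this kind of setup would look roughly like the sketch below. Only the job name `mnist-ddp` (inferred from the pod names), the replica counts, and `restartPolicy: OnFailure` come from this report; the image is a placeholder and the rest of the pod template is an assumption, not the exact file used.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp                  # inferred from the pod names above
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # container name the operator looks for
              image: example.com/mnist-ddp:latest   # placeholder image (assumption)
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/mnist-ddp:latest   # placeholder image (assumption)
```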

Training runs fine.

Then I deleted a worker.

```shell
$ kubectl delete pod mnist-ddp-worker-1
```

Because I set `restartPolicy: OnFailure`, the pod is recreated quickly with the same name, `mnist-ddp-worker-1`.

However, the recreated worker never rejoins the DDP training, and training stops.

Thanks.
