add mpi-operator(v1) to the unified operator #1457
google-oss-prow[bot] merged 3 commits into kubeflow:master from
Conversation
|
Hi @hackerboy01. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test |
Force-pushed 9c76664 to a030ab0
terrytangyuan
left a comment
What happened to the previous PR for this?
Force-pushed 4ff9a74 to 82071c3
|
Hi @hackerboy01 Is it ready to review? |
zw0610
left a comment
Generally looks great. We might still need a tiny fix.
//+kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=serviceaccount,verbs=get;list;watch;create;delete
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="rbac.authorization.k8s.io",resources=roles,verbs=get;list;watch;create;delete
To support the elastic feature (horovod-elastic), we also need permission to update roles: https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/v1/mpi_job_controller.go#L777
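A hedged sketch of what the amended marker could look like, adding `update` (and `patch`) to the roles rule; the exact verb set needed by the elastic path is an assumption:

```go
// Sketch only: the roles rule extended with update/patch so the controller
// can modify an existing Role when an elastic job scales (assumed verb set).
//+kubebuilder:rbac:groups="rbac.authorization.k8s.io",resources=roles,verbs=get;list;watch;create;update;patch;delete
```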
utilruntime.HandleError(fmt.Errorf("couldn't get jobKey for job object %#v: %v", mpijob, err))
}
replicaTypes := util.GetReplicaTypes(mpijob.Spec.MPIReplicaSpecs)
needReconcile := util.SatisfiedExpectations(r.Expectations, jobKey, replicaTypes)
The SatisfiedExpectations function only checks Pods and Services, which is sufficient for the other job APIs. However, I wonder if MPIJob needs a more sophisticated SatisfiedExpectations implementation.
@alculquicondor you mean mpi-operator never checked expectations?
I mean, did the author check @zw0610's questions?
return err
}

// inject watching for job related service
I think for mpi-controller.v1 there won't be a job-related Service. Instead, ServiceAccount, Role, RoleBinding, and other resources may need to be watched.
I don't think the controller works properly without solving Wang's comment.
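A sketch of how the watches could be wired with controller-runtime's builder, assuming the reconciler is registered via a `SetupWithManager` method; the exact set of owned types here is an assumption drawn from the comment above, not from the PR:

```go
// Sketch only: watch resources the MPIJob controller owns instead of Services.
// Assumes sigs.k8s.io/controller-runtime (ctrl), core/rbac API packages, and
// the mpiv1 API package are imported.
func (r *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mpiv1.MPIJob{}).
		Owns(&corev1.Pod{}).
		Owns(&corev1.ConfigMap{}).      // hostfile / discovery script
		Owns(&corev1.ServiceAccount{}). // launcher service account
		Owns(&rbacv1.Role{}).           // worker-access role
		Owns(&rbacv1.RoleBinding{}).
		Complete(r)
}
```

Each `Owns` call enqueues the owning MPIJob whenever one of these child objects changes, which is what "inject watching" amounts to in the reconciler mode.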
Force-pushed fa323e5 to d57a6e4
terrytangyuan
left a comment
Great work! Let's make sure we don't introduce additional enhancements and keep this PR almost like copy-paste from the existing controller code. This way it's easy to track/review changes.
|
/cc @alculquicondor |
|
can you also move the unit tests please? |
Force-pushed 75dfdb3 to da115b2
msg := fmt.Sprintf("MPIReplicaSpec is not valid: Image is undefined in the container of %v", rType)
return fmt.Errorf(msg)
}
// if container.Name == mpiv1.DefaultContainerName {
@alculquicondor We will move the unit tests in the next pr.
Please remove this commented code if it's not actually needed.
fix: controller-gen fix:fmt
@hackerboy01 I think you'll want to include the tests in this PR. Otherwise, we have no way to verify the correctness of the migrated code. |
As we changed from the controller mode to the reconciler mode, I would suggest only moving unit tests which do not use the old controller directly. Tests written against the old controller mode do not verify the correctness of this new reconciler. For the next PR, we can follow suite_test.go and tensorflow_test.go and add unit tests for the reconciler. What do you think? @terrytangyuan Meanwhile, it seems we also need to add more tests to the tensorflow reconciler as well. @Jeffwan |
|
Why adding tests in the next PR instead of having it as part of this PR? |
Because we lack a good example among existing reconcilers. Anyone could add tests for tf/pytorch reconciler so mpi reconciler can follow the path? |
Yes, the unit tests need to be adapted, just like the code was adapted. But we shouldn't add such a significant amount of code without tests in the same PR. The contributor could decide to leave and then we have untested code to maintain. |
Pull Request Test Coverage Report for Build 1460365743
💛 - Coveralls |
|
We just converted all tests for the v1 controller from mpi-operator to this PR.
|
LGTM. Thank you for adding the tests! /cc @alculquicondor Would you like to take another look before we merge? |
|
Can we move this pr forward? @terrytangyuan |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hackerboy01, terrytangyuan. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
|
Sorry, I was on vacation. I'll take a look regardless and hopefully we can fix any problems in a follow up. |
|
@hackerboy01 are you going to be migrating the v2 operator as well? |
Let's start the discussion on v2 migration in #1479 instead of on this merged PR. |
alculquicondor
left a comment
I don't think I have more capacity to look at the v1 controller changes, but I'm not convinced that this PR went through a proper review cycle. There are a handful of important open questions that were never looked at.
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
I don't recall a need for this either
@@ -0,0 +1,38 @@
// Copyright 2019 The Kubeflow Authors
// Package v1 contains API Schema definitions for the kubeflow.org v1 API group
//+kubebuilder:object:generate=true
//+groupName=kubeflow.org
package v1
why do we have a doc.go file if there is already this comment here?
}

//mpijob not need delete services
func (r *MPIJobReconciler) DeletePodsAndServices(runPolicy *commonv1.RunPolicy, job interface{}, pods []*corev1.Pod) error {
// reconcileServices checks and updates services for each given ReplicaSpec.
// It will requeue the job in case of an error while creating/deleting services.
// mpijob not need services
func (jc *MPIJobReconciler) ReconcileServices(
do we need to implement this function?
return true
}

r.Scheme.Default(mpiJob)
I'm not convinced that this is safe. If a user changes an MPIJob before the controller had a chance to update the job, I think the defaults would be cleared. And maybe there could be a race condition in the client cache, but I'm not sure how the cache is implemented in controller-runtime.
cc @Jeffwan who did the same for tf-operator
@alculquicondor
I did the same thing for tf. Currently it's still using a lightweight solution; the ideal way is to move the logic into webhooks. If the user changes the job, the event sequence for the same object is guaranteed, but since they are different kinds of events handled by different methods, I would say there's a chance of getting into a race condition:
T1: onOwnerCreateFunc is invoked and it tries to set the default RunPolicy
T2: onUpdate() is invoked and it updates similar fields
T3: In the end, onOwnerCreateFunc sets defaults.
Honestly, I have not played with this case. I agree that even if updates won't happen that fast, the case is still there theoretically. Is there a better way to handle this without a webhook? Maybe we can move most of the logic to Defaulting: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#defaulting
|
@hackerboy01 and I have reviewed the comments above. Fixes will follow in pull requests this week. |
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
Could you remind me what CRD needs to be created?
Uhm... I wouldn't expect the controller to create CRDs
}

if err = validation.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
Agree. We should return and put the object back into the queue; it's pointless to move it forward.
Another thing: I think the original validation logic is somewhat outdated. Personally I don't see too many specific fields to validate. With the latest controller-tools, I think most of these invalid cases would be blocked by CRD validation.
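As a sketch of the CRD-validation route mentioned here: kubebuilder markers on the API type generate OpenAPI validation, so the apiserver rejects invalid objects before the controller ever sees them. The field and marker choices below are assumptions for illustration, not the actual MPIJob schema:

```go
// Sketch only: declarative validation via kubebuilder markers instead of
// hand-written checks in the reconciler. Markers shown are illustrative.
type MPIJobSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=1
	SlotsPerWorker *int32 `json:"slotsPerWorker,omitempty"`

	// +kubebuilder:validation:Required
	MPIReplicaSpecs map[ReplicaType]*ReplicaSpec `json:"mpiReplicaSpecs"`
}
```

Checks that depend on relationships between fields (for example, an image being set inside a container of a given replica type) still need code or a validating webhook; markers cover the per-field cases.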
|
@hackerboy01 Great work. Please help address the comments from @alculquicondor @zw0610 and me. Let's create a reliable and stable version for the users. |