fix https://github.com/kubeflow/training-operator/issues/1706#1707
Conversation
cmd/training-operator.v1/main.go
Outdated
| "If set, it only monitors kubeflow jobs in the given namespace.") | ||
| flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+ | ||
| "It can be set to \"0\" to disable the metrics serving.") | ||
| flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.") |
There was a problem hiding this comment.
Can we keep current defaults here(1)?
There was a problem hiding this comment.
Most users are too "lazy" to change the default values, or they wish the default values can handle most situations, I think the default value set to 10 is better?
There was a problem hiding this comment.
This value is highly dependent on deployment environment and user needs. I would recommend to keep it default 1 and user can override it based on their needs. (Ref: https://docs.okd.io/latest/operators/operator_sdk/golang/osdk-golang-tutorial.html#osdk-golang-controller-configs_osdk-golang-tutorial)
There was a problem hiding this comment.
@HeGaoYuan I agree with @johnugeorge. We should keep it default 1 since it would be best to set values as the minimal requirement one as default.
There was a problem hiding this comment.
So far I am still stick on my opinion.
spark-on-k8s-operator is a great project which is similar to training-operator. It use the default value 10 for controllerThreads.( https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/c261df66a00635509f7d8cb2a7fba4c602c9228e/main.go#L56 )
And I think the robustness is more important than minimal requirement for a production-grade project.
Welcome more programmers to convince me of 1 is better than 10. 😁😁
There was a problem hiding this comment.
I was totally agree that the default value should be 1 here, regards as following reasons and more,
- the default value of
controller.Options.MaxConcurrentReconcilesis 1, cf - if we do not make sure that EVERYTHING now and future are completely thread safe, it's better to keep the minimal working configuration as default.
Anyway, keep default as DEFAULT would be better in my personal point of view.
There was a problem hiding this comment.
@HeGaoYuan Can you update the PR?
@johnugeorge done
Pull Request Test Coverage Report for Build 3781033970
💛 - Coveralls |
|
Related: #1708 |
cmd/training-operator.v1/main.go
Outdated
| "If set, it only monitors kubeflow jobs in the given namespace.") | ||
| flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+ | ||
| "It can be set to \"0\" to disable the metrics serving.") | ||
| flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.") |
There was a problem hiding this comment.
Should we set the controllerThreads per each framework since I don't think users create many Jobs using all frameworks simultaneously?
This means we introduce --tfjob-threads, pytorchjob-threads, and so on here instead of the --controller-threads.
@johnugeorge wdyt?
There was a problem hiding this comment.
Users now can enable and disable schemes that supported by the operator instance.
I think we can keep the command line options simple at first and update it to a more sophisticated version when real cases pump in.
There was a problem hiding this comment.
@zw0610 I see; we can select schemes in the following. Thanks for clarifying!
|
|
||
| // SetupWithManager sets up the controller with the Manager. | ||
| func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager) error { | ||
| func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager, controllerThreads int) error { |
There was a problem hiding this comment.
Can we or should we introduce a way to keep the SetupWithManager API, may be extending the configuration struct is better than extending/changing API.
Is that make sense to introduce MaxConcurrentReconciles config to common.JobController ?
What's your opinion? @johnugeorge
There was a problem hiding this comment.
@kuizhiqing Yes. But I think, we need more refactoring. eg: NewReconciler can take common.JobControllerConfiguration as argument and we can add MaxConcurrentReconciles to common.JobControllerConfiguration.
We can do this in a separate PR as well.
e701c9a to
600dffe
Compare
|
Thanks @HeGaoYuan /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: HeGaoYuan, johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Update sdk to v1.6.0rc0 * Update sdk to v1.6.0rc0 * Update sdk to v1.6.0rc0 * Update to latest Image * [Bugfix][SDK] pod has no metadata attr anymore in the get_job_logs() … (#1760) --------- Co-authored-by: Hongzhi Chen <chzyaobaiwei@gmail.com>
Which issue(s) this PR fixes : #1706
Checklist: