
Conversation

@sagor999
Contributor

Hi,

This pull request adds node affinity support.
Added documentation.
Added e2e test.
All tests complete successfully locally.

Please let me know if anything is missing; I would like to get this to the finish line, as I have seen numerous requests and PRs that did not make it.
Just going to reference them here so that once this is accepted, they all can be closed:
#975
#689
#493
#420
#209

@sagor999
Contributor Author

Travis CI is failing with this:

hack/../pkg is out of date. Please run hack/update-codegen.sh
The command "hack/verify-codegen.sh" exited with 1.

I ran it locally and it works fine, and I don't see any local changes.
Do I need to do anything to make CI work with this change?

@FxKu
Member

FxKu commented Oct 14, 2020

Thanks @sagor999 for pushing this topic forward. I hope I can find some time soon to do a review.

@FxKu FxKu added this to the 1.6 milestone Oct 14, 2020
@FxKu
Member

FxKu commented Oct 19, 2020

Can you run ./hack/update-codegen.sh and commit the changes? Then Travis should pass, I think.

@sagor999
Contributor Author

@FxKu hm, I just ran that script, but it did not produce any committable changes.

./hack/update-codegen.sh 
Generating deepcopy funcs
Generating clientset for acid.zalan.do:v1 at github.com/zalando/postgres-operator/pkg/generated/clientset
Generating listers for acid.zalan.do:v1 at github.com/zalando/postgres-operator/pkg/generated/listers
Generating informers for acid.zalan.do:v1 at github.com/zalando/postgres-operator/pkg/generated/informers

Which is a bit odd. Do I need to do anything else?

@FxKu
Member

FxKu commented Oct 20, 2020

Hm, not the first time I'm hearing this. It also doesn't work for one of the committers. Anyway, you can take it from here: https://gist.github.com/FxKu/2eaae658c2c2fd2e4d11ccdeb28cb07f

I ran the codegen and it produced changes. Paste them under pkg/api/acid.zalan.do/v1.

@sagor999
Contributor Author

@FxKu CI is now passing. :)

@Jan-M
Member

Jan-M commented Oct 26, 2020

@sagor999 thanks for the PR and for including tests.

For now I just want to note that in the future we should start a discussion early on about how the Postgres manifest should change. In my opinion, mapping through all these API objects is not the right way to go.

E.g. we could have aligned on a field for node affinity that just takes labels, keeping the "manifest" slim and not overblown. The type of affinity could have been some default.

nodeAffinity: "node-type=i3,reserved-for=postgres" would be something that comes to mind to specify affinity to node labels.

Or in your case simply: nodeAffinity: "environment=pci"

I want to hide some complexity of the K8s configuration from the user of our manifests.

@sagor999
Contributor Author

@Jan-M That would have been nodeSelector, which does support what you are describing, for example:

nodeSelector:
  node-type: i3
  reserved-for: postgres

Which is internally mapped into nodeAffinity by Kubernetes.
nodeAffinity does give you more flexibility, though.
There was some discussion about deprecating nodeSelector from the Kubernetes API, but I just checked and it seems like that idea was shut down.

So I think we can still support nodeAffinity as an advanced feature, and maybe add another PR that adds support for the simpler nodeSelector as well?
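
For comparison, here is a sketch of the same environment=pci constraint expressed with the plain Kubernetes nodeAffinity syntax (standard pod spec fields, shown only to illustrate the difference in verbosity and flexibility):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: environment
          operator: In
          values:
          - pci

This is roughly what the scheduler derives from a plain nodeSelector of environment: pci, while nodeAffinity additionally allows operators like NotIn or Exists and preferred (soft) rules.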

@Jan-M
Member

Jan-M commented Oct 26, 2020

Interesting discussion. From your point of view, are you really looking for the full complexity of node affinity?

I think it is good that the operator creates the "node affinity"; the question should just be what is defined on the "user" side. Sometimes it can be helpful to limit the user to the useful options. But if in the end we rebuild the full affinity complexity again, we are better off just supporting it the way you did.

In the end, the topic of moving from nodeSelector to nodeAffinity also highlights that this could have been hidden from the user: what was originally nodeSelector in the Postgres manifest could just be translated to affinity.

@ReSearchITEng
Contributor

ReSearchITEng commented Nov 14, 2020

Hi,
I could not find any nodeSelector parameter in the CRD either.
I am not sure how to make a production deployment, as there is currently neither nodeSelector nor affinity support. We see this as a very important feature, as its absence is blocking us from reaching production.

Regarding nodeSelector vs. affinity, I was checking what others are doing:
ElasticSearch: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-advanced-node-scheduling.html#k8s_node_affinity -> both nodeSelector and affinity (more exactly, it exposes the pod spec)
CouchBase: https://docs.couchbase.com/operator/current/reference-couchbasecluster.html#spec-servers-pod-spec-nodeselector -> only nodeSelector, but note they have additional concepts like ServerGroups which bring some extra functionality on top.

Production deployment scenarios become complex relatively quickly, and that is probably one of the reasons affinity was introduced: it supports more complex logical conditions: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
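
For example (a sketch using the standard Kubernetes syntax; the label keys and values are only illustrative), affinity allows set-based operators and combined conditions that a plain nodeSelector cannot express, such as "schedule on i3 or i4 nodes that are not spot instances":

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: In
          values:
          - i3
          - i4
        - key: lifecycle
          operator: NotIn
          values:
          - spot

Expressions within one term are ANDed and multiple nodeSelectorTerms are ORed, which covers most of the scenarios described in the Kubernetes documentation above.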

If I were to choose which would bring more production support, I would go with affinity.
(As for keeping pg simple: these are optional fields and won't clutter the pg definition when they are not required.)

@FxKu
Member

FxKu commented Nov 17, 2020

@sagor999 can you rebase with master to resolve the conflict? We made some changes in the e2e tests.

@sagor999 sagor999 force-pushed the add_node_affinity branch 3 times, most recently from 3410849 to ec2cdbe on November 18, 2020 04:27
@sagor999
Contributor Author

@FxKu done. Do you have an idea as to when you will merge this PR?

@sagor999
Contributor Author

@FxKu updated the e2e test and removed the hard-coded sleep from it.

@sagor999
Contributor Author

@FxKu I think I addressed all issues. Let me know if there is anything else.

@sagor999
Contributor Author

@FxKu this test has failed: test_zz_node_readiness_label. But it ran fine locally on my machine. Is this test known to be flaky? Should I submit an empty change to re-trigger the tests?

return nil
}

func testNodeAffinity(cluster *Cluster, podSpec *v1.PodTemplateSpec) error {
Member

@FxKu FxKu Dec 16, 2020


Can you add a unit test where this function is actually used? 😃

Contributor Author


Facepalm. 🤣
Fixed.
btw, I found there are a bunch of functions like that in that file that are not used in actual tests.

Member

@FxKu FxKu Dec 16, 2020


Thanks. Yes, we noticed that, too, unfortunately 😃

@FxKu
Member

FxKu commented Dec 16, 2020

@sagor999 we found that your addition to the unit tests is only a function. It's not called from a test case (t *testing.T).

@sagor999
Contributor Author

@FxKu fix submitted.

@Jan-M
Member

Jan-M commented Dec 16, 2020

👍

@Jan-M
Member

Jan-M commented Dec 16, 2020

Thank you @sagor999 for the quick responses to our not-so-quick responses. This allows us to put it into the 1.6 release we are trying to wrap up now.

@FxKu
Member

FxKu commented Dec 16, 2020

👍

@FxKu FxKu merged commit 77252e3 into zalando:master Dec 16, 2020
@sagor999 sagor999 deleted the add_node_affinity branch December 16, 2020 13:58