
liyinan926 (Contributor) commented Mar 2, 2018

What changes were proposed in this pull request?

Spark on Kubernetes creates some auxiliary Kubernetes resources, such as the headless driver service used by the executors to connect to the driver and the ConfigMap that carries configuration for the init-container. Such resources are no longer needed once an application completes, but they are still persisted by the API server, so they should be deleted upon completion. This PR handles the case where the submission client waits for the application to complete, which is the default behavior since spark.kubernetes.submission.waitAppCompletion defaults to true.

Xref: apache-spark-on-k8s#520.
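
As a minimal sketch of the client-side approach described above (not the PR's actual code; the client construction, namespace, and resource names below are placeholders), the idea is to block on the status watcher and then delete the auxiliary resources in a finally block:

import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClient}

object CleanupAfterCompletionSketch {

  // Block until the application finishes, then delete the auxiliary resources
  // that the submission client created earlier. Deletion is best-effort.
  def waitAndCleanUp(
      client: KubernetesClient,
      namespace: String,
      driverServiceName: String,
      initConfigMapName: String)(awaitCompletion: () => Unit): Unit = {
    try {
      // Stand-in for LoggingPodStatusWatcher.awaitCompletion(), which returns once
      // the driver pod reaches a terminal state when waitAppCompletion=true.
      awaitCompletion()
    } finally {
      client.services().inNamespace(namespace).withName(driverServiceName).delete()
      client.configMaps().inNamespace(namespace).withName(initConfigMapName).delete()
    }
  }

  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    try {
      waitAndCleanUp(client, "default", "example-driver-svc", "example-init-configmap") {
        () => Thread.sleep(5000) // stand-in for waiting on the real application
      }
    } finally {
      client.close()
    }
  }
}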

How was this patch tested?

Unit tested.

@felixcheung @foxish

SparkQA commented Mar 2, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1232/

SparkQA commented Mar 2, 2018

Test build #87903 has finished for PR 20722 at commit 9890ebc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 2, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1232/

logInfo(s"Waiting for application $appName to finish...")
watcher.awaitCompletion()
logInfo(s"Application $appName finished.")
try {

Contributor commented:

Why are we doing this in client code? Driver shutdown is the right place to perform cleanup, right?

Contributor commented:

(also because this code path isn't invoked in fire-and-forget mode IIUC)

liyinan926 (Contributor, Author) commented Mar 2, 2018:

We talked about this before. The main reason is that this would require giving the driver extra permissions to delete resources, which was not considered a favorable idea. Do you have different thoughts now?

Contributor commented:

I see! It was the RBAC rules and downscoping them that led us here. I'm concerned not all jobs will actually use this interactive mode of launching. What do you think of just granting more permissions to the driver and allowing cleanup there?

liyinan926 (Contributor, Author) commented Mar 2, 2018:

Per the discussion offline, I think the right solution is to move resource management to the driver pod. This way, resource cleanup is guaranteed regardless of the deployment mode and whether the client waits for completion or not.
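
As a hedged illustration of this alternative (not code from this PR; the object and method names are hypothetical), the auxiliary resources could carry an ownerReference pointing at the driver pod, so the Kubernetes garbage collector deletes them whenever the driver pod is deleted:

import io.fabric8.kubernetes.api.model.{HasMetadata, OwnerReferenceBuilder, Pod}

object DriverOwnedResourcesSketch {

  // Attach an ownerReference pointing at the driver pod to each auxiliary resource.
  // The Kubernetes garbage collector then removes these resources together with the
  // driver pod, regardless of deployment mode or whether the client waits for completion.
  def addDriverOwnerReference(driverPod: Pod, resources: Seq[HasMetadata]): Unit = {
    val driverOwnerRef = new OwnerReferenceBuilder()
      .withApiVersion(driverPod.getApiVersion)
      .withKind(driverPod.getKind)
      .withName(driverPod.getMetadata.getName)
      .withUid(driverPod.getMetadata.getUid)
      .withController(true)
      .build()
    resources.foreach(_.getMetadata.getOwnerReferences.add(driverOwnerRef))
  }
}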

felixcheung (Member) commented:
so where are we on this?

liyinan926 (Contributor, Author) commented:
@felixcheung I think we want to move resource management to the driver so that the lifetime of the resources is bound to the lifetime of the driver pod. This works regardless of whether the user chooses to wait for the application to finish. One tricky thing is the creation of the headless driver service: it must be created before the executors, because otherwise the executors would not be able to connect to the driver. I'm trying to figure out how to achieve this ordering guarantee in the KubernetesSchedulerBackend.
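
A minimal sketch of that ordering concern, with hypothetical names and not taken from KubernetesSchedulerBackend or this PR: create the headless service first, and only then create the executor pods that need to reach the driver through it:

import io.fabric8.kubernetes.api.model.{Pod, Service, ServiceBuilder}
import io.fabric8.kubernetes.client.KubernetesClient

object DriverServiceFirstSketch {

  // Create the headless driver service before any executor pod, so that the
  // executors can resolve the driver through the service's DNS name.
  def createDriverServiceThenExecutors(
      client: KubernetesClient,
      namespace: String,
      driverPodName: String,
      driverPort: Int,
      executorPods: Seq[Pod]): Unit = {
    val headlessService: Service = new ServiceBuilder()
      .withNewMetadata()
        .withName(s"$driverPodName-svc")
      .endMetadata()
      .withNewSpec()
        .withClusterIP("None") // headless: resolves directly to the driver pod IP
        .addToSelector("spark-driver-selector", driverPodName)
        .addNewPort()
          .withName("driver-rpc-port")
          .withPort(driverPort)
        .endPort()
      .endSpec()
      .build()

    // Step 1: the service must exist first.
    client.services().inNamespace(namespace).create(headlessService)

    // Step 2: only then create the executor pods.
    executorPods.foreach(pod => client.pods().inNamespace(namespace).create(pod))
  }
}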

liyinan926 closed this Jul 13, 2018