[DCOS-34549] Mesos label NPE fix #60

rpalaznik · 2019-05-29T13:41:01Z

What changes were proposed in this pull request?

Fixes NullPointerException that prevents SparkException being properly logged during task scheduling in Mesos.
Adds label format validation for properties spark.mesos.network.labels, spark.mesos.task.labels, and spark.mesos.driver.labels in dispatcher.

How was this patch tested?

Manual testing

… the error is logged

samvantran

Thanks @rpalaznik. I left some comments around simplifying the validate method and some small format nits but overall I like the approach. One extra thing - can we add a unit test? Especially since this is meant to fix an NPE bug.

samvantran · 2019-05-29T20:58:02Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

  }
+
+  private[mesos] def validateLabelFormat(properties: Map[String, String]
+                                       , propertyNames: String*): Unit = {


strange place for a comma; surprised there's no linter here. Can this be:

validateLabelFormat(properties: Map[String, String], propertyNames: String*): Unit = {

samvantran · 2019-05-29T21:41:32Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

    val driverCores = sparkProperties.get("spark.driver.cores")
    val name = request.sparkProperties.getOrElse("spark.app.name", mainClass)

+    validateLabelFormat(sparkProperties, "spark.mesos.network.labels", "spark.mesos.task.labels",


is this a finite list? If you know ahead of time what the list of labels to validate against are, I think you should move it inside the function so all you pass is the sparkProperties map e.g.

validateLabelFormat(sparkProperties)

changed as suggested

samvantran · 2019-05-29T21:47:11Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

+
+  private[mesos] def validateLabelFormat(properties: Map[String, String]
+                                       , propertyNames: String*): Unit = {
+    propertyNames.foreach { name =>


as mentioned above, if the labels don't change, I think we can remove the propertyNames arg and do a simple

List("spark.mesos.network.labels", "spark.mesos.task.labels", ...).foreach { =>

changed as suggested

samvantran · 2019-05-29T21:52:23Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

+          MesosProtoUtils.mesosLabels(label)
+        } catch {
+          case _ => throw new SubmitRestProtocolException("Malformed label in " +
+            f"${name}: ${label}")


small nit but string interpolation in Scala is usually done w/ s"${variable}" while f" is usually done w/ types and digits

changed as suggested

…ide the function itself

samvantran

Changes look good @rpalaznik ! Thanks for addressing them. I had a minor comment on test naming but not a blocker in any form.

@akirillov can give the final stamp of approval but this LGTM

samvantran · 2019-05-31T22:25:04Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

+            // If there are fields that the server does not know about, warn the client
+            s.unknownFields = unknownFields
+          }
+          s


I know this isn't you but man... single char variable names are the worst!

samvantran · 2019-05-31T22:28:46Z

...mesos/src/test/scala/org/apache/spark/deploy/rest/mesos/MesosSubmitRequestServletSuite.scala

    assert("k0:v0,k1:v1" == driverConf.get("spark.mesos.network.labels"))
  }
+
+  test("test a job with malformed labels is not submitted") {


can you change to test a job with malformed labels throws exception?

akirillov

thanks, @rpalaznik. LGTM, but please address the comments.

akirillov · 2019-06-03T04:51:49Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

    }
  }
+
+  private[mesos] def validateLabelFormat(properties: Map[String, String]): Unit = {


let's change the method name into a plural form: validateLabelsFormat

akirillov · 2019-06-03T05:05:07Z

resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala

-          // If there are fields that the server does not know about, warn the client
-          s.unknownFields = unknownFields
+        try {
+          val driverDescription = buildDriverDescription(submitRequest)


While exceptions have been used in buildDriverDescription before, this approach doesn't look quite right. When we have a method returning value or throwing an error it's a natural use case for Scala's Either or Try. On the other hand, there're no usages of Either or Try in mesos package. On the other hand, having try-catch block here is a good defensive approach which can save us from surprises.

Given the status of the codebase for Mesos Scheduler in general, a more severe refactoring is needed so I'd say, let's keep the current code as is.

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler * File Based Secrets support * [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver - Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published. * [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs * [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33) * [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate * [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)" * Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) # Conflicts: # dev/deps/spark-deps-hadoop-2.6 # dev/deps/spark-deps-hadoop-2.7 # dev/deps/spark-deps-hadoop-3.1 # pom.xml * CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44) * Spark Dispatcher support for launching applications in the same virtual network by default (#45) * [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46) This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job. * Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled` * [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49) * [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51) * [DCOS-51158] Improved Task ID assignment for Executor tasks (#52) * [DCOS-51454] Remove irrelevant Mesos REPL test (#54) * [DCOS-51453] Added Hadoop 2.9 profile (#53) * [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55) * Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58) * [DCOS-34549] Mesos label NPE fix (#60)

* Support for DSCOS_SERVICE_ACCOUNT_CREDENTIAL environment variable in MesosClusterScheduler * File Based Secrets support * [SPARK-723][SPARK-740] Add Metrics to Dispatcher and Driver - Counters: The total number of times that submissions have entered states - Timers: The duration from submit or launch until a submission entered a given state - Histogram: The retry counts at time of retry * Fixes to handling finished drivers - Rename 'failed' case to 'exception' - When a driver is 'finished', record its final MesosTaskState - Fix naming consistency after seeing how they look in practice * Register "finished" counters up-front Otherwise their values are never published. * [SPARK-692] Added spark.mesos.executor.gpus to specify the number of Executor CPUs * [SPARK-23941][MESOS] Mesos task failed on specific spark app name (#33) * [SPARK-23941][MESOS] Mesos task failed on specific spark app name Port from SPARK#21014 ** edit: not a direct port from upstream Spark. Changes were needed because we saw PySpark jobs fail to launch when 1) run with docker and 2) including --py-files ============== * Shell escape only appName, mainClass, default and driverConf Specifically, we do not want to shell-escape the --py-files. What we've seen IRL is that for spark jobs that use docker images coupled w/ python files, the $MESOS_SANDBOX path is escaped and results in FileNotFoundErrors during py4j.SparkSession.getOrCreate * [DCOS-39150][SPARK] Support unique Executor IDs in cluster managers (#36) Using incremental integers as Executor IDs leads to a situation when Spark Executors launched by different Drivers have same IDs. This leads to a situation when Mesos Task IDs for multiple Spark Executors are the same too. This PR prepends UUID unique for a CoarseGrainedSchedulerBackend instance to numeric ID thus allowing to distinguish Executors belonging to different drivers. This PR reverts commit ebe3c7f "[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n…)" * Upgrade of Hadoop, ZooKeeper, and Jackson libraries to fix CVEs. Updates for JSON-related tests. (#43) List of upgrades for 3rd-party libraries having CVEs: - Hadoop: 2.7.3 -> 2.7.7. Fixes: CVE-2016-6811, CVE-2017-3166, CVE-2017-3162, CVE-2018-8009 - Jackson 2.6.5 -> 2.9.6. Fixes: CVE-2017-15095, CVE-2017-17485, CVE-2017-7525, CVE-2018-7489, CVE-2016-3720 - ZooKeeper 3.4.6 -> 3.4.13 (https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html) * CNI Support for Docker containerizer, binding to SPARK_LOCAL_IP instead of 0.0.0.0 to properly advertise executors during shuffle (#44) * Spark Dispatcher support for launching applications in the same virtual network by default (#45) * [DCOS-46585] Fix supervised driver retry logic for outdated tasks (#46) This commit fixes a bug where `--supervised` drivers would relaunch after receiving an outdated status update from a restarted/crashed agent even if they had already been relaunched and running elsewhere. In those scenarios, previous logic would cause two identical jobs to be running and ZK state would only have a record of the latest one effectively orphaning the 1st job. * Revert "[SPARK-25088][CORE][MESOS][DOCS] Update Rest Server docs & defaults." This reverts commit 1024875. The change introduced in the reverted commit is breaking: - breaks semantics of `spark.master.rest.enabled` which belongs to Spark Standalone Master only but not to SparkSubmit - reverts the default behavior for Spark Standalone from REST to legacy RPC - contains misleading messages in `require` assertion blocks - prevents users from running jobs without specifying `spark.master.rest.enabled` * [DCOS-49020] Specify user in CommandInfo for Spark Driver launched on Mesos (#49) * [DCOS-40974] Mesos checkpointing support for Spark Drivers (#51) * [DCOS-51158] Improved Task ID assignment for Executor tasks (#52) * [DCOS-51454] Remove irrelevant Mesos REPL test (#54) * [DCOS-51453] Added Hadoop 2.9 profile (#53) * [DCOS-34235] spark.mesos.executor.memoryOverhead equivalent for the Driver when running on Mesos (#55) * Refactoring of metrics naming to add mesos semantics and avoid clashing with existing Spark metrics (#58) * [DCOS-34549] Mesos label NPE fix (#60)

Roman Palaznik added 2 commits May 28, 2019 17:04

DCOS-34549: Fix NPE that happens when handling SparkException, before…

a49d0c6

… the error is logged

DCOS-34549: Adds label properties validation during submission

42661e9

rpalaznik requested review from akirillov, alembiewski, farhan5900 and samvantran May 29, 2019 13:41

samvantran suggested changes May 29, 2019

View reviewed changes

Roman Palaznik added 5 commits May 30, 2019 14:28

DCOS-34549: Removes stack trace logging for malformed label error

7840a6f

DCOS-34549: Adds valid format explanation.

62da58d

DCOS-34549: Moved the list of label properties from the arguments ins…

8e90028

…ide the function itself

DCOS-34549: the rest of suggested changes

d9e546f

DCOS-34549: adds a unit tests, fixes a warning

c8511d5

rpalaznik requested a review from samvantran May 31, 2019 17:12

samvantran approved these changes May 31, 2019

View reviewed changes

akirillov approved these changes Jun 3, 2019

View reviewed changes

DCOS-34549: Pluralize method name

26eb21a

rpalaznik merged commit 5adf2e4 into d2iq-archive:custom-branch-2.4.x Jun 3, 2019

alembiewski pushed a commit that referenced this pull request Jun 12, 2019

[DCOS-34549] Mesos label NPE fix (#60)

b5a9366

alembiewski mentioned this pull request Jun 13, 2019

[DCOS-54813] Base tech update from 2.4.0 to 2.4.3 #62

Merged

farhan5900 mentioned this pull request Oct 17, 2020

[DCOS-34549] Mesos label NPE fix #73

Closed

[DCOS-34549] Mesos label NPE fix #60

[DCOS-34549] Mesos label NPE fix #60

Uh oh!

Conversation

rpalaznik commented May 29, 2019

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

samvantran left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

samvantran left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akirillov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants