Status: Closed

Commits (27), showing changes from 1 commit:
3f8321a - Integration of ProcessTreeMetrics with PR 21221 (Jul 26, 2018)
cd16a75 - Changing the position of ptree and also make the computation configur… (Aug 7, 2018)
94c2b04 - Seperate metrics for jvm, python and others and update the tests (Aug 8, 2018)
062f5d7 - Update JsonProtocolSuite (Sep 25, 2018)
245221d - [SPARK-24958] Add executors' process tree total memory information to… (Oct 2, 2018)
c72be03 - Adressing most of Imran's comments (Oct 3, 2018)
8f3c938 - Fixing the scala style and some minor comments (Oct 3, 2018)
f2dca27 - Removing types from the definitions where ever possible (Oct 4, 2018)
a9f924c - Using Utils methods when possible or use ProcessBuilder (Oct 5, 2018)
a11e3a2 - make use of Utils.trywithresources (Oct 5, 2018)
34ad625 - Changing ExecutorMericType and ExecutorMetrics to use a map instead o… (Oct 9, 2018)
415f976 - Changing ExecutorMetric to use array instead of a map (Oct 10, 2018)
067b81d - A small cosmetic change (Oct 10, 2018)
18ee4ad - Merge branch 'master' of https://github.com/apache/spark into ptreeme… (Oct 17, 2018)
7f7ed2b - Applying latest review commments. Using Arrays instead of Map for ret… (Oct 23, 2018)
f3867ff - Merge branch 'master' of https://github.com/apache/spark into ptreeme… (Nov 5, 2018)
0f8f3e2 - Fix an issue with jsonProtoclSuite (Nov 5, 2018)
ea08c61 - Fix scalastyle issue (Nov 5, 2018)
8f20857 - Applying latest review comments (Nov 14, 2018)
6e65360 - Using the companion object and other stuff (Nov 27, 2018)
4659f4a - Update the use of process builder and applying other review comments (Nov 28, 2018)
ef4be38 - Small style fixes based on reviews (Nov 30, 2018)
805741c - Applying review comments, mostly style related (Nov 30, 2018)
4c1f073 - emove the unnecessary trywithresources (Nov 30, 2018)
0a7402e - Applying the comment about error handling and some more style fixes (Dec 4, 2018)
3d65b35 - Removing a return (Dec 6, 2018)
6eab315 - Reordering of info in a test resource file to avoid confusion (Dec 6, 2018)

Commit a9f924c5943d6ed45e38a1c5aadd07045adbe138
Using Utils methods when possible or use ProcessBuilder
Reza Safi committed Oct 5, 2018
ProcfsBasedSystems.scala
@@ -24,10 +24,10 @@ import java.util.Locale

import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
- import scala.collection.mutable.Queue

- import org.apache.spark.SparkEnv
+ import org.apache.spark.{SparkEnv, SparkException}
import org.apache.spark.internal.{config, Logging}
+ import org.apache.spark.util.Utils

private[spark] case class ProcfsBasedSystemsMetrics(
jvmVmemTotal: Long,
@@ -41,6 +41,7 @@ private[spark] case class ProcfsBasedSystemsMetrics(
// project.
private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extends Logging {
val procfsStatFile = "stat"
+ val testing = sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
var pageSize = computePageSize()
var isAvailable: Boolean = isProcfsAvailable
private val pid = computePid()

Contributor:
pageSize is only a var for testing -- instead just optionally pass it in to the constructor

also I think all of these can be private.

Contributor Author:
I think I can't call computePageSize() in the constructor signature to compute the default value. Another solution is to check for testing inside computePageSize and, if we are testing, assign it a value provided in the constructor (defaulting to 4096).

Contributor:
You can't put it as a default value, but if you make it a static method, then you can provide an overloaded method which uses it, see squito@cf00835

But I think your other proposal is even better: if it's testing, just give it a fixed value (no need to even make it an argument to the constructor at all).
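
As a minimal sketch of that "fixed value under testing" idea (the 4096L default and the exact shape are assumptions, not necessarily what was merged; `testing` and `Utils` are the ones already in this diff):

private def computePageSize(): Long = {
  if (testing) {
    // Under testing, skip the external getconf call entirely and use a
    // typical page size so the value is deterministic.
    return 4096L
  }
  val out = Utils.executeAndGetOutput(Seq("getconf", "PAGESIZE"))
  // getconf prints a single line, e.g. "4096"
  Integer.parseInt(out.split("\n")(0))
}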

@@ -57,7 +58,6 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
computeProcessTree()

private def isProcfsAvailable: Boolean = {
- val testing = sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
if (testing) {
return true
}
@@ -77,40 +77,37 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
}

private def computePid(): Int = {
- if (!isAvailable) {
+ if (!isAvailable || testing) {
return -1;
}
try {
// This can be simplified in java9:
// https://docs.oracle.com/javase/9/docs/api/java/lang/ProcessHandle.html
val cmd = Array("bash", "-c", "echo $PPID")
- val length = 10
- val out = Array.fill[Byte](length)(0)
- Runtime.getRuntime.exec(cmd).getInputStream.read(out)
- val pid = Integer.parseInt(new String(out, "UTF-8").trim)
+ val out2 = Utils.executeAndGetOutput(cmd)

Contributor:
can be out instead of out2

+ val pid = Integer.parseInt(out2.split("\n")(0))
return pid;
}
catch {
- case e: IOException => logDebug("IO Exception when trying to compute process tree." +
+ case e: SparkException => logDebug("IO Exception when trying to compute process tree." +

Contributor:
why only SparkException, not any Exception? also the msg shouldn't say "IO Exception".

should probably be logWarn

Contributor Author:
Let me double check. I thought there was a comment before that I should just catch SparkException, but you are right, it doesn't make sense. Probably a mistake on my side. I was just handling IOException here.

Contributor Author:
Oh, it seems there wasn't a mistake here and I just forgot the reason. I caught SparkException since executeAndGetOutput may throw such an exception. I will remove the IOException.

Contributor:
well, executeAndGetOutput might throw a SparkException ... but are you sure nothing else will get thrown? E.g. what if you get some weird output and then the Integer.parseInt fails? Is there some reason you wouldn't want the same error handling for any exception here?

Contributor Author:
At first I was catching all throwables. Then I thought it could be dangerous, and there was also a review comment about that. So I'm not sure what the correct way of handling this is: is it better to handle only the exceptions that we know can be thrown, or to catch all throwables?

Contributor:
there's a distinction between Throwable and Exception -- Throwable includes Errors which are fatal to the JVM, where you probably can't do anything.

In general it's a good question whether you should catch specific exceptions or everything. Here, you're calling an external program, and I don't feel super confident that we know how it always behaves, so I think we should be a little extra cautious. An unhandled exception here would lead to not sending any heartbeats, which would be really bad. Except for JVM errors, I think we just want to turn off this particular metric and keep going.

Contributor:
found the old comment from @mccheah:

Catching Throwable is generally scary, can this mask out of memory and errors of that sort? Can we scope down the exception type to handle here?

I think this (partially) agrees with what I said above: we don't want to catch Throwable because that can mask other stuff where the JVM is hosed. But I still think Exception is the right thing to catch. Sound ok, @mccheah?

If you really do want more specific exceptions, we should look through this more carefully to come up with a more exhaustive list; e.g. I certainly don't want to fail the heartbeater because we don't get an int out of the external call for some reason.
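
To make the discussion concrete, here is a sketch of the "catch anything non-fatal" pattern using scala.util.control.NonFatal, which matches any Throwable except fatal JVM errors; this illustrates the pattern being debated, not the exact code that landed in the PR:

import scala.util.control.NonFatal

private def computePid(): Int = {
  if (!isAvailable || testing) {
    return -1
  }
  try {
    val cmd = Array("bash", "-c", "echo $PPID")
    val out = Utils.executeAndGetOutput(cmd)
    Integer.parseInt(out.split("\n")(0))
  } catch {
    case NonFatal(e) =>
      // Covers SparkException from executeAndGetOutput, NumberFormatException
      // from parseInt, and anything else non-fatal, without masking JVM Errors.
      // An unhandled exception here would stop executor heartbeats, so we
      // disable this metric and keep going instead.
      logWarning("Exception when trying to compute process tree." +
        " As a result reporting of ProcessTree metrics is stopped.", e)
      isAvailable = false
      -1
  }
}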

" As a result reporting of ProcessTree metrics is stopped", e)
isAvailable = false
return -1
}
}

private def computePageSize(): Long = {
- val testing = sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
if (testing) {
return 0;
}
val cmd = Array("getconf", "PAGESIZE")
- val out = Array.fill[Byte](10)(0)
- Runtime.getRuntime.exec(cmd).getInputStream.read(out)
- return Integer.parseInt(new String(out, "UTF-8").trim)
+ val out2 = Utils.executeAndGetOutput(cmd)
+ return Integer.parseInt(out2.split("\n")(0))
}

private def computeProcessTree(): Unit = {
- if (!isAvailable) {
+ if (!isAvailable || testing) {
return
}
val queue = mutable.Queue.empty[Int]
@@ -131,18 +128,26 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
private def getChildPids(pid: Int): ArrayBuffer[Int] = {
try {
val cmd = Array("pgrep", "-P", pid.toString)
- val input = Runtime.getRuntime.exec(cmd).getInputStream
- val childPidsInByte = mutable.ArrayBuffer.empty[Byte]
- var d = input.read()
- while (d != -1) {
- childPidsInByte.append(d.asInstanceOf[Byte])
- d = input.read()
+ val builder = new ProcessBuilder("pgrep", "-P", pid.toString)
+ val process = builder.start()
+ val output = new StringBuilder()
+ val threadName = "read stdout for " + "pgrep"
+ def appendToOutput(s: String): Unit = output.append(s).append("\n")
+ val stdoutThread = Utils.processStreamByLine(threadName,
+ process.getInputStream, appendToOutput)
+ val exitCode = process.waitFor()
+ stdoutThread.join()
+ // pgrep will have exit code of 1 if there are more than one child process
+ // and it will have a exit code of 2 if there is no child process
+ if (exitCode != 0 && exitCode > 2) {
+ logError(s"Process $cmd exited with code $exitCode: $output")
+ throw new SparkException(s"Process $cmd exited with code $exitCode")
}
- input.close()
- val childPids = new String(childPidsInByte.toArray, "UTF-8").split("\n")
+ val childPids = output.toString.split("\n")
val childPidsInInt = mutable.ArrayBuffer.empty[Int]
for (p <- childPids) {
if (p != "") {
+ logInfo("Found a child pid: " + p)
childPidsInInt += Integer.parseInt(p)
}
}
@@ -155,7 +160,7 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
}
}

- def getProcessInfo(pid: Int): Unit = {
+ def computeProcessInfo(pid: Int): Unit = {
/*
* Hadoop ProcfsBasedProcessTree class used regex and pattern matching to retrive the memory
* info. I tried that but found it not correct during tests, so I used normal string analysis
@@ -188,7 +193,7 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
latestOtherRSSTotal += rssPages }
}
} catch {
- case f: FileNotFoundException => log.debug("There was a problem with reading" +
+ case f: FileNotFoundException => logDebug("There was a problem with reading" +
" the stat file of the process", f)
}
}
@@ -210,7 +215,7 @@ private[spark] class ProcfsBasedSystems(val procfsDir: String = "/proc/") extend
latestOtherRSSTotal = 0
latestOtherVmemTotal = 0
for (p <- pids) {
- getProcessInfo(p)
+ computeProcessInfo(p)

Contributor:
the state used here is a little trickier than it needs to be.

computeProcessTree is updating a member variable, even though it's only used locally -- it would be easier to follow if instead it just returned the process tree, and then you passed it around. Also I don't think you actually care about the tree, just the set of pids?

similarly for allMetrics: it doesn't really need to be a member variable, since its use is entirely contained within this function, you could just pass it around:

val pids = discoverPids()
var allMetrics = ...
for (p <- pids) {
  allMetrics = updateMetricsForProcess(allMetrics, p)
}

Contributor Author:
The tree was there in case we want to do some other stuff with it, but I guess we can add a tree structure when we actually need it. Right now, as you mentioned, we don't need it, so I will change it.
The allMetrics was there for testing, but I can change the test anyway.
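
A sketch of the stateless shape being suggested here; discoverPids and addMetricsForProcess are hypothetical names, and the six zeroed fields assume the ProcfsBasedSystemsMetrics case class from the top of this diff:

private def computeAllMetrics(): ProcfsBasedSystemsMetrics = {
  // Collect the pids once, locally, instead of updating a member variable.
  val pids: Set[Int] = discoverPids()
  // Fold per-process metrics into a local accumulator rather than mutating
  // latestJVMVmemTotal and friends on the instance.
  var allMetrics = ProcfsBasedSystemsMetrics(0, 0, 0, 0, 0, 0)
  for (p <- pids) {
    allMetrics = addMetricsForProcess(allMetrics, p)
  }
  allMetrics
}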

}
ProcfsBasedSystemsMetrics(
getJVMVirtualMemInfo,
ProcfsBasedSystemsSuite.scala
@@ -26,13 +26,13 @@ class ProcfsBasedSystemsSuite extends SparkFunSuite {
p.pageSize = 4096L

test("testGetProcessInfo") {
- p.getProcessInfo(26109)
+ p.computeProcessInfo(26109)
assert(p.getJVMVirtualMemInfo == 4769947648L)
assert(p.getJVMRSSInfo == 262610944)
assert(p.getPythonVirtualMemInfo == 0)
assert(p.getPythonRSSInfo == 0)

- p.getProcessInfo(22763)
+ p.computeProcessInfo(22763)
assert(p.getPythonVirtualMemInfo == 360595456)
assert(p.getPythonRSSInfo == 7831552)
assert(p.getJVMVirtualMemInfo == 4769947648L)