[SPARK-3913] Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and KillApplication APIs #2786
Conversation
When working with Spark in YARN cluster mode, we have the following issues:

1) We don't know YARN's maximum capacity (memory and cores) before we specify the number of executors and the memory for the Spark driver and executors. If we set too big a number, the job can exceed the limit and get killed. It would be better to let the application know the YARN resource capacity ahead of time, so the Spark config can be adjusted dynamically.

2) Once the job has started, we would like some feedback from the YARN application. Currently, the Spark client essentially blocks the call and only returns when the job has finished, failed, or been killed. If the job runs for a few hours, we have no idea how far it has gone: its progress, resource usage, tracking URL, etc. This pull request will not completely solve issue 2, but it does expose the YARN application status, such as when the job is started, killed, or finished, the tracking URL, and some limited progress reporting (on CDH5 we found the progress only reports 0, 10, and 100%). I will open another pull request to address YARN application and Spark job communication; that is not covered here.

3) If we decide to stop the Spark job, the Spark YARN Client exposes a stop method, but in many cases the stop method does not kill the YARN application. So we need to expose the YARN client's killApplication() API to the Spark client.

The proposed change is to modify the Client constructor, changing the type of the first argument from ClientArguments to YarnResourceCapacity => ClientArguments, where YarnResourceCapacity contains YARN's maximum memory and virtual cores as well as the overheads. This allows the application to adjust its memory and core settings accordingly. Existing applications that want to ignore the YarnResourceCapacity can simply wrap their fixed arguments: def toArgs(capacity: YarnResourceCapacity) = new ClientArguments(...)
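The proposed constructor change could look roughly like the following sketch. The types here are simplified stand-ins, not the actual Spark classes, and field names such as `maxMemoryMB` are assumptions made for illustration:

```scala
// Hypothetical sketch of the proposed API change; YarnResourceCapacity and
// ClientArguments below are simplified stand-ins for the real Spark YARN classes.
case class YarnResourceCapacity(maxMemoryMB: Int, maxVirtualCores: Int, memoryOverheadMB: Int)
case class ClientArguments(executorMemoryMB: Int, executorCores: Int)

// The Client would take a YarnResourceCapacity => ClientArguments function
// instead of fixed ClientArguments, so callers size executors after seeing limits.
class Client(argsBuilder: YarnResourceCapacity => ClientArguments) {
  def resolveArgs(capacity: YarnResourceCapacity): ClientArguments = argsBuilder(capacity)
}

// A capacity-aware builder: ask for 8 GB / 4 cores, but never exceed the cluster cap.
val adaptive: YarnResourceCapacity => ClientArguments = cap =>
  ClientArguments(
    executorMemoryMB = math.min(8192, cap.maxMemoryMB - cap.memoryOverheadMB),
    executorCores    = math.min(4, cap.maxVirtualCores))

// Existing applications that ignore the capacity just wrap their fixed arguments:
def toArgs(fixed: ClientArguments): YarnResourceCapacity => ClientArguments = _ => fixed
```

With this shape, an application that previously passed `new ClientArguments(...)` directly passes `toArgs(new ClientArguments(...))` instead and behaves exactly as before.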
We also define a YarnApplicationListener interface that exposes some of the information from YarnApplicationReport. Client.addYarnApplicationListener(listener) allows callers to get callbacks at different states of the application so they can react accordingly. For example, the onApplicationInit() callback is invoked when the AppId is available but the application has not yet started; one can use this AppId to kill the application if the run is no longer desired.
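A minimal sketch of what such a listener might look like. The callback names follow the description above but are assumptions, not the merged Spark API, and `YarnClientStub` only simulates the client-side notification plumbing:

```scala
// Hypothetical sketch of the YarnApplicationListener idea described in this PR.
trait YarnApplicationListener {
  def onApplicationInit(appId: String): Unit                        // appId known, app not yet started
  def onApplicationStart(appId: String, trackingUrl: String): Unit
  def onApplicationProgress(appId: String, progress: Double): Unit  // e.g. only 0, 10, 100% on CDH5
  def onApplicationEnd(appId: String, finalState: String): Unit     // FINISHED, FAILED or KILLED
}

// The client keeps a list of listeners and notifies them as the
// ApplicationReport changes (polling loop omitted for brevity).
class YarnClientStub {
  private var listeners = List.empty[YarnApplicationListener]
  def addYarnApplicationListener(l: YarnApplicationListener): Unit =
    listeners ::= l
  // Called internally when a new report is polled (simplified):
  def fireInit(appId: String): Unit = listeners.foreach(_.onApplicationInit(appId))
  def fireEnd(appId: String, state: String): Unit =
    listeners.foreach(_.onApplicationEnd(appId, state))
}
```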
Can one of the admins verify this patch?
Cool, this is a very good improvement.
What pull request are you referring to here: "I will have another Pull Request to address the Yarn Application and Spark Job communication issue, that's not covered here"? The things you mention are all useful, but perhaps I'm not seeing the bigger picture of how you view these being used. You added an interface via addApplicationListener, but how do you see a user doing that, or how is that tied to the user?
Tom, thanks for reviewing. I am still working on the second PR, which I haven't submitted yet. The code is currently used in our application; I am pulling it out of our codebase to make a PR from it. The current code only uses Akka for the communication, and I would like to add Netty support as well before I submit that pull request, which is why I haven't submitted it yet.

The following are the use cases in our application, which show how the new APIs are used; I assume other applications will have similar use cases. Our application doesn't use the spark-submit command line to run Spark. We submit both Hadoop and Spark jobs directly from our servlet application (Jetty), deploying in YARN cluster mode, and invoke the Spark Client (in the yarn module) directly. The Client can't call System.exit, which would shut down the Jetty JVM. Our application submits and stops Spark jobs, monitors Spark job progress, gets state from the Spark jobs (for example, bad-data counters), and collects logging and exceptions. So far the communication is one-way after the job is submitted; we will move to two-way communication soon (for example, a long-running Spark context with a pipeline of short Spark actions such as distinct counts, sampling, filters, and transformations, where we need feedback on each action for visualization, etc.). In this particular pull request we only address very limited requirements; the next PR will address the rest of the communication mentioned above.
B) We display a progress bar in the UI using the callback. C) We get the YARN application ID when the Spark job is submitted, which can be used to track progress or kill the app. (With the next PR, we will be able to use the tracking URL directly to open the Spark UI page, show Spark job iterations, Spark-specific progress, etc.; currently all of the above are implemented in our application.)
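The appId capture and kill flow in item C could be sketched as follows. All names here are hypothetical; the `killApplication` body only records the id where a real client would delegate to YARN's `YarnClient.killApplication`:

```scala
// Hypothetical sketch: capture the application id from the init callback,
// then use it to kill the app on demand. Simplified stand-in, not Spark code.
class KillableClient {
  @volatile private var appId: Option[String] = None
  private var killed = List.empty[String]

  // Invoked from the listener callback once YARN assigns an id.
  def onApplicationInit(id: String): Unit = { appId = Some(id) }

  // Exposed kill API: would forward to yarnClient.killApplication(id).
  def killApplication(): Boolean = appId match {
    case Some(id) => killed ::= id; true  // real code calls into YARN here
    case None     => false                // nothing submitted yet
  }

  def killedApps: List[String] = killed
}
```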
Hope this gives you a better picture of why this PR is important to us. I will move faster with the next PR mentioned. Thanks
Thanks for the explanation. A couple of things.
So far, we haven't done this yet, as the communication is a one-way push from server to application. But we would like to do something along those lines in our next application release. My next PR would set up the communication channel to enable this possibility.
Technically, the second PR is not directly related to this PR, even though we use both changes together in our application. In a nutshell, the 2nd PR is simply the following:
I am currently working on isolating the Akka piece so that Netty can be used as the communication layer; that way, larger data sizes can be transferred. I will make it configurable so people can plug in other network protocols. Due to our own release schedule, I was not able to work as fast as I hoped, but I hope this gives you a sense of what the overall PR is about.
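A pluggable transport along those lines might be sketched like this. All names are illustrative assumptions (including the config key); nothing here comes from Spark, and `InMemoryTransport` merely stands in for an Akka- or Netty-backed implementation:

```scala
// Hypothetical sketch of a pluggable communication layer for the follow-up PR.
trait AppTransport {
  def send(message: String): Unit
  def name: String
}

// Stand-in backend; a real deployment would register Akka or Netty transports.
class InMemoryTransport extends AppTransport {
  val sent = scala.collection.mutable.ListBuffer.empty[String]
  def send(message: String): Unit = sent += message
  def name: String = "in-memory"
}

// The transport is chosen from configuration, so other protocols can plug in.
def transportFor(conf: Map[String, String]): AppTransport =
  conf.getOrElse("spark.yarn.app.transport", "in-memory") match {
    case "in-memory" => new InMemoryTransport
    case other       => sys.error(s"no transport plugin registered for $other")
  }
```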
Sorry to take so long on this; I went off to work on a Hadoop Kerberos authentication implementation, so I did not get back to this until now.
Can one of the admins verify this patch?
@chesterxgchen thanks for working on this. It seems that major changes have gone into YARN since this was last updated. Would you mind closing this patch for now, since it's unlikely to be merged? Feel free to open an updated one against the same issue later and we can move the discussion there.
No problem, thanks.