
Conversation

@caneGuy
Contributor

@caneGuy caneGuy commented Sep 5, 2017

…NING

What changes were proposed in this pull request?

When the driver quits abnormally, the executors are shut down and their task metrics can no longer be sent to the driver for updating. In this case the task status stays 'RUNNING' forever, and the duration shown on the history UI is 'CurrentTime - launchTime', which grows indefinitely.
We can fix this by using the modification time of the event log, since this time is already available when FsHistoryProvider fetches the event log from the file system.
A screenshot of the result is uploaded in SPARK-21922.
How to reproduce:
(1) Submit a job to Spark on YARN.
(2) Mock an OOM (or another condition that makes the driver quit abnormally) scenario for the driver.
(3) Make sure an executor is running a task when the driver quits.
(4) Open the history server and check the result.
It is not a corner case, since there are many such jobs in our current cluster.

How was this patch tested?

Deployed the history server and opened a job that has this problem.
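The fix described above can be sketched as follows. This is a minimal illustration, not the actual Spark code; the object name and signature are simplified assumptions. The idea is that a task which never reported completion gets its duration capped at the event log's last-modified time instead of the ever-advancing wall clock.

```scala
// Minimal sketch of the idea behind this fix (hypothetical, simplified API):
// for a task stuck in RUNNING, use the event log's last-modified time from
// the file system instead of System.currentTimeMillis(), so the displayed
// duration stops growing after the application dies.
object DurationSketch {
  def taskDuration(launchTime: Long,
                   finishTime: Option[Long],
                   lastUpdateTime: Option[Long]): Long = finishTime match {
    // Task completed normally: use its real finish time.
    case Some(finish) => finish - launchTime
    // Task never reported completion (e.g. driver died): fall back to the
    // event log's timestamp; a live UI passes None and keeps using the clock.
    case None =>
      lastUpdateTime.getOrElse(System.currentTimeMillis()) - launchTime
  }
}
```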

@ajbozarth
Member

Though this may fix the problem, it is 100% the wrong way to do it, for multiple reasons:

  1. The UI should not touch the FS; FS access is abstracted out so the UI can work with multiple FS implementations.
  2. This fix assumes it's running in the History Server, in a class used by both the Web UI and the SHS.

Without a deeper look at the issue, I would instead look at why the problem is occurring and fix that, rather than putting a bandaid on the UI like this. (Note: I've done this before and got the same response, nothing personal :)

I would recommend closing this and opening a new PR once you figure out a new solution.

@caneGuy
Contributor Author

caneGuy commented Sep 6, 2017

Thanks for your recommendation @ajbozarth. Could you post a link to your PR? As for the problems you mentioned, I have thought about them:
1. FsHistoryServer will always use the FS to get event logs.
2. For the Spark UI, my implementation will not access the FS.

@caneGuy
Contributor Author

caneGuy commented Sep 6, 2017

@ajbozarth I have updated the implementation; it now only accesses the FS in FSHistoryServerProvider.

Member

@ajbozarth ajbozarth left a comment


Thank you for the update, this is a much better implementation. Just one nit, otherwise LGTM.

  attempt = uiData.taskInfo.attemptNumber,
  launchTime = new Date(uiData.taskInfo.launchTime),
- duration = uiData.taskDuration,
+ duration = uiData.taskDuration(),
Member


this is unrelated and unnecessary

Contributor Author

@caneGuy caneGuy Sep 6, 2017


Since I have changed taskDuration below, a compiler error will occur if I do not add () here.
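For context, the change being discussed can be sketched as below (hypothetical, heavily simplified types; the real TaskUIData carries much more state). taskDuration becomes a method taking an optional last-update time, which is why existing call sites now need parentheses.

```scala
// Hypothetical simplified version of the change under discussion:
// taskDuration is now a method with an optional lastUpdateTime parameter,
// so former field-style call sites must add ().
class TaskUIData(val launchTime: Long, val finishTime: Option[Long]) {
  def taskDuration(lastUpdateTime: Option[Long] = None): Option[Long] =
    // A finished task ignores lastUpdateTime; a still-RUNNING task falls
    // back to it, or yields None when no value is available (live UI).
    finishTime.orElse(lastUpdateTime).map(_ - launchTime)
}
```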

Member


odd, I thought for sure it'd be fine, then this LGTM

Contributor Author


Thanks @ajbozarth

Contributor


What if we call the REST API on the history server to get stage info? It looks like we may still have this issue, since we don't have the last update time there. What do you think @ajbozarth?

Contributor Author


You are right, @jerryshao. IIUC, the ui in AllStagesResource.scala is passed from ApiRootResource, which also creates the SparkUI via FSHistoryProvider. So we can also get lastUpdateTime from this ui in AllStagesResource and pass it to the taskDuration interface. Is that a separate problem for the REST API, or should we fix it here?

Contributor


Yes, if it is not a big change I think it should be fixed here, because currently, with this fix, the UI and the REST API are inconsistent.

@caneGuy
Contributor Author

caneGuy commented Sep 6, 2017

Ping @kiszk @jerryshao, can you help review this?

@caneGuy
Contributor Author

caneGuy commented Sep 8, 2017

@vanzin @zsxwing could you help review this? Thanks.

@jerryshao
Contributor

Is this a problem only in the History UI, or does it also affect the Live UI? From my understanding you only pass a last update time for the history UI; is that intended?

Also, you mentioned "When executor failed and task metrics have not send to driver, the status will always be 'RUNNING'". Is this a bug in the scheduler?

@caneGuy
Contributor Author

caneGuy commented Sep 8, 2017

It only appears on the history UI, since the live UI uses the heartbeat metric results @jerryshao (from my understanding). It appears when the driver exits abnormally; then the status in the task info cannot be updated correctly.

@jerryshao
Contributor

Your PR description is quite confusing; would you please elaborate on your problem in detail and describe how to reproduce the issue.

@caneGuy
Contributor Author

caneGuy commented Sep 11, 2017

Done @jerryshao

@jerryshao
Contributor

I still have a question about the history server: was your event log incomplete or complete when you hit this issue?

@caneGuy
Contributor Author

caneGuy commented Sep 11, 2017

@jerryshao Thanks for your time. IIUC, the event log is complete, since the driver did not drop any events from the executor with the problem described above. See below; the driver only dropped two events after shutting down:

2017-09-11,17:08:57,605 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(300,WrappedArray())
2017-09-11,17:08:57,623 INFO org.apache.spark.deploy.yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
2017-09-11,17:08:57,624 INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down all executors
2017-09-11,17:08:57,624 INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
2017-09-11,17:08:57,629 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(258,WrappedArray())

And below are screenshots showing an executor with this problem:
(screenshots attached: running, running02)

This case is from a job running today on our cluster.

@jerryshao
Contributor

ok to test.

Contributor

@jerryshao jerryshao left a comment


Can you please make sure everything is OK in Live UI?

  val operationGraphListener: RDDOperationGraphListener,
  var appName: String,
  val basePath: String,
+ val lastUpdateTime: Long = -1L,
Contributor


I would like to use Option[Long] = None as the default value, to reflect that there is no update time.
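The suggestion can be sketched as below (illustrative names, not the real Spark classes): Option[Long] = None makes the "no last update time" case explicit at the type level, unlike a -1L sentinel that every call site must remember to check.

```scala
// Sketch of the reviewer's suggestion (hypothetical, simplified context):
// model the absence of a last update time with Option[Long] = None rather
// than a magic -1L value.
case class UIContext(lastUpdateTime: Option[Long] = None)

// A live UI (None) falls back to the wall clock; a history UI supplies the
// event log's timestamp explicitly.
def effectiveEndTime(ctx: UIContext): Long =
  ctx.lastUpdateTime.getOrElse(System.currentTimeMillis())
```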

Contributor Author

@caneGuy caneGuy Sep 12, 2017


Updated, @jerryshao. Thanks for your time.

@jerryshao
Contributor

Looks like I don't have the Jenkins permission to trigger UT 😞 . Let me ping @srowen to trigger the test.

@caneGuy
Contributor Author

caneGuy commented Sep 12, 2017

Yes, I have confirmed. Below is a testing job: @jerryshao
(screenshots attached: _001, _002, _003)

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

@jerryshao, for triggering tests on Jenkins, I think this has to be added manually by an admin as well. In my case, I asked Josh Rosen about this via email privately. I am quite sure you are facing the same issue that I (and Holden, Felix and Takuya) met before, if I understood correctly.

@jerryshao
Contributor

Thanks @HyukjinKwon , I will ping Josh about this thing 😄 .

@SparkQA

SparkQA commented Sep 12, 2017

Test build #81678 has finished for PR 19132 at commit 25fe22c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@caneGuy
Contributor Author

caneGuy commented Sep 12, 2017

@jerryshao @HyukjinKwon Jenkins is done. Thanks so much.

@jerryshao
Contributor

Overall LGTM, @ajbozarth can you please review again?

@ajbozarth
Member

I've been following, still LGTM

  } yield {
-   AllStagesResource.stageUiToStageData(status, stageInfo, stageUiData, includeDetails = false)
+   AllStagesResource.stageUiToStageData(
+     status, stageInfo, stageUiData, includeDetails = false, Some(ui))
Contributor

@jerryshao jerryshao Sep 13, 2017


It's not a good design to pass in the SparkUI only to get lastUpdate; the API looks weird with this extra SparkUI argument, and the fix here only just makes it work. It would be better to add one more field to StageUIData or TaskUIData if possible.
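The design preferred here can be sketched as follows (hypothetical, heavily simplified types; the real StageUIData holds many metrics): carry the timestamp on the UI data object itself, so the REST layer needs no SparkUI reference just to read one value.

```scala
// Sketch of the preferred design (illustrative, not the real Spark class):
// store lastUpdateTime on the stage's own UI data.
class StageUIDataSketch {
  var lastUpdateTime: Option[Long] = None
}

// The conversion layer then reads the timestamp from the data it already
// has in hand, instead of reaching back into the SparkUI.
def convertDuration(launchTime: Long, stage: StageUIDataSketch): Option[Long] =
  stage.lastUpdateTime.map(_ - launchTime)
```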

@SparkQA

SparkQA commented Sep 13, 2017

Test build #81709 has finished for PR 19132 at commit 560d442.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2017

Test build #81710 has finished for PR 19132 at commit f0e1ded.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  var memoryBytesSpilled: Long = _
  var diskBytesSpilled: Long = _
  var isBlacklisted: Int = _
+ var jobLastUpdateTime: Option[Long] = None
Contributor

@jerryshao jerryshao Sep 14, 2017


Would it be better to rename this to stageLastUpdateTime, or just lastUpdateTime? Since this structure is unrelated to jobs, it would be better not to involve "job".

  @DefaultValue("ID") @QueryParam("sortBy") sortBy: TaskSorting): Seq[TaskData] = {
  withStageAttempt(stageId, stageAttemptId) { stage =>
-   val tasks = stage.ui.taskData.values.map{AllStagesResource.convertTaskData}.toIndexedSeq
+   val tasks = stage.ui.taskData.values.map{
Contributor


The style should be changed to map { AllStagesResource.convertTaskData(_, ui.lastUpdateTime) }; whitespace is required after { and before }. You can check other similar code for the style.

  val taskData = if (includeDetails) {
-   Some(stageUiData.taskData.map { case (k, v) => k -> convertTaskData(v) } )
+   Some(stageUiData.taskData.map { case (k, v) =>
+     k -> convertTaskData(v, stageUiData.lastUpdateTime) } )
Contributor

@jerryshao jerryshao Sep 14, 2017


Sorry, I may not have stated it clearly; it should be:

Some(stageUiData.taskData.map { case (k, v) =>
  k -> convertTaskData(v, stageUiData.lastUpdateTime) })

No whitespace after "}".

Contributor Author


Thanks for your patience. Actually, I did not check carefully.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81764 has finished for PR 19132 at commit 8ec4319.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81768 has finished for PR 19132 at commit 06035f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor

LGTM, merging to master, and to 2.2 if possible.

@asfgit asfgit closed this in 4b88393 Sep 14, 2017
@jerryshao
Contributor

Cannot cleanly merge to 2.2, so this will only land to master branch.

@caneGuy
Contributor Author

caneGuy commented Sep 14, 2017

@jerryshao Thanks.

@caneGuy caneGuy deleted the zhoukang/fix-duration branch September 25, 2017 12:07