@JoshRosen
Contributor

This PR adds two new pages to the Spark Web UI:

  • A jobs overview page, which shows details on running / completed / failed jobs.
  • A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at /stages.

Screenshots

New UI homepage

image

Job details page

(This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)

image

Key changes in this PR

  • Rename JobProgressPage to AllStagesPage
  • Expose StageInfo objects in the SparkListenerJobStart event; add backwards-compatibility tests to JsonProtocol.
  • Add additional data structures to JobProgressListener to map from stages to jobs.
  • Add several fields to JobUIData.

I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.

Limitations

If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.

If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.
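
To make the first limitation concrete, here is a minimal sketch, assuming the bar is computed as completed tasks over total tasks. `JobProgress` and its fields are hypothetical names used only for illustration; they are not Spark's `JobUIData`.

```scala
// Hypothetical stand-in for per-job counters; not Spark's actual JobUIData.
final case class JobProgress(completedTasks: Int, totalTasks: Int) {
  // Percentage shown by the bar; an empty job is treated as fully done.
  def percent: Double =
    if (totalTasks == 0) 100.0 else 100.0 * completedTasks / totalTasks
}

object ProgressExample {
  // Two stages of 10 tasks each are counted up front, but one of the stages
  // is never run, so only 10 tasks ever complete.
  val finishedJob: JobProgress = JobProgress(completedTasks = 10, totalTasks = 20)
}
```

Under these assumptions `ProgressExample.finishedJob.percent` is 50.0 even though the job itself has finished.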

This will avoid confusion once we have a page that lists all jobs.

In some situations, the jobs page may display "No information on stage" for
some stages of pending jobs that have not begun executing.  This is due to some
limitations in how JobProgressListener finds out about stages.  We can address
this later as part of a separate scheduler PR.
@JoshRosen changed the title from "[SPARK-4145] [WIP] Web UI job pages" to "[SPARK-4145] Web UI job pages" on Oct 30, 2014
@SparkQA

SparkQA commented Oct 30, 2014

Test build #22506 has finished for PR 3009 at commit 4487dcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

@pwendell @andrewor14 @kayousterhout Could one of you review this?

Contributor

This comment should be updated to say "jobs" instead of "stages" right?

Contributor Author

Good catch.

@kayousterhout
Contributor

@JoshRosen this looks great!! One thing that seems missing here is some kind of status bar on the jobs page. I know there was a bunch of debate on what an appropriate status bar would be; can you do something simple, like completed stages / total stages?

@JoshRosen
Contributor Author

@kayousterhout I like the progress bar idea and I'm working on implementing it now.

@JoshRosen
Contributor Author

Actually, there's one subtlety: a single stage might be shared by multiple jobs, so when we get a StageCompletion event we don't know which job triggered that stage. Therefore, we don't know which jobInfo structure to update.

I do have a job -> stage mapping, so I suppose that I could use this to query the stageInfo structures to see whether they've succeeded or failed, but this risks weird behavior where a recomputation of a stage in a later job causes an earlier job's progress bar to regress. Is there a simple solution here?
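
The regression risk described above can be sketched as follows. This is only an illustration under assumed names (`jobToStages`, `completedStages`, `StageStatus`); it is not the data structure used in `JobProgressListener`.

```scala
object SharedStageExample {
  sealed trait StageStatus
  case object Succeeded extends StageStatus
  case object Running extends StageStatus

  // Hypothetical job -> stage mapping: stage 0 is shared by jobs 0 and 1.
  val jobToStages: Map[Int, Set[Int]] = Map(0 -> Set(0, 1), 1 -> Set(0, 2))

  // Reverse lookup: which jobs reference a given stage?
  def jobsForStage(stageId: Int): Set[Int] =
    jobToStages.collect { case (jobId, stages) if stages.contains(stageId) => jobId }.toSet

  // Progress derived by querying the *current* status of each stage in the job.
  def completedStages(jobId: Int, status: Map[Int, StageStatus]): Int =
    jobToStages(jobId).count(stageId => status.get(stageId).contains(Succeeded))
}
```

If stage 0 flips from `Succeeded` back to `Running` because job 1 recomputes it, `completedStages(0, ...)` drops from 2 to 1, so job 0's bar would move backwards even though job 0 is unaffected.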

@kayousterhout
Contributor

Hm this seems like a more general problem right? Like what if you have:

Job A:

RDD 1 ----(shuffle dependency)----> RDD 2 ----(second shuffle dependency)--> RDD 3

Job B:

RDD 1 ----(shuffle dependency)----> RDD 4

Now say:
- RDD 1 finishes successfully. Both job UI pages get updated.
- RDD 2 in job A finishes successfully; Job A's page gets updated.
- One of the machines crashes, causing a fetch failure in RDD 4. Now the stage for RDD 1 needs to be recomputed.

The re-computation should really only show up in Job B's UI, because the re-computation is no longer relevant to job A (it's home free because RDD2 already finished).

Is this uncommon enough that we should just leave the weird re-computation behavior?

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22965 has finished for PR 3009 at commit 4b206fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22966 has finished for PR 3009 at commit 45343b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

One approach, which I'm exploring now, is to show a combined progress bar that just tracks "total number of tasks" summed across all stages involved in that job. To prevent weird "progress bar travels back in time" issues, I'm planning to extend JobUIData to track these task counts. I'll add a stage attempt -> job id mapping to track which job is associated with the current stage and use this to figure out which counters need to be updated when tasks finish.
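
A rough sketch of that idea, assuming each stage attempt is attributed to exactly one job when it is submitted. `StageAttempt` and `TaskCountTracker` are hypothetical names, and the real listener would need to handle many more events.

```scala
import scala.collection.mutable

// Hypothetical names throughout; this is not JobProgressListener's real code.
final case class StageAttempt(stageId: Int, attemptId: Int)

final class TaskCountTracker {
  // Completed-task counter per job id.
  private val completedByJob = mutable.Map.empty[Int, Int]
  // Stage attempt -> id of the job that triggered that attempt.
  private val attemptToJob = mutable.Map.empty[StageAttempt, Int]

  def onStageSubmitted(attempt: StageAttempt, jobId: Int): Unit =
    attemptToJob(attempt) = jobId

  // Only the job that owns this attempt has its counter incremented.
  def onTaskSuccess(attempt: StageAttempt): Unit =
    attemptToJob.get(attempt).foreach { jobId =>
      completedByJob(jobId) = completedByJob.getOrElse(jobId, 0) + 1
    }

  def completedTasks(jobId: Int): Int = completedByJob.getOrElse(jobId, 0)
}
```

Because a recomputation is a new attempt mapped to the later job, its task completions never touch the earlier job's counter, which is what keeps the earlier bar from moving.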

@kayousterhout
Contributor

This sounds great. I do think it would be good to get this job page change
in ASAP, since it's a major, user-visible change and it would be good to
have it in master so folks can play with it for a few weeks before the
release is cut (assuming you're trying to get this in 1.2?). This is
already definitely pushing it, I think, in terms of what should be merged
in late in the release cycle. This is all to say that, at this point, I
think it might be better to get this in as-is (without the progress bar)
and add the progress bar later.

@JoshRosen
Contributor Author

Agreed. This change actually involves some potentially costly listener operations, so it would be great to get this in soon so it can get lots of testing / review.

I'm actually working on this now and plan to push an updated commit before I go to bed; I just need to test it out once locally.

@JoshRosen
Contributor Author

Alright, added a rough first cut at progress bars. This code is functional, although it could be cleaner. I'll take another look tomorrow when I'm fresh.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23209 has finished for PR 3009 at commit a475ea1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This naming change is great

@kayousterhout
Contributor

@shivaram and I played around with this a bit and, after looking at it, having the total number of tasks as the only progress bar is somewhat unintuitive. I'd propose two changes: (1) add a second progress bar, that shows up to the left of the total tasks bar, that shows "Stages: Succeeded / Total". (2) For the tasks progress bar, change the title to say "Tasks (for all stages): Succeeded / Total". It will likely spill to two lines then, but that seems worthwhile to make this a little more clear.

One other suggestion: I wonder if it would be worthwhile to add a very short description at the top of the main jobs page that says something like 'A job is triggered by an action, like "count()" or "saveAsTextFile()". Click on a job's title to see information about the stages of tasks associated with the job.' Maybe this is too much clutter -- but I think it might be nice to clue people in to what they're looking at (even the wise @shivaram pointed out that he wasn't quite sure what a job was when we were looking at this).

Contributor

Can you move the code to compute the last stage's name up here? It was hard for me to figure out why you were grabbing the last stage here.

@kayousterhout
Contributor

Also this seems broken? (the stage should be in "Active Stages")

image

@JoshRosen
Contributor Author

Alright, I pushed that final cleanup commit. @andrewor14, want to take a final look on the JsonProtocol backwards-compatibility stuff?

@JoshRosen changed the title from "[SPARK-4145] [WIP] Web UI job pages" to "[SPARK-4145] Web UI job pages" on Nov 21, 2014
@SparkQA

SparkQA commented Nov 21, 2014

Test build #23698 has finished for PR 3009 at commit 6f17f3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

@JoshRosen I believe this is failing tests

@JoshRosen
Contributor Author

@pwendell Yep, it looks like a legitimate failure in ReplayListenerSuite: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23698/testReport/

I'm digging in now to understand the error message. It looks like it's failing this assertion:

    val originalEvents = sc.eventLogger.get.loggedEvents
    val replayedEvents = eventMonster.loggedEvents
    originalEvents.zip(replayedEvents).foreach { case (e1, e2) => assert(e1 === e2) }

I wonder if this is due to that StageInfo.equals() issue that I mentioned earlier.
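
The thread does not spell out the `StageInfo.equals()` issue, so the following is only a generic illustration of why an element-wise comparison like the one above depends on how `equals` is defined. `PlainInfo`, `CaseInfo`, and `sameEvents` are hypothetical names, not Spark classes.

```scala
object EventComparison {
  // A plain class inherits reference equality from AnyRef.
  final class PlainInfo(val stageId: Int)
  // A case class gets structural equals/hashCode generated by the compiler.
  final case class CaseInfo(stageId: Int)

  // Stricter variant of a zip-based check: it also compares lengths, because
  // zip silently drops the unmatched tail of the longer sequence.
  def sameEvents[A](original: Seq[A], replayed: Seq[A]): Boolean =
    original.length == replayed.length &&
      original.zip(replayed).forall { case (e1, e2) => e1 == e2 }
}
```

Two separately constructed `PlainInfo(1)` instances compare as different, while two `CaseInfo(1)` instances compare as equal, so an event that embeds a plain class can fail a replay comparison even when the data matches.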

@JoshRosen
Contributor Author

Ah, spotted the problem: I forgot to remove the line that wrote the Stage Ids JSON field, so this was mistakenly causing the read path to treat data written in the new format as though it was written using the old one.
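
As a sketch of how this kind of bug can arise, assume a reader that uses the presence of the legacy field as its marker for the old format. The types and functions below are hypothetical simplifications and do not reproduce `JsonProtocol`; the "Unknown" placeholder name is also an assumption.

```scala
object JobStartCompat {
  final case class StageInfoLite(stageId: Int, name: String)

  // Simplified stand-in for the serialized event: either field may be absent.
  final case class JobStartJson(
      stageIds: Option[Seq[Int]],
      stageInfos: Option[Seq[StageInfoLite]])

  // Hypothetical reader: if the legacy ids field is present, assume old-format
  // data and rebuild placeholder infos from the ids alone.
  def read(json: JobStartJson): Seq[StageInfoLite] = json.stageIds match {
    case Some(ids) => ids.map(id => StageInfoLite(id, "Unknown"))
    case None      => json.stageInfos.getOrElse(Seq.empty)
  }

  // Writer that still emits the legacy field alongside the new one.
  def writeWithLegacyField(infos: Seq[StageInfoLite]): JobStartJson =
    JobStartJson(Some(infos.map(_.stageId)), Some(infos))

  // Writer that emits only the new field.
  def write(infos: Seq[StageInfoLite]): JobStartJson =
    JobStartJson(None, Some(infos))
}
```

With `writeWithLegacyField`, a write-then-read round trip loses the stage names, so a replay test that compares original and replayed events would fail; with `write`, the round trip preserves them.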

@SparkQA

SparkQA commented Nov 21, 2014

Test build #23705 has finished for PR 3009 at commit ff804cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

This skipped thing looks great -- I withdraw my -0.5 (which I didn't realize meant this couldn't get merged into 1.2...didn't realize code voting was different than release voting) and am fine to merge this in! Did not do another detailed look at this code since it seems like Andrew had a close look.

Thanks for all of the hard work on this Josh!

@JoshRosen
Contributor Author

Argh, not again! That's what I get for playing whackamole with individual test suites without running all of them...

I've spotted the cause behind this latest test failure and I'm fixing it now.

The root problem was that I removed a field from the JSON and recomputed
it from another field.  In order for the backwards-compatibility test
to work, I needed to manually re-add the removed field in order to
construct JSON that’s in the right (old) format.
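
A minimal sketch of building such an old-format fixture, with JSON modeled as a plain `Map` to stay dependency-free. The "Stage Ids" field name comes from this thread; "Stage Infos", "Job ID", and the helper `toOldFormat` are assumptions made for illustration.

```scala
object OldFormatFixture {
  // JSON is modeled as a plain Map here; a real test would use a JSON library.
  type Json = Map[String, Any]

  // Drop the new field and manually re-add the removed legacy field so the
  // result looks like data written by an older version.
  def toOldFormat(newJson: Json, stageIds: Seq[Int]): Json =
    (newJson - "Stage Infos") + ("Stage Ids" -> stageIds)
}
```

The resulting map can then be fed to the reader under test to check that old-format input is still accepted.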
@SparkQA

SparkQA commented Nov 21, 2014

Test build #23714 has finished for PR 3009 at commit b89c258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2014

Test build #23735 has finished for PR 3009 at commit f00c851.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

@pwendell @andrewor14 I fixed the JsonProtocol compatibility issues that we discussed and added a note on compatibility guarantees; it would be great if you could take another look.

@JoshRosen
Contributor Author

Just noticed that the "Completed Stages" table still shows a "Kill" button:

image

This is the case on the stages page, too. Do you know if this was a bug in the existing web UI or whether there's a reason for this (maybe related to re-computed stages being cancellable)?

Contributor

Nice!

Contributor

at some point we should start adding forward compatibility tests

Contributor Author

I agree; I've filed https://issues.apache.org/jira/browse/SPARK-4555 for this.

@andrewor14
Contributor

Hey @JoshRosen JSON changes LGTM.

@pwendell
Contributor

Took a look at the new JSON stuff and it LGTM

Contributor

killEnabled should be set to false here (that's why you're seeing the kill button)

Contributor

(or don't set it -- as below -- the default is false)

Contributor Author

Ah, I now see that this is actually a change / regression from the 1.1.0 UI; I'll fix this up now.
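
The review suggestion relies on a default parameter value. A small sketch, assuming a table class whose `killEnabled` flag defaults to false; `StageTable` and `rows` are hypothetical and stand in for the real UI code, which is not shown in this thread.

```scala
// Hypothetical table class; only the parameter name killEnabled and its
// default of false come from the review comments.
final class StageTable(val stageIds: Seq[Int], val killEnabled: Boolean = false) {
  // One line of text per stage, with a kill link only when enabled.
  def rows: Seq[String] = stageIds.map { id =>
    if (killEnabled) s"Stage $id (kill)" else s"Stage $id"
  }
}
```

A completed-stages table built as `new StageTable(Seq(1, 2))` shows no kill links, while an active-stages table opts in with `killEnabled = true`.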

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23773 has finished for PR 3009 at commit eb05e90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

Okay thanks everyone who worked on this. I'm going to pull it in to make sure it gets into the next preview release.

@asfgit asfgit closed this in 4a90276 Nov 24, 2014
asfgit pushed a commit that referenced this pull request Nov 24, 2014
This PR adds two new pages to the Spark Web UI:

- A jobs overview page, which shows details on running / completed / failed jobs.
- A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.

### Screenshots

#### New UI homepage

![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)

#### Job details page

(This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)

![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)

### Key changes in this PR

- Rename `JobProgressPage` to `AllStagesPage`
- Expose `StageInfo` objects in the `SparkListenerJobStart` event; add backwards-compatibility tests to `JsonProtocol`.
- Add additional data structures to `JobProgressListener` to map from stages to jobs.
- Add several fields to `JobUIData`.

I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.

### Limitations

If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.

If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.

Author: Josh Rosen <[email protected]>

Closes #3009 from JoshRosen/job-page and squashes the following commits:

eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
f00c851 [Josh Rosen] Fix JsonProtocol compatibility
b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
0b77e3e [Josh Rosen] More bug fixes for phantom stages.
034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
d69c775 [Josh Rosen] Fix table sorting on all jobs page.
5eb39dc [Josh Rosen] Add pending stages table to job page.
f2a15da [Josh Rosen] Add status field to job details page.
171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
8955f4c [Josh Rosen] Display information for pending stages on jobs page.
8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
1145c60 [Josh Rosen] Display text instead of progress bar for stages.
3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
85e9c85 [Josh Rosen] Extract startTime into separate variable.
1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
56701fa [Josh Rosen] Move last stage name / description logic out of markup.
a475ea1 [Josh Rosen] Add progress bars to jobs page.
45343b8 [Josh Rosen] More comments
4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:

(cherry picked from commit 4a90276)
Signed-off-by: Patrick Wendell <[email protected]>
@JoshRosen JoshRosen deleted the job-page branch October 24, 2015 19:52