[SPARK-4145] Web UI job pages #3009
This will avoid confusion once we have a page that lists all jobs.
This PR adds two new pages to the Spark Web UI:

- A jobs overview page, which shows details on running / completed / failed jobs.
- A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.

In some situations, the jobs page may display "No information on stage" for some stages of pending jobs that have not begun executing. This is due to some limitations in how `JobProgressListener` finds out about stages. We can address this later as part of a separate scheduler PR.
|
Test build #22506 has finished for PR 3009 at commit
|
|
@pwendell @andrewor14 @kayousterhout Could one of you review this? |
This comment should be updated to say "jobs" instead of "stages" right?
Good catch.
|
@JoshRosen this looks great!! One thing that seems missing here is some kind of status bar on the jobs page. I know there was a bunch of debate on what an appropriate status bar would be; can you do something simple, like completed stages / total stages? |
|
@kayousterhout I like the progress bar idea and I'm working on implementing it now. |
|
Actually, there's one subtlety: a single stage might be shared by multiple jobs, so when we get a StageCompletion event we don't know which job triggered that stage. Therefore, we don't know which I do have a job -> stage mapping, so I suppose that I could use this to query the stageInfo structures to see whether they've succeeded or failed, but this risks weird behavior where a recomputation of a stage in a later job causes an earlier job's progress bar to regress. Is there a simple solution here? |
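The ambiguity described here can be illustrated with a minimal Python sketch. Spark itself is written in Scala, and the class and method names below are hypothetical, not Spark's API. The sketch assumes a listener that records each job's stage IDs at job start; a later stage-completion event carries only a stage ID, so every job that shares that stage is a candidate.

```python
from collections import defaultdict


class StageJobIndex:
    """Hypothetical index from stage IDs to the jobs that reference them."""

    def __init__(self):
        self.stage_to_jobs = defaultdict(set)   # stage ID -> job IDs
        self.job_to_stages = {}                 # job ID -> stage IDs

    def on_job_start(self, job_id, stage_ids):
        self.job_to_stages[job_id] = set(stage_ids)
        for stage_id in stage_ids:
            self.stage_to_jobs[stage_id].add(job_id)

    def on_job_end(self, job_id):
        # Drop the job's entries so the index does not grow without bound.
        for stage_id in self.job_to_stages.pop(job_id, set()):
            jobs = self.stage_to_jobs[stage_id]
            jobs.discard(job_id)
            if not jobs:
                del self.stage_to_jobs[stage_id]

    def jobs_for_stage(self, stage_id):
        # A completion event names only the stage, so all sharing jobs match.
        return sorted(self.stage_to_jobs.get(stage_id, set()))


index = StageJobIndex()
index.on_job_start(job_id=0, stage_ids=[0, 1])
index.on_job_start(job_id=1, stage_ids=[0, 2])   # stage 0 is shared
print(index.jobs_for_stage(0))                   # prints [0, 1]
```

With this kind of index, a completion of stage 0 cannot be attributed to job 0 or job 1 alone, which is the situation the comment above is asking about.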
|
Hm this seems like a more general problem right? Like what if you have: Job A: RDD 1 ----(shuffle dependency)----> RDD 2 ----(second shuffle dependency)--> RDD 3 Job B: RDD 1 ----(shuffle dependency)----> RDD 4 Now say: The re-computation should really only show up in Job B's UI, because the re-computation is no longer relevant to job A (it's home free because RDD2 already finished). Is this uncommon enough that we should just leave the weird re-computation behavior? |
|
Test build #22965 has finished for PR 3009 at commit
|
|
Test build #22966 has finished for PR 3009 at commit
|
|
One approach, which I'm exploring now, is to show a combined progress bar that just tracks "total number of tasks" summed across all stages involved in that job. To prevent weird "progress bar travels back in time" issues, I'm planning to extend |
|
This sounds great. I do think it would be good to get this job page change On Mon, Nov 10, 2014 at 7:42 PM, Josh Rosen [email protected]
|
|
Agreed. This change actually involves some potentially costly listener operations, so it would be great to get this in soon so it can get lots of testing / review. I'm actually working on this now and plan to push an updated commit before I go to bed; I just need to test it out once locally. |
|
Alright, added a rough first cut at progress bars. This code is functional, although it could be cleaner. I'll take another look tomorrow when I'm fresh. |
|
Test build #23209 has finished for PR 3009 at commit
|
This naming change is great
|
@shivaram and I played around with this a bit and, after looking at it, having the total number of tasks as the only progress bar is somewhat unintuitive. I'd propose two changes:

(1) Add a second progress bar, to the left of the total tasks bar, that shows "Stages: Succeeded / Total".

(2) For the tasks progress bar, change the title to say "Tasks (for all stages): Succeeded / Total". It will likely spill to two lines then, but that seems worthwhile to make this a little more clear.

One other suggestion: I wonder if it would be worthwhile to add a very short description at the top of the main jobs page that says something like 'A job is triggered by an action, like "count()" or "saveAsTextFile()". Click on a job's title to see information about the stages of tasks associated with the job.' Maybe this is too much clutter -- but I think it might be nice to clue people in to what they're looking at (even the wise @shivaram pointed out that he wasn't quite sure what a job was when we were looking at this).
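A rough Python sketch of how the two proposed bars could be backed by per-job counters follows. It is an illustration under assumptions, not the code in this PR: the `JobProgress` class and its methods are hypothetical, and it assumes the number of tasks per stage is known when the job starts. Completed stages and tasks are kept in sets, so a recomputation of the same stage or task does not inflate the counts.

```python
class JobProgress:
    """Hypothetical per-job counters behind a stages bar and a tasks bar."""

    def __init__(self, tasks_per_stage):
        # tasks_per_stage: stage ID -> number of tasks, assumed known at job start.
        self.tasks_per_stage = dict(tasks_per_stage)
        self.completed_stages = set()
        self.completed_tasks = set()   # (stage ID, task index) pairs

    def on_task_success(self, stage_id, task_index):
        # Adding to a set makes a re-run of the same task a no-op.
        if stage_id in self.tasks_per_stage:
            self.completed_tasks.add((stage_id, task_index))

    def on_stage_success(self, stage_id):
        if stage_id in self.tasks_per_stage:
            self.completed_stages.add(stage_id)

    def stages_label(self):
        return f"Stages: {len(self.completed_stages)}/{len(self.tasks_per_stage)}"

    def tasks_label(self):
        total = sum(self.tasks_per_stage.values())
        return f"Tasks (for all stages): {len(self.completed_tasks)}/{total}"


progress = JobProgress({0: 4, 1: 2})
for i in range(4):
    progress.on_task_success(0, i)
progress.on_task_success(0, 3)      # recomputed task, counted once
progress.on_stage_success(0)
progress.on_stage_success(0)        # recomputed stage, counted once
print(progress.stages_label())      # prints Stages: 1/2
print(progress.tasks_label())       # prints Tasks (for all stages): 4/6
```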
Can you move the code to compute the last stage's name up here? It was hard for me to figure out why you were grabbing the last stage here.
|
Alright, I pushed that final cleanup commit. @andrewor14, want to take a final look on the JsonProtocol backwards-compatibility stuff? |
|
Test build #23698 has finished for PR 3009 at commit
|
|
@JoshRosen I believe this is failing tests |
|
@pwendell Yep, it looks like a legitimate failure in ReplayListenerSuite: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23698/testReport/ I'm digging in now to understand the error message. It looks like it's failing this assertion: I wonder if this is due to that |
|
Ah, spotted the problem: I forgot to remove the line that wrote the |
|
Test build #23705 has finished for PR 3009 at commit
|
|
This skipped thing looks great -- I withdraw my -0.5 (which I didn't realize meant this couldn't get merged into 1.2...didn't realize code voting was different than release voting) and am fine to merge this in! Did not do another detailed look at this code since it seems like Andrew had a close look. Thanks for all of the hard work on this Josh! |
|
Argh, not again! That's what I get for playing whackamole with individual test suites without running all of them... I've spotted the cause behind this latest test failure and I'm fixing it now. |
The root problem was that I removed a field from the JSON and recomputed it from another field. In order for the backwards-compatibility test to work, I needed to manually re-add the removed field in order to construct JSON that’s in the right (old) format.
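The pattern described here (a field dropped from the JSON and recomputed from another field, with a test that rebuilds the old layout by hand) can be sketched in Python roughly as follows. The field names and helper functions are illustrative assumptions, not a statement of Spark's actual event format or of `JsonProtocol`'s code.

```python
import json


def stage_ids_from_job_start(event):
    """Read stage IDs from a job-start event in either JSON layout.

    Field names are assumptions made for this sketch.
    """
    if "Stage Infos" in event:
        # Newer layout: derive the IDs from the richer per-stage records.
        return [info["Stage ID"] for info in event["Stage Infos"]]
    # Older layout: only a flat list of IDs was written.
    return list(event["Stage IDs"])


def to_old_format(event):
    """Build old-layout JSON for a compatibility test by re-adding the removed field."""
    old = {k: v for k, v in event.items() if k != "Stage Infos"}
    old["Stage IDs"] = stage_ids_from_job_start(event)
    return old


new_event = {
    "Event": "SparkListenerJobStart",
    "Job ID": 7,
    "Stage Infos": [
        {"Stage ID": 1, "Number of Tasks": 4},
        {"Stage ID": 2, "Number of Tasks": 8},
    ],
}

# Round-trip through a JSON string, as a reader of an old event log would.
old_event = json.loads(json.dumps(to_old_format(new_event)))
assert "Stage Infos" not in old_event
assert stage_ids_from_job_start(old_event) == stage_ids_from_job_start(new_event)
```

The point of `to_old_format` is that the test input has to look like what an older writer produced; feeding the reader only new-layout JSON would not exercise the fallback branch.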
|
Test build #23714 has finished for PR 3009 at commit
|
|
Test build #23735 has finished for PR 3009 at commit
|
|
@pwendell @andrewor14 I fixed the JsonProtocol compatibility issues that we discussed and added a note on compatibility guarantees; it would be great if you could take another look. |
Nice!
at some point we should start adding forward compatibility tests
I agree; I've filed https://issues.apache.org/jira/browse/SPARK-4555 for this.
|
Hey @JoshRosen JSON changes LGTM. |
|
Took a look at the new JSON stuff and it LGTM |
killEnabled should be set to false here (that's why you're seeing the kill button)
(or don't set it -- as below -- the default is false)
Ah, I now see that this is actually a change / regression from the 1.1.0 UI; I'll fix this up now.
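The review point about `killEnabled` defaulting to false can be pictured with a small Python sketch. The function name, markup, and link target are hypothetical and only illustrate the idea of an opt-in flag; they are not the UI code in this PR.

```python
def stage_row(stage_id, description, kill_enabled=False):
    """Render one table row; the kill link appears only when explicitly enabled."""
    # The href is a placeholder for this sketch, not a real Spark UI endpoint.
    kill_link = f' <a href="#kill-{stage_id}">(kill)</a>' if kill_enabled else ""
    return f"<tr><td>{stage_id}</td><td>{description}{kill_link}</td></tr>"


# Completed-stages table: rely on the default, so no kill link is rendered.
completed_row = stage_row(3, "count at Example.scala:10")

# Active-stages table: opt in explicitly.
active_row = stage_row(4, "collect at Example.scala:12", kill_enabled=True)
```

Relying on the default for tables of finished stages means a kill link can only appear where a caller asked for it.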
|
Test build #23773 has finished for PR 3009 at commit
|
|
Okay thanks everyone who worked on this. I'm going to pull it in to make sure it gets into the next preview release. |
This PR adds two new pages to the Spark Web UI:

- A jobs overview page, which shows details on running / completed / failed jobs.
- A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.

### Screenshots

#### New UI homepage

#### Job details page

(This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)

### Key changes in this PR

- Rename `JobProgressPage` to `AllStagesPage`
- Expose `StageInfo` objects in the `SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol.
- Add additional data structures to `JobProgressListener` to map from stages to jobs.
- Add several fields to `JobUIData`.

I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.

### Limitations

If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.

If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.

Author: Josh Rosen <[email protected]>

Closes #3009 from JoshRosen/job-page and squashes the following commits:

eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
f00c851 [Josh Rosen] Fix JsonProtocol compatibility
b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
0b77e3e [Josh Rosen] More bug fixes for phantom stages.
034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
d69c775 [Josh Rosen] Fix table sorting on all jobs page.
5eb39dc [Josh Rosen] Add pending stages table to job page.
f2a15da [Josh Rosen] Add status field to job details page.
171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
8955f4c [Josh Rosen] Display information for pending stages on jobs page.
8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
1145c60 [Josh Rosen] Display text instead of progress bar for stages.
3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
85e9c85 [Josh Rosen] Extract startTime into separate variable.
1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
56701fa [Josh Rosen] Move last stage name / description logic out of markup.
a475ea1 [Josh Rosen] Add progress bars to jobs page.
45343b8 [Josh Rosen] More comments
4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:

(cherry picked from commit 4a90276)
Signed-off-by: Patrick Wendell <[email protected]>

