[SPARK-32287][TESTS] Flaky Test: ExecutorAllocationManagerSuite.add executors default profile #29225
// SPARK-22864: effectively disable the allocation schedule by setting the period to a
// really long value.
- .set(TEST_SCHEDULE_INTERVAL, 10000L)
+ .set(TEST_SCHEDULE_INTERVAL, 30000L)
So, this is for the same reason as SPARK-22864 at line 1604?
@tgravescs It happened again at: https://github.com/apache/spark/pull/29418/checks?check_run_id=975304054
BTW, this change only prevents the second automatic invocation of schedule(); the first invocation of schedule() always happens because the initialDelay is 0?
| executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS) |
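To illustrate the point about the initial delay, here is a minimal standalone sketch using the same `java.util.concurrent` call shape as the quoted line. `InitialDelayDemo` and `firstRunHappensPromptly` are hypothetical names introduced only for this example; this is not Spark's code. It checks whether the task runs shortly after scheduling even when the interval is 30 seconds.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Hypothetical demo object; not part of Spark.
object InitialDelayDemo {
  /** Returns true if the scheduled task ran at least once within `waitMillis`. */
  def firstRunHappensPromptly(intervalMillis: Long, waitMillis: Long): Boolean = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val firstRun = new CountDownLatch(1)
    val task = new Runnable {
      override def run(): Unit = firstRun.countDown()
    }
    try {
      // Initial delay 0, then `intervalMillis` between the end of one run and the next.
      executor.scheduleWithFixedDelay(task, 0, intervalMillis, TimeUnit.MILLISECONDS)
      // Wait far less than the interval; only the initial run can satisfy this.
      firstRun.await(waitMillis, TimeUnit.MILLISECONDS)
    } finally {
      executor.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    val ran = firstRunHappensPromptly(intervalMillis = 30000L, waitMillis = 2000L)
    println(s"first run within 2s despite a 30s interval: $ran")
  }
}
```

With an initial delay of 0 the first run is submitted immediately, so a longer interval would only postpone the second and later runs.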
I'm not seeing a test report in that run?
Yes, but it seemed like the failures we were seeing before were after an iteration or two of the test starting, although it did vary. If we can see the logs from it we should be able to see timing and possibly some log messages to tell us if that is the problem.
> I'm not seeing a test report in that run?

You need to scroll to the end of the "Run tests: core, unsafe, kvstore, avro" section to see the failed tests.

> Yes, but it seemed like the failures we were seeing before were after an iteration or two of the test starting, although it did vary.

OK, I was not aware of that...

> If we can see the logs from it we should be able to see timing and possibly some log messages to tell us if that is the problem.

@tgravescs A feasible way to debug it is to open a PR in your own fork repository and add some println statements to the source code.
@HyukjinKwon I think we still cannot see the complete logs in GitHub Actions, like the unit-tests.log in Jenkins. I mean, even if we download the archived logs after the checks have finished.
I think I can try to upload that specific log file as an artifact (so we can download it). Let me take a look in the coming few days.
BTW, the unit-tests.log file is generated by log4j:
| log4j.appender.file.file=target/unit-tests.log |
thanks for updating and uploading the logs, let me know if you see a build that has this test failure and I'll look at the log
+1, LGTM since this is a test-only configuration and is already supposed to have a really long value since 2.3.0.
Also, GitHub Actions passed. Merged to master/3.0.
…xecutors default profile

I wasn't able to reproduce the failure but the best I can tell is that the allocation manager timer triggers and call doRequest. The timeout is 10s so try to increase that to 30seconds.

test failure

no

unit test

Closes #29225 from tgravescs/SPARK-32287.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e6ef27b)
Signed-off-by: Dongjoon Hyun <[email protected]>
thanks @dongjoon-hyun if you see this failing again let me know and I'll try to reproduce again. I ran it in a subset of the tests in a loop and wasn't able to reproduce locally.
Sure~ I'll monitor on GitHub Action result. Thanks.
Test build #126506 has finished for PR 29225 at commit
Late LGTM. Thank you @tgravescs
Thank you @tgravescs for fixing this.
I saw this again: with: at https://github.com/apache/spark/pull/29278/checks?check_run_id=930261611. I am retriggering. I will monitor a bit more and update here. If the flakiness is very rare, it would be fine for now.
Is there any way to access the detailed unit test logs from GitHub Actions?
It would be really nice if we could get some timestamps on how long tests were taking.
Oh, it shows the timestamps when you download the log. I'll share the timestamps too later when I happen to see this again.
### What changes were proposed in this pull request?

This PR proposes to upload `target/unit-tests.log` into the artifact so it will be able to download here: 

### Why are the changes needed?

Jenkins has this feature. It should be best to have the same dev functionalities with it. Also, note that this was pointed out #29225 (comment).

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

https://github.com/apache/spark/actions/runs/213000777 should demonstrate it

Closes #29454 from HyukjinKwon/SPARK-32645.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@Ngone51 found this again for the flaky ExecutorAllocationManagerSuite here: https://github.com/apache/spark/pull/29452/checks?check_run_id=1008962797
The entire test suite ran in less than 5 seconds. The logs that I see are fine up until it fails, but the output is intermixed with other test suites running, so it's hard to differentiate some of the logs. The only path I see where this can change would be on a decrement triggered by an updateAndSync, and the only way it should hit that is if the timer fired. So it might be the initial timer as stated above. I'll put up a PR with some more debugging enabled to see if I can verify it.
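The suspected interference can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyAllocationManager`, `addExecutors`, `schedule`, and `numExecutorsTarget` are hypothetical stand-ins and do not reproduce Spark's actual logic. The sketch only shows how a timer-driven sync that lowers the target between two direct calls would change the value a test asserts on.

```scala
// Toy model only: hypothetical names, not Spark's ExecutorAllocationManager.
final class ToyAllocationManager(initialTarget: Int, needed: Int = 0) {
  private var target = initialTarget

  def numExecutorsTarget: Int = synchronized { target }

  // Called directly by the test to ramp the target up.
  def addExecutors(n: Int): Int = synchronized {
    target += n
    target
  }

  // Called by the timer: sync the target down to what is actually needed.
  def schedule(): Unit = synchronized {
    if (target > needed) target = needed
  }
}

object TimerInterferenceDemo {
  /** Runs the same test steps, optionally with a timer tick in between. */
  def run(timerFires: Boolean): Int = {
    val manager = new ToyAllocationManager(initialTarget = 1)
    manager.addExecutors(1)            // the test expects a target of 2 here
    if (timerFires) manager.schedule() // stands in for an unexpected timer tick
    manager.addExecutors(2)            // the test expects a target of 4 here
    manager.numExecutorsTarget
  }

  def main(args: Array[String]): Unit = {
    println(s"without tick: ${run(timerFires = false)}") // 4
    println(s"with tick:    ${run(timerFires = true)}")  // 2
  }
}
```

In this model a single tick at the wrong moment is enough to make the final assertion fail, which is consistent with a failure that appears only occasionally.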
Thank you a lot, @tgravescs, for taking a look at this.
What changes were proposed in this pull request?
I wasn't able to reproduce the failure, but the best I can tell is that the allocation manager timer triggers and calls doRequest. The timeout is 10s, so try to increase that to 30 seconds.
Why are the changes needed?
test failure
Does this PR introduce any user-facing change?
no
How was this patch tested?
unit test