@tgravescs
Contributor

What changes were proposed in this pull request?

I wasn't able to reproduce the failure, but the best I can tell is that the allocation manager timer triggers and calls doRequest. The timeout is 10 seconds, so try increasing it to 30 seconds.

Why are the changes needed?

test failure

Does this PR introduce any user-facing change?

no

How was this patch tested?

unit test

@tgravescs
Contributor Author

@Ngone51

// SPARK-22864: effectively disable the allocation schedule by setting the period to a
// really long value.
- .set(TEST_SCHEDULE_INTERVAL, 10000L)
+ .set(TEST_SCHEDULE_INTERVAL, 30000L)
Member

So, this is for the same reason as SPARK-22864 at line 1604?

Member

@tgravescs It happens again at: https://github.com/apache/spark/pull/29418/checks?check_run_id=975304054

BTW, this change only prevents the second automatic invocation of schedule(). The first invocation of schedule() always happens because the initialDelay is 0, right?

executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
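To illustrate the point about the initial delay, here is a minimal standalone sketch that is not taken from the Spark code base. The object name `InitialDelayDemo` and the helper `runsWithin` are hypothetical and exist only for this illustration. It counts how often a task scheduled with `scheduleWithFixedDelay` runs within a short window. With an initial delay of 0 the task should run once almost immediately, no matter how long the interval is, while a long initial delay should keep it from running at all during that window.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical demo object, not part of Spark.
object InitialDelayDemo {

  // Schedules a counting task with the given initial delay and interval,
  // waits `waitMillis`, and returns how many times the task ran.
  def runsWithin(initialDelayMillis: Long, intervalMillis: Long, waitMillis: Long): Int = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    val runs = new AtomicInteger(0)
    val task = new Runnable {
      override def run(): Unit = { runs.incrementAndGet(); () }
    }
    try {
      executor.scheduleWithFixedDelay(
        task, initialDelayMillis, intervalMillis, TimeUnit.MILLISECONDS)
      Thread.sleep(waitMillis)
      runs.get()
    } finally {
      // Stop the timer thread so the demo does not leak it.
      executor.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    // Initial delay 0: the first run is not held back by the 30s interval.
    println(runsWithin(0L, 30000L, 500L))
    // Initial delay equal to the interval: nothing runs in the short window.
    println(runsWithin(30000L, 30000L, 500L))
  }
}
```

If this model matches what the allocation manager does, raising the interval alone would only postpone the second and later invocations, which is the concern raised above.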

Contributor Author

I'm not seeing a test report in that run?
Yes, but it seemed like the failures we were seeing before were after an iteration or two of the test starting, although it did vary. If we can see the logs from it, we should be able to see the timing and possibly some log messages to tell us if that is the problem.

Member

> I'm not seeing a test report in that run?

You need to scroll to the end of the "Run tests: core, unsafe, kvstore, avro" section to see the failed tests.

> Yes, but it seemed like the failures we were seeing before were after an iteration or two of the test starting, although it did vary.

OK, I was not aware of that...

> If we can see the logs from it, we should be able to see the timing and possibly some log messages to tell us if that is the problem.

@tgravescs A feasible way to debug it is to open a PR in your own fork and add some println statements to the source code.

@HyukjinKwon I think we still cannot see the complete logs in GitHub Actions, like the unit-tests.log in Jenkins. I mean, even if we download the archived logs after the checks finish.

Member

@HyukjinKwon HyukjinKwon Aug 17, 2020

I think I can try to upload that specific log file as an artifact (so we can download it). Let me take a look in the coming few days.

Member

BTW, the unit-tests.log file is generated by log4j:

log4j.appender.file.file=target/unit-tests.log

Contributor Author

Thanks for updating and uploading the logs. Let me know if you see a build that has this test failure and I'll look at the log.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM since this is a test-only configuration that has been supposed to have a really long value since 2.3.0.
Also, GitHub Actions passed. Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Jul 24, 2020
…xecutors default profile

I wasn't able to reproduce the failure, but the best I can tell is that the allocation manager timer triggers and calls doRequest. The timeout is 10 seconds, so try increasing it to 30 seconds.

test failure

no

unit test

Closes #29225 from tgravescs/SPARK-32287.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e6ef27b)
Signed-off-by: Dongjoon Hyun <[email protected]>
@tgravescs
Contributor Author

Thanks @dongjoon-hyun. If you see this failing again, let me know and I'll try to reproduce it again. I ran a subset of the tests in a loop and wasn't able to reproduce it locally.

@dongjoon-hyun
Member

Sure~ I'll monitor the GitHub Actions results. Thanks.

@SparkQA

SparkQA commented Jul 24, 2020

Test build #126506 has finished for PR 29225 at commit 3af6ab9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Jul 27, 2020

Late LGTM. Thank you @tgravescs

@HyukjinKwon
Member

Thank you @tgravescs for fixing this.

@HyukjinKwon
Member

I saw this again:

[info] - add executors default profile *** FAILED *** (38 milliseconds)
[info]   4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)

with:

[info] - add executors capped by num pending tasks *** FAILED *** (77 milliseconds)
[info]   6 did not equal 5 (ExecutorAllocationManagerSuite.scala:428)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$18(ExecutorAllocationManagerSuite.scala:428)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:158)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)

at https://github.com/apache/spark/pull/29278/checks?check_run_id=930261611. I am retriggering.

I will monitor a bit more and update here. If the flakiness is very rare, it would be fine for now.

@tgravescs
Contributor Author

Is there any way to access the detailed unit test logs from GitHub Actions?

@tgravescs
Contributor Author

It would be really nice if we could get some timestamps on how long tests were taking

@HyukjinKwon
Member

HyukjinKwon commented Aug 1, 2020

Oh, it shows the timestamps when you download the log. I'll share the timestamps too when I happen to see this again.

HyukjinKwon added a commit that referenced this pull request Aug 19, 2020
### What changes were proposed in this pull request?

This PR proposes to upload `target/unit-tests.log` as an artifact so that it can be downloaded here:
![Screen Shot 2020-08-18 at 2 23 18 PM](https://user-images.githubusercontent.com/6477701/90474095-789e3b80-e15f-11ea-87f8-e7da3df3c03e.png)

### Why are the changes needed?

Jenkins has this feature. It would be best to have the same dev functionality in GitHub Actions.
Also, note that this was pointed out in #29225 (comment).

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

https://github.com/apache/spark/actions/runs/213000777 should demonstrate it

Closes #29454 from HyukjinKwon/SPARK-32645.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@tgravescs
Contributor Author

so @Ngone51 found this again here:

for the flaky ExecutorAllocationManagerSuite : https://github.com/apache/spark/pull/29452/checks?check_run_id=1008962797

The entire test suite ran in less than 5 seconds.

The logs that I see are fine up until it fails, but the output is intermixed with other test suites running, so it's hard to differentiate some of the logs. The only path I see where this can change would be a decrement triggered by an updateAndSync, and the only way it should hit that is if the timer fired. So it might be the initial timer as stated above. I'll put up a PR with some more debugging enabled to see if I can verify it.

@HyukjinKwon
Member

Thanks a lot, @tgravescs, for taking a look at this.
