[SPARK-40364][CORE] Use the unified DBProvider#initDB method
#37826
Conversation
cc @mridulm @tgravescs and @squito for further discussion; the previous thread is here: #37648 (comment)
The GA failure is not related to the current PR; we need to wait for #37815.
GHCR has recovered; I just retriggered the build.
    return tmpDb;
  }

  @VisibleForTesting
As @mridulm said in the previous review, this does add a bit of a behavior change where we could hide a corruption that likely would have failed the test before. That seems like a pretty rare case though, and the test wasn't specifically trying to test it.
Honestly I can go either way on this change. I'm ok with it, but at the same time I don't think these extra functions are much of a maintenance burden, unless you had other reasons for removing them?
When I tried to construct some failing cases to compare the behavior before and after this PR, I found something interesting:
There are 3 test cases related to the code we discussed: YarnShuffleIntegrationWithLevelDBBackendSuite, YarnShuffleAuthWithLevelDBBackendSuite, and YarnShuffleAlternateNameConfigWithLevelDBBackendSuite.
But when I add println(s"registeredExecFile = $registeredExecFile") after val registeredExecFile = YarnTestAccessor.getRegisteredExecutorFile(shuffleService) (line 73 in the code below):
Lines 69 to 86 in 5a599de
test("external shuffle service") {
  val shuffleServicePort = YarnTestAccessor.getShuffleServicePort
  val shuffleService = YarnTestAccessor.getShuffleServiceInstance
  val registeredExecFile = YarnTestAccessor.getRegisteredExecutorFile(shuffleService)
  val result = File.createTempFile("result", null, tempDir)
  val finalState = runSpark(
    false,
    mainClassName(YarnExternalShuffleDriver.getClass),
    appArgs = if (registeredExecFile != null) {
      Seq(result.getAbsolutePath, registeredExecFile.getAbsolutePath)
    } else {
      Seq(result.getAbsolutePath)
    },
    extraConf = extraSparkConf()
  )
  checkResult(finalState, result)
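Concretely, the diagnostic is just a temporary debug print right after the registeredExecFile assignment (a restatement of the change described above, not code from the PR):

  val registeredExecFile = YarnTestAccessor.getRegisteredExecutorFile(shuffleService)
  // temporary debug output for this investigation
  println(s"registeredExecFile = $registeredExecFile")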
I found that all 3 cases print registeredExecFile = null, so the code we discussed is not actually executed, because registeredExecFile is always null...
I also tested these 3 cases with Spark 3.3 and the problem exists there too (registeredExecFile = null). I haven't found out in which version this code stopped being executed, but it may be a very old bug.
@tgravescs @mridulm do you know when setRecoveryPath should be called on YarnShuffleService? It seems these tests do not call setRecoveryPath before serviceInit, so registeredExecFile is always null.
I don't remember in this particular case; it's called from YarnShuffleServiceSuite. Maybe it was never set up correctly here, in which case maybe it doesn't matter and we aren't changing behavior.
Yes, I saw that YarnShuffleServiceSuite calls it. But I'm a little curious: how does the NodeManager set the recovery path? If it is not set, is LevelDB simply never initialized? Let me investigate.
From the context of #19032, whether LevelDB is enabled or disabled depends on YarnConfiguration.NM_RECOVERY_ENABLED.
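As a minimal sketch of what that means (using Hadoop's standard YarnConfiguration constants; the recovery directory value is illustrative and this is my reading of the discussion, not code from the PR): only when NM recovery is on does the NodeManager hand a recovery path to auxiliary services such as YarnShuffleService via setRecoveryPath, which is what lets the shuffle service create its DB under the recovery directory.

  import java.io.File
  import org.apache.hadoop.yarn.conf.YarnConfiguration

  val yarnConf = new YarnConfiguration()
  // yarn.nodemanager.recovery.enabled: with this on, the NodeManager passes a
  // recovery path to each auxiliary service, so registeredExecFile is non-null.
  yarnConf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
  // yarn.nodemanager.recovery.dir: where that recovery state (and the shuffle
  // service DB) would live; the path here is only an example.
  yarnConf.set(YarnConfiguration.NM_RECOVERY_DIR, new File("/tmp/nm-recovery").getAbsolutePath)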
@mridulm It seems that the LevelDB/RocksDB disk store is not always enabled when YarnShuffleService is used:
spark/docs/spark-standalone.md
Lines 321 to 330 in 0996a15
<td><code>spark.shuffle.service.db.enabled</code></td>
<td>true</td>
<td>
  Store External Shuffle service state on local disk so that when the external shuffle service is restarted, it will
  automatically reload info on current executors. This only affects standalone mode (yarn always has this behavior
  enabled). You should also enable <code>spark.worker.cleanup.enabled</code>, to ensure that the state
  eventually gets cleaned up. This config may be removed in the future.
</td>
<td>3.0.0</td>
</tr>
The description "(yarn always has this behavior enabled)" of spark.shuffle.service.db.enabled is incorrect; the behavior of YarnShuffleService is not controlled by this config.
Opened a PR to fix this description: #37853
friendly ping @sunchao, do you know I found
Lines 46 to 54 in bf5103a
add, and then it failed as follows. It seems that we need to use a static port to support testing recovery.
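For context, pinning the port would just mean fixing spark.shuffle.service.port instead of letting each run pick its own (a sketch of the idea only; 7337 is Spark's documented default and any fixed free port would do):

  // Illustrative only: a fixed shuffle service port lets a restarted service be
  // reached at the same address across restarts when testing recovery.
  val staticPortConf: Map[String, String] = Map("spark.shuffle.service.port" -> "7337")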
So I think
Thanks for the ping @LuciferYang, I'll take a look at the shading issue in Hadoop 3.3.4.
Thanks @sunchao
I am unsure about this test; I will rely on @tgravescs to comment on it.
@tgravescs I tried to add
The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle, then I would say we just ignore the test as one of those things that we can't easily unit test.
Let me try |
Sorry for missing this comment |
What changes were proposed in this pull request?
There are 2 initDB methods in DBProvider:

DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
DB initDB(DBBackend dbBackend, File file)

The first method is used in production code; the second one is only used by ShuffleTestAccessor for YarnShuffleIntegrationSuite, as follows:
spark/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnShuffleIntegrationSuite.scala
Lines 151 to 170 in 0be27b6
This PR changes that code to use the first method instead of the second one.
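As a rough sketch of what that looks like on the test side (the imports, backend choice, path, and store version below are my assumptions based on the two signatures quoted above, not code copied from the PR):

  import java.io.File
  import com.fasterxml.jackson.databind.ObjectMapper
  import org.apache.spark.network.shuffledb.{DBBackend, StoreVersion}
  import org.apache.spark.network.util.DBProvider

  val dbFile = new File("/tmp/registeredExecutors.ldb")  // illustrative path

  // Before: the file-only, test-only overload.
  // val db = DBProvider.initDB(DBBackend.LEVELDB, dbFile)

  // After: the overload used by production code, which also takes a store version
  // and a JSON mapper, so tests exercise the same initialization path.
  val db = DBProvider.initDB(DBBackend.LEVELDB, dbFile, new StoreVersion(1, 0), new ObjectMapper())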
Why are the changes needed?
Code clean up.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass GitHub Actions.