[SPARK-33782][K8S][CORE] Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode #37417
Conversation
@HyukjinKwon Please review; the build is failing because of an unrelated error (it passes locally).
Can one of the admins verify this patch?
```scala
test("SPARK-33782 : handles k8s files download to current directory") {
```
Let's import this at the top
Hm, why do we try this unpack for jars and files too? I think we should just call downloadFileList for them
cc @dongjoon-hyun @tgravescs @Ngone51 FYI
Is this true? If I specify a file to pass along and my code references it, what did it do before? Normally on YARN I would reference it like ./XXXX. I'm a bit worried here about breaking people if we move the location and they expected it somewhere else.
See https://github.com/apache/spark/pull/30735/files#r540935585 : on K8s, files are not copied to the current working directory. They are available locally but not in the driver's current working directory (the way they are in YARN).
ping @dongjoon-hyun @Ngone51
ping @dongjoon-hyun @Ngone51. I think this is an important feature to have the same behaviour as YARN. Please review.
```scala
// before
if (isArchive) Utils.unpack(source, dest) else Files.copy(source.toPath, dest.toPath)

// after
if (isArchive) {
  Utils.unpack(source, dest)
} else {
  Files.copy(source.toPath, dest.toPath)
}
```
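The copy-or-unpack branch discussed in the suggestion above can be sketched in Python. This is a hypothetical standalone helper for illustration, not Spark's actual code; `fetch_to_cwd` and its parameters are made up for the example, and it assumes zip archives:

```python
import shutil
import zipfile
from pathlib import Path

def fetch_to_cwd(source: str, dest_dir: str = ".", is_archive: bool = False) -> Path:
    """Place a downloaded file into dest_dir: archives are extracted,
    plain files are copied as-is (mirrors the reviewed branch)."""
    src = Path(source)
    dest = Path(dest_dir) / src.name
    if is_archive:
        # Extract the archive's contents into the destination directory
        with zipfile.ZipFile(src) as zf:
            zf.extractall(dest_dir)
    else:
        # Plain file: copy it into the destination directory
        shutil.copy2(src, dest)
    return dest
```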
jarsLocalJars -> localJars?
Since localJars is already defined above, I used this variable name (similar to filesLocalFiles).
revert unnecessary change?
nit:

```scala
// before
test("SPARK-33782 : handles k8s files download to current directory") {
// after
test("SPARK-33782: handles k8s files download to current directory") {
```
The change generally looks good to me. It'd be good if @dongjoon-hyun could take a look since he has more knowledge of K8s.
Thanks @Ngone51 for reviewing this; I'll make the suggested changes. @dongjoon-hyun, please review the PR.
Gentle ping @dongjoon-hyun
@dongjoon-hyun, please review this; it will help K8s get functionality similar to YARN. Since this is targeted for Spark 3.4 (as per the JIRA), IMHO it would be great if you could spend some time on it. cc @HyukjinKwon (since the JIRA was created by @HyukjinKwon)
@dongjoon-hyun, I have incorporated all the review comments; please take a look.
@HyukjinKwon @Ngone51 @dongjoon-hyun
@HyukjinKwon, can you please review this or get it reviewed? This is an important JIRA (opened by you) and it also fixes parity with YARN.
@Yikun, can you please review this PR?
@Dooyoung-Hwang, please let me know if this feature is not required; I can close the PR. However, IMHO this is a parity-fix feature with respect to YARN.
holdenk
left a comment
LGTM, if no one else has any concerns I'll try and merge it this week or next.
Thanks @holdenk for the LGTM. Gentle reminder to please merge the PR.
Merged, thanks for the reminder.
Thanks @holdenk for your help.
Thanks for making the PR :D
…s under the current working directory on the driver in K8S cluster mode

### What changes were proposed in this pull request?
This PR will place spark.files, spark.jars and spark.pyFiles in the current working directory on the driver in K8s cluster mode.

### Why are the changes needed?
This mimics the behaviour of YARN and also helps users access files from the PWD. As mentioned in the JIRA, by doing this, users can, for example, leverage PEX to manage Python dependencies in Apache Spark:

```
pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested via unit test cases and also ran on a local K8s cluster.

Closes apache#37417 from pralabhkumar/rk_k8s_local_resource.

Authored-by: pralabhkumar <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
@pralabhkumar thanks for your work. I noticed a similar issue when running Spark applications on K8s; it's a helpful feature. However, this PR might introduce some inefficiency: files/jars are downloaded twice when running in k8s-cluster mode.

```scala
if (deployMode == CLIENT) {
  // jars are downloaded once
  localJars = Option(args.jars).map {
    downloadFileList(_, targetDir, sparkConf, hadoopConf)
  }.orNull
  // py files are downloaded once
  localPyFiles = Option(args.pyFiles).map {
    downloadFileList(_, targetDir, sparkConf, hadoopConf)
  }.orNull
  if (isKubernetesClusterModeDriver) {
    def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false): String = {
      ...
    }
    val filesLocalFiles = Option(args.files).map {
      downloadResourcesToCurrentDirectory(_)
    }.orNull
    // jars are downloaded again
    val jarsLocalJars = Option(args.jars).map {
      downloadResourcesToCurrentDirectory(_)
    }.orNull
    val archiveLocalFiles = Option(args.archives).map {
      downloadResourcesToCurrentDirectory(_, true)
    }.orNull
    // py files are downloaded again
    val pyLocalFiles = Option(args.pyFiles).map {
      downloadResourcesToCurrentDirectory(_)
    }.orNull
  }
}
```

Would you mind creating a follow-up PR to address this issue, @pralabhkumar? Also, there's another catch when running Spark on K8s with --files/--archives:

spark/core/src/main/scala/org/apache/spark/SparkContext.scala, lines 440 to 443 in d407a42

And spark/core/src/main/scala/org/apache/spark/SparkContext.scala, lines 524 to 544 in d407a42
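The double-download concern quoted above could be addressed by memoizing fetches per URI. A minimal Python sketch of that idea (hypothetical helper, not Spark's implementation; `download` is an assumed callable):

```python
def make_cached_fetcher(download):
    """Wrap a download function so each URI is fetched at most once.

    `download(uri)` is a hypothetical callable that returns a local path;
    repeated URIs reuse the first result instead of re-downloading.
    """
    cache = {}

    def fetch(uri):
        if uri not in cache:
            cache[uri] = download(uri)
        return cache[uri]

    return fetch
```

With such a wrapper, the K8s-cluster-mode branch could place the already-fetched local copies into the working directory instead of downloading the same URIs a second time.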
Maybe this is not the correct place to ask, but this PR created some issues for our flow. Maybe our flow was not how things were intended to be used, but it goes like this:
…driver in K8S cluster mode

### What changes were proposed in this pull request?
Adding the working directory to the classpath on the driver in K8S cluster mode.

### Why are the changes needed?
After #37417, spark.files and spark.jars are placed in the working directory, but the Spark context classloader cannot access them because they are not on the classpath by default. This PR adds the current working directory to the classpath, so that the spark.files and spark.jars placed in the working directory become accessible to the classloader; for example, a `hive-site.xml` uploaded via `spark.files`.

### Does this PR introduce _any_ user-facing change?
Yes, users no longer need to add the working directory to the Spark classpath manually.

### How was this patch tested?
UT.

Closes #41201 from turboFei/work_dir_classpath.

Authored-by: fwang12 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
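The classpath fix above has a close analogue in Python's import path: a file dropped into the working directory is only loadable if that directory is on the search path. A small sketch of the analogy (hypothetical module name `uploaded_config`, not part of Spark):

```python
import importlib
import os
import sys
import tempfile

def import_from_dir(module_name: str, directory: str):
    """Import a module from a directory, analogous to putting the
    driver's working directory on the classpath: the loader can only
    see the file once its directory is on the search path."""
    sys.path.insert(0, directory)
    try:
        return importlib.import_module(module_name)
    finally:
        # Restore the search path after the import
        sys.path.remove(directory)
```

The JVM equivalent in the PR is that files placed in the working directory (such as `hive-site.xml`) only become visible once the working directory itself is on the driver classpath.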
@pralabhkumar @HyukjinKwon @holdenk Here is the Error Message
What changes were proposed in this pull request?
This PR will place spark.files, spark.jars and spark.pyFiles in the current working directory on the driver in K8s cluster mode.
Why are the changes needed?
This mimics the behaviour of YARN and also helps users access files from the PWD. As mentioned in the JIRA:
By doing this, users can, for example, leverage PEX to manage Python dependencies in Apache Spark:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested via unit test cases and also ran on a local K8s cluster.