SPARK-1230: [WIP] Enable SparkContext.addJars() to load classes not in CLASSPATH #119
```diff
@@ -134,7 +134,7 @@ class SparkContext(
   // driver. Do this before all other initialization so that any thread pools created for this
   // SparkContext uses the class loader.
   // Note that this is config-enabled as classloaders can introduce subtle side effects
-  private[spark] val classLoader = if (conf.getBoolean("spark.driver.add-dynamic-jars", false)) {
+  private[spark] val classLoader = if (conf.getBoolean("spark.driver.loadAddedJars", false)) {
     val loader = new SparkURLClassLoader(Array.empty[URL], this.getClass.getClassLoader)
     Thread.currentThread.setContextClassLoader(loader)
     Some(loader)
```
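For context, here is a minimal standalone sketch of the mechanism this patch enables. A plain `URLClassLoader` subclass with a public `addURL` stands in for `SparkURLClassLoader`; the jar path and class name `Foo` are hypothetical.

```scala
import java.net.{URL, URLClassLoader}

// Stand-in for SparkURLClassLoader: a URLClassLoader whose addURL is public.
class MutableURLClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def addURL(url: URL): Unit = super.addURL(url)
}

object DynamicJarDemo {
  def main(args: Array[String]): Unit = {
    // Install the mutable loader as the context class loader, as the patched
    // SparkContext does when spark.driver.loadAddedJars is true.
    val loader = new MutableURLClassLoader(Array.empty[URL], getClass.getClassLoader)
    Thread.currentThread.setContextClassLoader(loader)

    // addJar() would later funnel the jar's URL into this same loader.
    loader.addURL(new URL("file:///tmp/foo.jar")) // hypothetical jar

    // Classes in the jar are now resolvable through the context class loader,
    // even though they were never on the JVM's launch classpath.
    val clazz = Class.forName("Foo", true, Thread.currentThread.getContextClassLoader)
    println(clazz.getName)
  }
}
```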
**Contributor:** Will this only work if `addJars` is called from the thread that created the SparkContext?
**Author:** This will capture a pointer to the classloader in which the SparkContext was created, so `addJars` can be called from anywhere and it will always augment this class loader. I think this means that the class will be visible to (a) the thread that created the SparkContext and (b) any threads created by that thread. Though it would be good to verify that the context class loader is passed on to child threads, or that they delegate to that of the parent. This does mean that a thread entirely outside of the SparkContext-creating thread and its children won't have the class loaded. I think that's actually desirable, given that you may have a case where multiple SparkContexts are created in the same JVM.
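The inheritance question above is easy to check empirically: `java.lang.Thread` copies the parent's context class loader at construction time, as this small snippet (not from the patch) demonstrates.

```scala
import java.net.{URL, URLClassLoader}

object ContextLoaderInheritanceCheck {
  def main(args: Array[String]): Unit = {
    val custom = new URLClassLoader(Array.empty[URL], getClass.getClassLoader)
    Thread.currentThread.setContextClassLoader(custom)

    val child = new Thread(new Runnable {
      // Prints true: the child thread inherited the parent's context class loader.
      def run(): Unit = println(Thread.currentThread.getContextClassLoader eq custom)
    })
    child.start()
    child.join()
  }
}
```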
**Author:** I'll defer to @velvia on this one though as it's his design.
**Contributor:** Ah, ok, I understand now. In that case, to make things simpler, would it possibly make sense to not load the jars into the current thread and only load them for the SparkContext/executors? Classloader stuff can be confusing to deal with, and keeping it as isolated as possible could make things easier for users. This would also line up a little more with how the MR distributed cache works - jars that get added to it don't become accessible to driver code.
**Author:** Hey Sandy - not sure what you mean exactly by "load them for the SparkContext". The SparkContext is just a Java object. The scenario we want to handle is like this: there are two ways "Foo" can be visible on the last line. Either it can be included in the classpath when launching the JVM, or it can be added dynamically to the classloader of the calling thread. Is there another way?
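The scenario snippet referenced above did not survive the page extraction; the following is a plausible reconstruction of the kind of driver code in question (the jar path and class name `Foo` are illustrative, not from the PR).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AddJarScenario {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("addJar-demo")
      .set("spark.driver.loadAddedJars", "true") // the config this PR introduces

    val sc = new SparkContext(conf)

    // foo.jar was NOT on the classpath when the JVM launched.
    sc.addJar("/tmp/foo.jar") // hypothetical jar containing class Foo

    // Without this patch the lookup below fails with ClassNotFoundException on
    // the driver; with it, addJar augmented the live class loader and it succeeds.
    val fooClass = Class.forName("Foo", true, Thread.currentThread.getContextClassLoader)
    println(fooClass.getName)
  }
}
```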
**Contributor:** I had misunderstood how the original mechanism worked. I take this all back.
```diff
@@ -394,12 +394,13 @@ Apart from these, the following properties are also available, and may be useful
 </td>
 </tr>
 <tr>
-  <td>spark.driver.add-dynamic-jars</td>
+  <td>spark.driver.loadAddedJars</td>
   <td>false</td>
   <td>
-    If true, the SparkContext uses a class loader to make jars added via `addJar` available to the SparkContext.
-    The default behavior is that jars added via `addJar` are only made available to executors, and Spark apps
-    must include all its jars in the application CLASSPATH even if `addJar` is used.
+    If true, the SparkContext uses a class loader to make jars added via `addJar` available to
+    the SparkContext. The default behavior is that jars added via `addJar` are only made
+    available to executors, and Spark apps must include all its jars in the driver's
+    CLASSPATH even if `addJar` is used.
   </td>
 </tr>
 <tr>
```
**Contributor:** Could the second sentence be simplified to "The default behavior is that jars added via `addJar` must already be on the classpath."?

**Author:** Good call.
**Comment:** Maybe you should set its parent to the current thread's context class loader if one exists. Otherwise users who try to add some class loader before starting SparkContext (e.g. if they're in some server environment) will lose it.
**Comment:** Great catch - this is definitely something that needs to change.
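A sketch of that fix against the patched code above (a REPL-style fragment; a plain `URLClassLoader` approximates the in-PR `SparkURLClassLoader`):

```scala
import java.net.{URL, URLClassLoader}

// Chain to an existing context class loader (e.g. one installed by an app
// server) instead of silently replacing it, falling back to the defining
// class loader when no context class loader is set.
val parent = Option(Thread.currentThread.getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
val loader = new URLClassLoader(Array.empty[URL], parent)
Thread.currentThread.setContextClassLoader(loader)
```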
**Comment:** By the way, I'm pretty sure there is almost no way that Spark contexts can work properly inside a server environment simply by using thread context classloaders. The reason is that Spark spins up so many other threads. To make everything work more easily, I believe we should instead have a standard classloader set in SparkEnv or somewhere like that, which can inherit from the thread context in the thread that started the SparkContext, but which can be used everywhere else that spins up new threads.
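A rough illustration of that suggestion; the names below (`AppClassLoaderHolder`, `appClassLoader`, `newWorkerThread`) are hypothetical, not actual Spark API.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical central holder, in the spirit of the SparkEnv suggestion above.
object AppClassLoaderHolder {
  // Initialized by the thread that creates the SparkContext: inherit from its
  // context class loader, falling back to the defining class loader.
  @volatile var appClassLoader: URLClassLoader = {
    val parent = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    new URLClassLoader(Array.empty[URL], parent)
  }

  // Spark-internal code that spins up threads would stamp them with the
  // shared loader instead of relying on per-thread inheritance.
  def newWorkerThread(body: Runnable): Thread = {
    val t = new Thread(body)
    t.setContextClassLoader(appClassLoader)
    t
  }
}
```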