2 changes: 2 additions & 0 deletions R/pkg/R/context.R
@@ -305,6 +305,8 @@ setCheckpointDirSC <- function(sc, dirName) {
#' Currently directories are only supported for Hadoop-supported filesystems.
#' Refer Hadoop-supported filesystems at \url{https://wiki.apache.org/hadoop/HCFS}.
#'
#' Note: A path can be added only once. Subsequent additions of the same path are ignored.
#'
#' @rdname spark.addFile
#' @param path The path of the file to be added
#' @param recursive Whether to add files recursively from the path. Default is FALSE.
12 changes: 12 additions & 0 deletions core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -1496,6 +1496,8 @@ class SparkContext(config: SparkConf) extends Logging {
* @param path can be either a local file, a file in HDFS (or other Hadoop-supported
* filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
* use `SparkFiles.get(fileName)` to find its download location.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addFile(path: String): Unit = {
addFile(path, false)
@@ -1516,6 +1518,8 @@
* use `SparkFiles.get(fileName)` to find its download location.
* @param recursive if true, a directory can be given in `path`. Currently directories are
* only supported for Hadoop-supported filesystems.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addFile(path: String, recursive: Boolean): Unit = {
val uri = new Path(path).toUri
@@ -1555,6 +1559,9 @@
Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
env.securityManager, hadoopConfiguration, timestamp, useCache = false)
postEnvironmentUpdate()
} else {
logWarning(s"The path $path has been added already. Overwriting of added paths " +
"is not supported in the current version.")
}
}

Review thread on the new warning:

Member: eh, @MaxGekk, how about we just give warnings without notes for now?

Member: The notes are also reasonable to me. This is a common user error.

Member: I was wondering how common it is for a user to add the same jar expecting it to overwrite, since we mostly treat those cases as immutable resources.

Member Author: @HyukjinKwon Our support receives a few "bug" reports per month. For now we can at least provide a link to the note. The warning itself is needed so our support engineers can detect this kind of problem in the logs of already finished jobs. Customers do not actually say in their bug reports that files/jars weren't overwritten (that would be easier). They report problems such as a call to a library method crashing due to an incompatible method signature, a class that doesn't exist, or an incorrect final result of a Spark job because a config/resource file added via addFile() is not up to date. Now I can detect this situation from the logs and provide a link to the docs for addFile()/addJar().

Member: I mean, I'm happy with the warning message but less sure about the note. I'm okay.

Member: Shall we leave a JIRA link as a comment, for example SPARK-16787 and/or SPARK-19417?

Member: We normally do not post the JIRA number in the message.

Member: I meant a comment.

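A minimal usage sketch of the behavior the new note documents (the path and app name below are invented for illustration and are not part of this patch): the second addFile call with the same path is a no-op that, with this change, only emits the new warning.

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileTwiceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("addFile-demo").setMaster("local[*]"))

    // First call: the file is registered and fetched under SparkFiles.getRootDirectory().
    sc.addFile("/tmp/app.conf")

    // Second call with the same path: ignored, and with this patch it logs
    // "The path /tmp/app.conf has been added already. Overwriting of added
    // paths is not supported in the current version."
    sc.addFile("/tmp/app.conf")

    // Tasks still resolve the copy fetched by the first call.
    println(SparkFiles.get("app.conf"))

    sc.stop()
  }
}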
@@ -1803,6 +1810,8 @@ class SparkContext(config: SparkConf) extends Logging {
*
* @param path can be either a local file, a file in HDFS (or other Hadoop-supported filesystems),
* an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addJar(path: String) {
def addJarFile(file: File): String = {
@@ -1849,6 +1858,9 @@
if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
logInfo(s"Added JAR $path at $key with timestamp $timestamp")
postEnvironmentUpdate()
} else {
logWarning(s"The jar $path has been added already. Overwriting of added jars " +
"is not supported in the current version.")
}
}
}

Review comment on the new warning:

Member: This could be confused with what spark.files.overwrite means.
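A matching sketch for jars, again with an invented path, showing the stale-jar scenario discussed in the review thread above: rebuilding a jar at the same path and re-adding it does not ship the new bytes to executors.

import org.apache.spark.{SparkConf, SparkContext}

object AddJarTwiceDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("addJar-demo").setMaster("local[*]"))

    // First call: the jar is registered with a timestamp and logged at INFO
    // as "Added JAR ...".
    sc.addJar("/tmp/udfs.jar")

    // Re-adding the same path later (e.g. after rebuilding the jar) is
    // ignored; with this patch Spark warns "The jar /tmp/udfs.jar has been
    // added already. Overwriting of added jars is not supported in the
    // current version."
    sc.addJar("/tmp/udfs.jar")

    sc.stop()
  }
}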
core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
@@ -668,6 +668,8 @@ class JavaSparkContext(val sc: SparkContext)
* The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
* filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
* use `SparkFiles.get(fileName)` to find its download location.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addFile(path: String) {
sc.addFile(path)
@@ -681,6 +683,8 @@
*
* A directory can be given if the recursive option is set to true. Currently directories are only
* supported for Hadoop-supported filesystems.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addFile(path: String, recursive: Boolean): Unit = {
sc.addFile(path, recursive)
@@ -690,6 +694,8 @@
* Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
* The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
* filesystems), or an HTTP, HTTPS or FTP URI.
*
* @note A path can be added only once. Subsequent additions of the same path are ignored.
*/
def addJar(path: String) {
sc.addJar(path)
4 changes: 4 additions & 0 deletions python/pyspark/context.py
@@ -847,6 +847,8 @@ def addFile(self, path, recursive=False):
A directory can be given if the recursive option is set to True.
Currently directories are only supported for Hadoop-supported filesystems.

.. note:: A path can be added only once. Subsequent additions of the same path are ignored.

>>> from pyspark import SparkFiles
>>> path = os.path.join(tempdir, "test.txt")
>>> with open(path, "w") as testFile:
@@ -867,6 +869,8 @@ def addPyFile(self, path):
SparkContext in the future. The C{path} passed can be either a local
file, a file in HDFS (or other Hadoop-supported filesystems), or an
HTTP, HTTPS or FTP URI.

.. note:: A path can be added only once. Subsequent additions of the same path are ignored.
"""
self.addFile(path)
(dirname, filename) = os.path.split(path) # dirname may be directory or HDFS/S3 prefix