[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled FS cache #18623

andrey-tpt · 2017-07-13T11:53:37Z

What changes were proposed in this pull request?

The SparkHadoopUtil.globPath method uses the wrong configuration to obtain its org.apache.hadoop.fs.FileSystem instance.

By accident, this works in the default setup, for two reasons:

1. The filesystem cache is enabled by default for all filesystems derived from org.apache.hadoop.fs.FileSystem.
2. The configuration is not considered when a FileSystem instance is retrieved from the cache; it is not part of the cache key.

Therefore, the incorrect configuration in SparkHadoopUtil.globPath is effectively ignored, and a previously initialized FileSystem instance, created with the correct configuration, is returned.

However, if filesystem caching is disabled (non-default behavior), the incorrect configuration in SparkHadoopUtil.globPath is passed to org.apache.hadoop.fs.FileSystem.get(), which creates a new FileSystem instance with that incorrect configuration.
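The two cases above can be illustrated with a small toy model. This is only a sketch: `Conf`, `Fs`, and `ToyFileSystem` are hypothetical stand-ins, not Hadoop classes, and `access.key` is a made-up setting. The one thing taken from Hadoop is the shape of the `fs.<scheme>.impl.disable.cache` switch; the real cache key also includes more than the scheme.

```scala
import scala.collection.mutable

// Toy stand-ins (hypothetical, not Hadoop): Conf plays the role of
// Configuration and Fs the role of FileSystem.
final case class Conf(settings: Map[String, String])
final class Fs(val scheme: String, val conf: Conf)

object ToyFileSystem {
  // Simplified cache keyed by scheme only; the configuration is not part of the key.
  private val cache = mutable.Map.empty[String, Fs]

  def get(scheme: String, conf: Conf): Fs = {
    val cacheDisabled =
      conf.settings.getOrElse(s"fs.$scheme.impl.disable.cache", "false").toBoolean
    if (cacheDisabled) new Fs(scheme, conf) // fresh instance built from the passed conf
    else cache.getOrElseUpdate(scheme, new Fs(scheme, conf)) // conf ignored on a cache hit
  }
}

object ToyDemo {
  def main(args: Array[String]): Unit = {
    val correct = Conf(Map("access.key" -> "secret")) // hypothetical credential setting
    val stale = Conf(Map.empty)                       // lacks the credential

    // Cache enabled: the stale conf is ignored and the first instance comes back.
    val first = ToyFileSystem.get("s3a", correct)
    val second = ToyFileSystem.get("s3a", stale)
    assert(second eq first)

    // Cache disabled: a new instance is built from the stale conf.
    val noCache = Conf(stale.settings + ("fs.s3a.impl.disable.cache" -> "true"))
    val third = ToyFileSystem.get("s3a", noCache)
    assert(!third.conf.settings.contains("access.key"))
  }
}
```

With the cache on, the second lookup silently reuses the correctly configured instance; with it off, the caller gets an instance that never saw the credential.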

This change adds two overloaded methods (globPath and globPathIfNecessary) to the SparkHadoopUtil class that accept an up-to-date configuration from the caller. The DataSource class uses them when reading a globbed path into a DataFrame.
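A minimal sketch of what such overloads could look like, assuming hadoop-common is on the classpath. The object name `GlobSketch` and the exact method bodies are illustrative assumptions, not the code in this patch; only the method names and the idea of taking the caller's configuration come from the description above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative sketch only; GlobSketch is a hypothetical holder object.
object GlobSketch {
  // Expands the pattern with the FileSystem resolved from the caller's
  // configuration instead of one held internally.
  def globPath(pattern: Path, hadoopConf: Configuration): Seq[Path] = {
    val fs = pattern.getFileSystem(hadoopConf)
    // globStatus can return null when nothing matches, hence the Option.
    Option(fs.globStatus(pattern))
      .map { statuses =>
        statuses.map(s => s.getPath.makeQualified(fs.getUri, fs.getWorkingDirectory)).toSeq
      }
      .getOrElse(Seq.empty[Path])
  }

  // Only globs when the path contains a glob metacharacter.
  def globPathIfNecessary(pattern: Path, hadoopConf: Configuration): Seq[Path] = {
    val hasGlob = pattern.toString.exists("{}[]*?\\".toSet.contains)
    if (hasGlob) globPath(pattern, hadoopConf) else Seq(pattern)
  }
}
```

A caller that already holds the right configuration would then pass it through, as in `GlobSketch.globPathIfNecessary(qualified, hadoopConf)`.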

How was this patch tested?

./dev/run-tests passed, and the example from SPARK-21374 was tested with patched jars.

@zsxwing @liancheng

zsxwing · 2017-07-19T17:53:06Z

ok to test

SparkQA · 2017-07-19T20:52:43Z

Test build #79768 has finished for PR 18623 at commit 317bd1b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-08-03T23:30:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

        val fs = hdfsPath.getFileSystem(hadoopConf)
        val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-        SparkHadoopUtil.get.globPathIfNecessary(qualified)
+        SparkHadoopUtil.get.globPathIfNecessary(qualified, hadoopConf)


Could you pass FileSystem into globPathIfNecessary?

…sabled FS cache This PR replaces #18623 to do some clean up. Closes #18623 Jenkins Author: Shixiong Zhu <[email protected]> Author: Andrey Taptunov <[email protected]> Closes #18848 from zsxwing/review-pr18623.

[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with di…

317bd1b

…sabled FS cache

zsxwing reviewed Aug 3, 2017

zsxwing mentioned this pull request Aug 4, 2017

[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled FS cache #18848

Closed

asfgit closed this in 6cbd18c Aug 5, 2017

andrey-tpt deleted the globpath2 branch April 7, 2018 19:41
