[SPARK-26339][SQL]Throws better exception when reading files that start with underscore #23446

KeiichiHirobe · 2019-01-04T05:47:25Z

What changes were proposed in this pull request?

My pull request #23288 was merged to master, but it later turned out that the change broke another regression test. Because that pull request cannot be reopened, I am creating a new one here.
Commit 92934b4 is the only change since pull request #23288.
Use of checkFilesExist was avoided in 239cfa4 after the discussion in #23288 (comment).
However, that change turned out to be wrong: the new check must not run when the argument checkFilesExist is false.
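
The intended control flow can be sketched as follows. This is a minimal, self-contained illustration, not the code in DataSource.scala: `isIgnoredName`, `checkPaths`, and `AllPathsIgnored` are hypothetical names (the latter stands in for `AnalysisException`), and the rule that names starting with `_` or `.` are ignored is a simplifying assumption about Spark's file filter. Only `filteredOut`, `filteredIn`, and the `checkFilesExist` flag come from this PR.

```scala
object UnderscoreCheckSketch {
  // Hypothetical stand-in for org.apache.spark.sql.AnalysisException.
  final case class AllPathsIgnored(paths: Seq[String])
    extends Exception("All paths were ignored:\n" + paths.mkString("\n"))

  // Simplified assumption: a path is ignored when its last segment
  // starts with "_" or ".".
  def isIgnoredName(path: String): Boolean = {
    val name = path.split('/').lastOption.getOrElse("")
    name.startsWith("_") || name.startsWith(".")
  }

  // Run the "all paths ignored" check only when the caller asked for
  // file checks; with checkFilesExist = false the paths pass through.
  def checkPaths(paths: Seq[String], checkFilesExist: Boolean): Seq[String] = {
    if (checkFilesExist) {
      val (filteredOut, filteredIn) = paths.partition(isIgnoredName)
      if (filteredOut.nonEmpty && filteredIn.isEmpty) {
        throw AllPathsIgnored(filteredOut)
      }
    }
    paths
  }
}
```

Under these assumptions, `checkPaths(Seq("/data/_tmp"), checkFilesExist = false)` returns its input unchanged, while the same call with `checkFilesExist = true` throws.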

Test

spark/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Line 2555 in 27e42c1

test("SPARK-19059: read file based table whose name starts with underscore") {

failed when we avoided checkFilesExist, but now passes after commit 92934b4.

How was this patch tested?

Both of the tests below passed.

testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite


HyukjinKwon · 2019-01-04T05:58:48Z

ok to test

HyukjinKwon · 2019-01-04T05:59:04Z

Let's see the test failure. cc @srowen as well.

HyukjinKwon · 2019-01-04T05:59:40Z

Can you fix the PR title to [SPARK-26339][SQL]Throws better exception when reading files that start with underscore?

SparkQA · 2019-01-04T08:05:01Z

Test build #100722 has finished for PR 23446 at commit 92934b4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-04T10:04:33Z

retest this please

SparkQA · 2019-01-04T13:48:42Z

Test build #100728 has finished for PR 23446 at commit 92934b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-05T14:04:52Z

@KeiichiHirobe can you describe why this PR was reverted and what else has changed in this PR?

KeiichiHirobe · 2019-01-05T14:37:15Z

@HyukjinKwon
Could you refer to the commit message of 92934b4?
That commit is the only change since pull request #23288.

HyukjinKwon · 2019-01-05T14:43:58Z

Yes, I meant that you should leave some comments in the PR description about that commit, why it was reverted, and how this PR fixes that.

KeiichiHirobe · 2019-01-05T15:03:53Z

@HyukjinKwon
I got it!
I updated the PR description.
Could you please check the change?

srowen · 2019-01-06T14:52:28Z

Merged to master

gatorsmile · 2019-01-07T04:29:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

+      }
+      if (filteredOut.nonEmpty) {
+        if (filteredIn.isEmpty) {
+          throw new AnalysisException(


I am afraid this could break existing applications. Currently, when users specify the schema, no exception is thrown, right?

cc @HyukjinKwon @srowen @MaxGekk @cloud-fan

HyukjinKwon · Jan 7, 2019

Yea, it was discussed:

#23288 (comment)
#23288 (review)

I don't have a strong opinion on this. If you think it should be considered a behaviour change, yea, no objection from me. We can turn it into a warning.

gatorsmile · Jan 7, 2019

Could you please submit a follow-up PR to change it to a warning? Thanks!

HyukjinKwon · Jan 7, 2019

Sure, let me make a follow-up by the end of today (Singapore time).

srowen · Jan 7, 2019

Yeah that's a fair point. It might not have thrown an exception later if it didn't try to infer schema.

…ception for underscore files

## What changes were proposed in this pull request?

The PR #23446 happened to introduce a behaviour change - empty dataframes can no longer be read from underscore files. Whether to allow or disallow this case looks controversial, so this PR changes the check to issue a warning instead of throwing an exception, to be more conservative.

**Before**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored: file:/.../_tmp file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:651)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:635)
  ... 49 elided

scala> spark.read.text("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored: file:/.../_tmp file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:723)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:695)
  ... 49 elided
```

**After**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
19/01/07 15:14:43 WARN DataSource: All paths were ignored: file:/.../_tmp file:/.../_tmp1
+---+
|  a|
+---+
+---+

scala> spark.read.text("_tmp*").show()
19/01/07 15:14:51 WARN DataSource: All paths were ignored: file:/.../_tmp file:/.../_tmp1
+-----+
|value|
+-----+
+-----+
```

## How was this patch tested?

Manually tested as above.

Closes #23481 from HyukjinKwon/SPARK-26339.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
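
The switch from an exception to a warning might look roughly like the sketch below. It is an illustration under stated assumptions, not the follow-up patch itself: `isIgnoredName` and `warningFor` are hypothetical helpers, the "starts with `_` or `.`" rule is a simplification of Spark's file filter, and the warning is returned as a value here so the example stays self-contained, where real code would log it.

```scala
object UnderscoreWarnSketch {
  // Simplified assumption: a path is ignored when its last segment
  // starts with "_" or ".".
  def isIgnoredName(path: String): Boolean = {
    val name = path.split('/').lastOption.getOrElse("")
    name.startsWith("_") || name.startsWith(".")
  }

  // Hypothetical helper: produce the warning text when every path is
  // filtered out, and nothing otherwise. No exception is thrown, so a
  // read with a user-supplied schema can still yield an empty result.
  def warningFor(paths: Seq[String]): Option[String] = {
    val (filteredOut, filteredIn) = paths.partition(isIgnoredName)
    if (filteredOut.nonEmpty && filteredIn.isEmpty) {
      Some("All paths were ignored:\n" + filteredOut.mkString("\n"))
    } else {
      None
    }
  }
}
```

With these definitions, `warningFor(Seq("_tmp", "_tmp1"))` yields a message and `warningFor(Seq("_tmp", "part-0"))` yields `None`, which mirrors the case where at least one path survives filtering.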


Hirobe Keiichi added 11 commits December 11, 2018 22:34

Throws better exception when reading files that start with underscore

2910cb9

Clarify the message further with a different exception for file which is ignored

08850ae

Revert "Clarify the message further with a different exception for file which is ignored"

777b4db

This reverts commit 08850ae.

Revert "Throws better exception when reading files that start with underscore"

c0a57d9

This reverts commit 2910cb9.

Log a debug statement about files/directories that are ignored, and throw an exception only if all of the files are filtered out

1b64ffb

Change to check only filename match, not check dir recursively

a95637e

Avoid checkFilesExist check

239cfa4

Minor modifications for clarity

e72bf00

Remove debugStatement test

abfe2e6

Fix title of test case

3708ef1

KeiichiHirobe mentioned this pull request Jan 4, 2019

[SPARK-26339][SQL]Throws better exception when reading files that start with underscore #23288

Closed

KeiichiHirobe changed the title ~~Spark 26339~~ [SPARK-26339][SQL]Throws better exception when reading files that start with underscore Jan 4, 2019

srowen approved these changes Jan 4, 2019

HyukjinKwon approved these changes Jan 5, 2019

srowen closed this in 9d8e9b3 Jan 6, 2019

gatorsmile reviewed Jan 7, 2019

HyukjinKwon mentioned this pull request Jan 7, 2019

[SPARK-26339][SQL][FOLLOW-UP] Issue warning instead of throwing an exception for underscore files #23481

Closed
