Skip to content

Conversation

@KeiichiHirobe
Copy link

@KeiichiHirobe KeiichiHirobe commented Jan 4, 2019

What changes were proposed in this pull request?

My pull request #23288 was resolved and merged to master, but it turned out later that my change breaks another regression test. Because we cannot reopen pull request, I create a new pull request here.
Commit 92934b4 is only change after pull request #23288.
CheckFileExist was avoided at 239cfa4 after discussing #23288 (comment).
But, that change turned out to be wrong because we should not check if argument checkFileExist is false.

Test

test("SPARK-19059: read file based table whose name starts with underscore") {

failed when we avoided checkFileExist, but now successed after commit 92934b4 .

How was this patch tested?

Both of below tests were passed.

testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

Let's see the test failure. cc @srowen as well.

@HyukjinKwon
Copy link
Member

Can you fix the PR title to [SPARK-26339][SQL]Throws better exception when reading files that start with underscore?

@KeiichiHirobe KeiichiHirobe changed the title Spark 26339 [SPARK-26339][SQL]Throws better exception when reading files that start with underscore Jan 4, 2019
@SparkQA
Copy link

SparkQA commented Jan 4, 2019

Test build #100722 has finished for PR 23446 at commit 92934b4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jan 4, 2019

Test build #100728 has finished for PR 23446 at commit 92934b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@KeiichiHirobe can you describe why this PR was reverted and what's changed more in this PR?

@KeiichiHirobe
Copy link
Author

@HyukjinKwon
Could you refer to commit message of 92934b4?
This commit is only change after pull request #23288.

@HyukjinKwon
Copy link
Member

Yes, I meant leave some comments about that commit, why it was reverted, and how this PR fixed that in the PR description.

@KeiichiHirobe
Copy link
Author

@HyukjinKwon
I got it!
I updated PR description.
Could you please check that change?

@srowen
Copy link
Member

srowen commented Jan 6, 2019

Merged to master

@srowen srowen closed this in 9d8e9b3 Jan 6, 2019
}
if (filteredOut.nonEmpty) {
if (filteredIn.isEmpty) {
throw new AnalysisException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid this could break the existing applications. Currently, when users specify the schema, no exception is thrown, right?

cc @HyukjinKwon @srowen @MaxGekk @cloud-fan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it was discussed:

#23288 (comment)
#23288 (review)

I don't have a strong opinion on this. If you think it should be considered as a behaviour change, yea, no objection from me. We can turn it to warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please submit a follow-up PR to change it to a warning? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me make a followup by the end of today (singapore time)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a fair point. It might not have thrown an exception later if it didn't try to infer schema.

asfgit pushed a commit that referenced this pull request Jan 7, 2019
…ception for underscore files

## What changes were proposed in this pull request?

The PR #23446 happened to introduce a behaviour change - empty dataframes can't be read anymore from underscore files. It looks controversial to allow or disallow this case so this PR targets to fix to issue warning instead of throwing an exception to be more conservative.

**Before**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored:
file:/.../_tmp
  file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:651)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:635)
  ... 49 elided

scala> spark.read.text("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored:
file:/.../_tmp
  file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:723)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:695)
  ... 49 elided
```

**After**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
19/01/07 15:14:43 WARN DataSource: All paths were ignored:
  file:/.../_tmp
  file:/.../_tmp1
+---+
|  a|
+---+
+---+

scala> spark.read.text("_tmp*").show()
19/01/07 15:14:51 WARN DataSource: All paths were ignored:
  file:/.../_tmp
  file:/.../_tmp1
+-----+
|value|
+-----+
+-----+
```

## How was this patch tested?

Manually tested as above.

Closes #23481 from HyukjinKwon/SPARK-26339.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…art with underscore

## What changes were proposed in this pull request?
My pull request apache#23288 was resolved and merged to master, but it turned out  later that my change breaks another regression test. Because we cannot reopen pull request, I create a new pull request here.
Commit 92934b4 is only change after pull request apache#23288.
`CheckFileExist` was avoided at 239cfa4 after discussing apache#23288 (comment).
But, that change turned out to be wrong because we should not check if argument checkFileExist is false.

Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555
failed when we avoided checkFileExist, but now successed after commit 92934b4 .

## How was this patch tested?
Both of below tests were passed.
```
testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite
```

Closes apache#23446 from KeiichiHirobe/SPARK-26339.

Authored-by: Hirobe Keiichi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ception for underscore files

## What changes were proposed in this pull request?

The PR apache#23446 happened to introduce a behaviour change - empty dataframes can't be read anymore from underscore files. It looks controversial to allow or disallow this case so this PR targets to fix to issue warning instead of throwing an exception to be more conservative.

**Before**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored:
file:/.../_tmp
  file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:651)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:635)
  ... 49 elided

scala> spark.read.text("_tmp*").show()
org.apache.spark.sql.AnalysisException: All paths were ignored:
file:/.../_tmp
  file:/.../_tmp1;
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:570)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:723)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:695)
  ... 49 elided
```

**After**

```scala
scala> spark.read.schema("a int").parquet("_tmp*").show()
19/01/07 15:14:43 WARN DataSource: All paths were ignored:
  file:/.../_tmp
  file:/.../_tmp1
+---+
|  a|
+---+
+---+

scala> spark.read.text("_tmp*").show()
19/01/07 15:14:51 WARN DataSource: All paths were ignored:
  file:/.../_tmp
  file:/.../_tmp1
+-----+
|value|
+-----+
+-----+
```

## How was this patch tested?

Manually tested as above.

Closes apache#23481 from HyukjinKwon/SPARK-26339.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants