Conversation

@KeiichiHirobe KeiichiHirobe commented Dec 11, 2018

What changes were proposed in this pull request?

As described in SPARK-26339, spark.read behaves very confusingly when reading files that start with an underscore. This fixes that by throwing an exception whose message is "Path does not exist".

How was this patch tested?

Manual tests.
Both of the snippets below now throw an exception whose message is "Path does not exist".

spark.read.csv("/home/forcia/work/spark/_test.csv")
spark.read.schema("test STRING, number INT").csv("/home/forcia/work/spark/_test.csv")

@AmplabJenkins

Can one of the admins verify this patch?

@KeiichiHirobe KeiichiHirobe changed the title Throws better exception when reading files that start with underscore [SPARK-26339]Throws better exception when reading files that start with underscore Dec 11, 2018
@KeiichiHirobe KeiichiHirobe changed the title [SPARK-26339]Throws better exception when reading files that start with underscore [SPARK-26339][SQL]Throws better exception when reading files that start with underscore Dec 11, 2018
  // Don't need to check once again if files exist in streaming mode
- if (checkFilesExist && !fs.exists(globPath.head)) {
+ if (checkFilesExist &&
+     (!fs.exists(globPath.head) || InMemoryFileIndex.shouldFilterOut(globPath.head.getName))) {
Member

I'm probably misunderstanding, but doesn't this still cause it to throw a 'Path does not exist' exception?

Author

InMemoryFileIndex.shouldFilterOut returns true when the argument starts with an underscore, so a 'Path does not exist' exception is thrown. I've checked, and the exception below was thrown.

org.apache.spark.sql.AnalysisException: Path does not exist: file:_test.csv;
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:558)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:244)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)

Member

I see, I didn't read carefully. This is the new desired behavior. I agree it would be better to not end up with an odd CSV parsing message. I wonder if we can clarify the message further with a different exception for the new case. The path does exist; it's just ignored.

if (checkFilesExist) {
  val firstPath = globPath.head
  if  (!fs.exists(firstPath)) {
    // ... Path does not exist
  } else if (shouldFilterOut...) {
    // ... Path exists but is ignored
  }
}

Author

Thank you for understanding my proposal.
Your suggestion looks better; I'll push it later.

Author

@srowen I've pushed now. Could you please check my commit?

if (!fs.exists(firstPath)) {
  throw new AnalysisException(s"Path does not exist: ${firstPath}")
} else if (InMemoryFileIndex.shouldFilterOut(firstPath.getName)) {
  throw new AnalysisException(s"Path exists but is ignored: ${firstPath}")
@HyukjinKwon HyukjinKwon (Member) Dec 14, 2018

One thing I'm not sure about, though: it's going to throw an exception for, for instance,

spark.read.text("_text.txt").show()

instead of returning an empty DataFrame, which is kind of a behaviour change.

Member

Also, it looks like this is not going to check children.

- if (checkFilesExist && !fs.exists(globPath.head)) {
-   throw new AnalysisException(s"Path does not exist: ${globPath.head}")
+ if (checkFilesExist) {
+   val firstPath = globPath.head
Member

Also, does it make sense to check only the first file? It looks like multiple files could be detected.

@HyukjinKwon
Member

I think one thing we could consider is leaving a trace or debug log showing which files are ignored when the files are listed.

@srowen srowen (Member) left a comment

Yeah, we may need a different approach here. Really we want to check all the files in the glob, and maybe log a debug statement about ones that are ignored, and throw an exception only if all of the files are filtered out. Otherwise this fails if the first file happens to have an underscore, but others don't, and there is a convention that _files are just ignored in Hadoop/Spark/etc.

That would cover the original case where the user specified one file that is ignored.

I don't even know if it's a behavior change in many cases, because several things will fail later anyway (like CSV schema inference here) if there are no files. If the user specifies a path that has only underscore files, I could see the argument that it should just produce an empty dataframe, but, that could be surprising as well if there are data files there and it's just happening because of the underscores.

If someone felt strongly about not changing the behavior (elsewhere), then we could just log a warning instead, when all files are filtered out.
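
For reference, a minimal sketch of that approach (not the final patch; it assumes the expanded glob results are already collected in a sequence named allGlobPath, that it runs inside DataSource where logDebug is available, and that InMemoryFileIndex.shouldFilterOut is the only filtering rule that matters):

```scala
// Sketch: partition all matched paths instead of inspecting only the head.
val (filteredIn, filteredOut) = allGlobPath.partition { path =>
  !InMemoryFileIndex.shouldFilterOut(path.getName)
}
if (filteredOut.nonEmpty) {
  if (filteredIn.isEmpty) {
    // Every matched file follows the "_" ignore convention, so nothing would be read.
    throw new AnalysisException(
      s"All paths were ignored:\n  ${filteredOut.mkString("\n  ")}")
  } else {
    // Some files are ignored but real data files remain; only leave a debug trace.
    logDebug(s"Some paths were ignored:\n  ${filteredOut.mkString("\n  ")}")
  }
}
```

This keeps the Hadoop convention of silently skipping underscore files as long as at least one real data file remains, and only surfaces an error when the user points Spark at nothing but ignored files.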

@kiszk
Member

kiszk commented Dec 15, 2018

Good catch, could you please add test cases that throw this exception for a file and multiple files?
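
Perhaps something along these lines (just a sketch in the CSVSuite style; it assumes the suite's withTempDir helper, the usual java.io/java.nio imports, and the final wording of the exception message, none of which are settled in this PR yet):

```scala
test("SPARK-26339 read files starting with underscore") {
  withTempDir { dir =>
    // Hypothetical fixtures: data files whose names start with an underscore.
    val f1 = new File(dir, "_cars.csv")
    Files.write(f1.toPath, "2012,Tesla,S".getBytes(StandardCharsets.UTF_8))
    val f2 = new File(dir, "_cars2.csv")
    Files.write(f2.toPath, "1997,Ford,E350".getBytes(StandardCharsets.UTF_8))

    // A single ignored file ...
    val e1 = intercept[AnalysisException] {
      spark.read.csv(f1.getCanonicalPath)
    }
    assert(e1.getMessage.contains("All paths were ignored"))

    // ... and multiple ignored files should both raise the new exception.
    val e2 = intercept[AnalysisException] {
      spark.read.csv(f1.getCanonicalPath, f2.getCanonicalPath)
    }
    assert(e2.getMessage.contains("All paths were ignored"))
  }
}
```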

@KeiichiHirobe
Author

I agree with srowen's idea.
In most cases, I think the behaviour change causes no problems.
So, may I go ahead and implement the new behavior, or should I wait for a moment?

@srowen
Member

srowen commented Dec 18, 2018

I think we should implement something along the lines of my comment above: #23288 (review)

@HyukjinKwon
Member

Yup. I think so too.

Hirobe Keiichi added 3 commits December 25, 2018 20:55
@KeiichiHirobe
Author

I implemented the new behavior and added tests.
Please check it!

In checkAndGlobPathIfNecessary, I am calling InMemoryFileIndex.bulkListLeafFiles.
I am not familiar with the cost of calling InMemoryFileIndex.bulkListLeafFiles; is this expensive?

If so, maybe we should implement a new method in object InMemoryFileIndex that returns information about the filtered-out files/directories and whether any non-filtered (regular) files exist.

And please note that I considered not only InMemoryFileIndex.shouldFilterOut but also PathFilter.accept.

@HyukjinKwon
Member

Yea, listing files itself is non-trivial in some cases (in particular when you use, for instance, S3), and an extra listing should be avoided. The change looks a bit invasive, and it does another file listing.

Can we simply do the check when the existing file listing happens, rather than doing another file listing?

@KeiichiHirobe
Author

KeiichiHirobe commented Dec 28, 2018

Thank you for your reply.

Let me make sure I understand what you mean:
I am not certain whether I should list the filtered-out files/dirs recursively, or only the files/dirs directly under the specified path.

For the example below, when calling spark.read.csv("foo"), which of the following should we log at debug level?
Currently we log No. 1.

  1. _a.csv, _b.csv and _bar
  2. _a.csv and _bar
  3. _a.csv

For your information, I noticed the following behavior:
bulkListLeafFiles lists _a.csv and _b.csv (listed recursively), whereas
spark.read.csv("foo").show reads only a.csv (only files directly under the specified path).

foo/
 ├ a.csv
 ├ _a.csv
 ├ bar/
 │ ├ b.csv
 │ └ _b.csv
 ├ _bar/
 │ ├ c.csv
 │ ├ _c.csv

@srowen
Member

srowen commented Dec 28, 2018

I don't think any of these methods like CSV recurse into subdirectories; you can supply globs that specify files across subdirs. I don't think the path matters here, just the filenames that match. That's what needs to be checked after listing all matching files.

@KeiichiHirobe
Author

I get the point.
Looking at #23288 (comment),
I had misunderstood that we should check recursively when a directory is specified.

I'll push later.

@KeiichiHirobe
Author

I've pushed now.
Could you please check my commit?

}

test("SPARK-26339 Debug statement if some of specified paths are filtered out") {
class TestAppender extends AppenderSkeleton {
Member

I wouldn't bother with this complexity to test if the debug log was printed; it's not important compared to the additional binding to log4j.

Author

I fixed it.

try {
val cars = spark
.read
.format("csv")
Member

Nit: you can use .csv instead of format and load
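
That is, something like spark.read.csv(path) rather than spark.read.format("csv").load(path).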

Author

I fixed it.

}.toSeq

if (checkFilesExist) {
val (filtered, filteredOut) = allGlobPath.partition { path =>
Member

Nit: I'd call filtered as filteredIn to avoid ambiguity. It might also be very slightly cleaner to avoid the ! in the expression and flip these two values.

Author

I fixed it.

if (filteredOut.nonEmpty) {
if (filtered.isEmpty) {
throw new AnalysisException(
"All path were ignored. The following path were ignored:\n" +
Member

path -> paths. Also, it seems clearer to say: "All paths were ignored:\n" and below, "Some paths were ignored:\n"

Author

I fixed it.

s"${filteredOut.mkString("\n ")}")
} else {
logDebug(
"The following path were ignored:\n" +
Member

Nit: for performance, make this one interpolated string. If the line is too long, make the variable filteredOut something shorter, like out.

Author

I fixed it.

globPath
}.toSeq

if (checkFilesExist) {
Member

Do you need to remove the check and exception a few lines above then? It would fail if any path didn't have some files. (Also feel free to fix the indentation from line 549-558 above)

Author

I am not sure, but the irregular indentation seems to be due to GitHub's preview CSS.

Author

No need, I fixed it.

Member

What about removing that check entirely?

Author

I have already removed that check at
239cfa4#diff-7a6cb188d2ae31eb3347b5629a679cecR563

Or are you referring to checkFilesExist at line 557 and suggesting removing the checkFilesExist argument?

Member

Yes, I mean line 557. I guess we can keep that because, overall, we are trying to throw AnalysisException in more cases, not fewer. Before, if one of several glob paths matched no files at all (underscore or not), it would throw. That behavior we can keep, I guess, or at least it's a separate question.

Disregard this; I think it is OK.

@srowen
Member

srowen commented Dec 31, 2018

Merged to master

@srowen srowen closed this in c0b9db1 Dec 31, 2018
@HyukjinKwon
Member

HyukjinKwon commented Jan 1, 2019

@srowen, this didn't run the tests! It looks like some tests are broken:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100599/testReport/org.apache.spark.sql/SQLQuerySuite/SPARK_19059__read_file_based_table_whose_name_starts_with_underscore/

java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: All paths were ignored: file:/home/jenkins/workspace/SparkPullRequestBuilder/sql/core/spark-warehouse/_tbl;

Reverting this.

@HyukjinKwon
Member

@KeiichiHirobe, mind opening a PR again, please? I also missed that the test didn't actually run. It looks like the current change breaks another regression test. Can you take a look and fix it as well?

@srowen
Member

srowen commented Jan 1, 2019

Ack, darn, thank you. I was looking at a bunch of open PRs and probably looked at the wrong one to see if tests had run.

KeiichiHirobe pushed a commit to KeiichiHirobe/spark that referenced this pull request Jan 4, 2019
The checkFilesExist check was removed at 239cfa4 after discussing apache#23288 (comment).
But that change turned out to be wrong, because the check should be skipped only when the argument checkFilesExist is false.
Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555
failed when we removed the check, but now succeeds after this commit.
@KeiichiHirobe
Author

@HyukjinKwon @srowen
According to this, we cannot reopen a pull request, so I created a new pull request, #23446. Could you please review my commit at #23446?

srowen pushed a commit that referenced this pull request Jan 6, 2019
…art with underscore

## What changes were proposed in this pull request?
My pull request #23288 was resolved and merged to master, but it turned out later that my change broke another regression test. Because we cannot reopen a pull request, I have created a new pull request here.
Commit 92934b4 is the only change after pull request #23288.
The `checkFilesExist` check was removed at 239cfa4 after discussing #23288 (comment).
But that change turned out to be wrong, because the check should be skipped only when the argument `checkFilesExist` is false.

Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555
failed when we removed the check, but now succeeds after commit 92934b4.

## How was this patch tested?
Both of the tests below passed.
```
testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite
```

Closes #23446 from KeiichiHirobe/SPARK-26339.

Authored-by: Hirobe Keiichi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019