Conversation

@KeiichiHirobe KeiichiHirobe commented Dec 11, 2018

What changes were proposed in this pull request?

As described in SPARK-26339, spark.read behaves very confusingly when reading files that start with an underscore. This fixes that by throwing an exception whose message is "Path does not exist".

How was this patch tested?

Manual tests.
Both of the snippets below now throw an exception whose message is "Path does not exist".

spark.read.csv("/home/forcia/work/spark/_test.csv")
spark.read.schema("test STRING, number INT").csv("/home/forcia/work/spark/_test.csv")

@AmplabJenkins

Can one of the admins verify this patch?

@KeiichiHirobe KeiichiHirobe changed the title Throws better exception when reading files that start with underscore [SPARK-26339]Throws better exception when reading files that start with underscore Dec 11, 2018
@KeiichiHirobe KeiichiHirobe changed the title [SPARK-26339]Throws better exception when reading files that start with underscore [SPARK-26339][SQL]Throws better exception when reading files that start with underscore Dec 11, 2018
  // Don't need to check once again if files exist in streaming mode
- if (checkFilesExist && !fs.exists(globPath.head)) {
+ if (checkFilesExist &&
+     (!fs.exists(globPath.head) || InMemoryFileIndex.shouldFilterOut(globPath.head.getName))) {
Member

I'm probably misunderstanding, but doesn't this still cause it to throw a 'Path does not exist' exception?

Author

InMemoryFileIndex.shouldFilterOut returns true when the argument starts with an underscore, so a 'Path does not exist' exception is thrown. I've checked, and the exception below was thrown.

org.apache.spark.sql.AnalysisException: Path does not exist: file:_test.csv;
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:558)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:244)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)

Member

I see, I didn't read carefully. This is the new desired behavior. I agree it would be better to not end up with an odd CSV parsing message. I wonder if we can clarify the message further with a different exception for the new case. The path does exist; it's just ignored.

if (checkFilesExist) {
  val firstPath = globPath.head
  if  (!fs.exists(firstPath)) {
    // ... Path does not exist
  } else if (shouldFilterOut...) {
    // ... Path exists but is ignored
  }
}

Author

Thank you for understanding my proposal.
Your suggestion looks better; I'll push it later.

Author

@srowen I've pushed now. Could you please check my commit?

if (!fs.exists(firstPath)) {
  throw new AnalysisException(s"Path does not exist: ${firstPath}")
} else if (InMemoryFileIndex.shouldFilterOut(firstPath.getName)) {
  throw new AnalysisException(s"Path exists but is ignored: ${firstPath}")
@HyukjinKwon HyukjinKwon (Member) Dec 14, 2018

One thing I'm not sure about, though: it's going to throw an exception for, for instance,

spark.read.text("_text.txt").show()

instead of returning an empty DataFrame, which is kind of a behaviour change.

Member

Also, it looks like this is not going to check children.

- if (checkFilesExist && !fs.exists(globPath.head)) {
-   throw new AnalysisException(s"Path does not exist: ${globPath.head}")
+ if (checkFilesExist) {
+   val firstPath = globPath.head
Member

Also, does it make sense to check only the first file? It looks like multiple files could be detected.

@HyukjinKwon
Member

I think one thing we could consider is leaving a trace or debug log showing which files are ignored when the files are listed.

@srowen srowen (Member) left a comment

Yeah, we may need a different approach here. Really we want to check all the files in the glob, and maybe log a debug statement about ones that are ignored, and throw an exception only if all of the files are filtered out. Otherwise this fails if the first file happens to have an underscore, but others don't, and there is a convention that _files are just ignored in Hadoop/Spark/etc.

That would cover the original case where the user specified one file that is ignored.

I don't even know if it's a behavior change in many cases, because several things will fail later anyway (like CSV schema inference here) if there are no files. If the user specifies a path that has only underscore files, I could see the argument that it should just produce an empty dataframe, but, that could be surprising as well if there are data files there and it's just happening because of the underscores.

If someone felt strongly about not changing the behavior (elsewhere), then we could just log a warning instead, when all files are filtered out.
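
For reference, a minimal sketch of that approach (not the final patch; it assumes the expanded glob results are already collected in a sequence named allGlobPath, that it runs inside DataSource where logDebug is available, and that InMemoryFileIndex.shouldFilterOut is the only filtering rule that matters):

```scala
// Sketch: partition all matched paths instead of inspecting only the head.
val (filteredIn, filteredOut) = allGlobPath.partition { path =>
  !InMemoryFileIndex.shouldFilterOut(path.getName)
}
if (filteredOut.nonEmpty) {
  if (filteredIn.isEmpty) {
    // Every matched file follows the "_" ignore convention, so nothing would be read.
    throw new AnalysisException(
      s"All paths were ignored:\n  ${filteredOut.mkString("\n  ")}")
  } else {
    // Some files are ignored but real data files remain; only leave a debug trace.
    logDebug(s"Some paths were ignored:\n  ${filteredOut.mkString("\n  ")}")
  }
}
```

This keeps the Hadoop convention of silently skipping underscore files as long as at least one real data file remains, and only surfaces an error when the user points Spark at nothing but ignored files.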

@kiszk
Member

kiszk commented Dec 15, 2018

Good catch, could you please add test cases that throw this exception for a file and multiple files?
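
Perhaps something along these lines (just a sketch in the CSVSuite style; it assumes the suite's withTempDir helper, the usual java.io/java.nio imports, and the final wording of the exception message, none of which are settled in this PR yet):

```scala
test("SPARK-26339 read files starting with underscore") {
  withTempDir { dir =>
    // Hypothetical fixtures: data files whose names start with an underscore.
    val f1 = new File(dir, "_cars.csv")
    Files.write(f1.toPath, "2012,Tesla,S".getBytes(StandardCharsets.UTF_8))
    val f2 = new File(dir, "_cars2.csv")
    Files.write(f2.toPath, "1997,Ford,E350".getBytes(StandardCharsets.UTF_8))

    // A single ignored file ...
    val e1 = intercept[AnalysisException] {
      spark.read.csv(f1.getCanonicalPath)
    }
    assert(e1.getMessage.contains("All paths were ignored"))

    // ... and multiple ignored files should both raise the new exception.
    val e2 = intercept[AnalysisException] {
      spark.read.csv(f1.getCanonicalPath, f2.getCanonicalPath)
    }
    assert(e2.getMessage.contains("All paths were ignored"))
  }
}
```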

@KeiichiHirobe
Author

I agree with srowen's idea.
In most cases, I think the behaviour change causes no problems.
So, may I go ahead and implement the new behavior, or should I wait for a moment?

@srowen
Member

srowen commented Dec 18, 2018

I think we should implement something along the lines of my comment above: #23288 (review)

@HyukjinKwon
Member

Yup. I think so too.

Hirobe Keiichi added 3 commits December 25, 2018 20:55
@KeiichiHirobe
Author

I implemented the new behavior and added tests.
Please check it!

In checkAndGlobPathIfNecessary, I am calling InMemoryFileIndex.bulkListLeafFiles.
I am not familiar with the cost of calling InMemoryFileIndex.bulkListLeafFiles; is this expensive?

If so, maybe we should implement a new method in object InMemoryFileIndex that returns information about the filtered-out files/directories and whether any non-filtered (regular) files exist.

And please note that I considered not only InMemoryFileIndex.shouldFilterOut but also PathFilter.accept.

@HyukjinKwon
Member

Yea, listing files itself is non-trivial in some cases (in particular when you use, for instance, S3), and an extra listing should be avoided. The change looks a bit invasive, and it does another file listing.

Can we simply do the check when the existing file listing happens, rather than doing another file listing?

@KeiichiHirobe
Author

KeiichiHirobe commented Dec 28, 2018

Thank you for your reply.

Let me make sure I understand what you mean:
I am not certain whether I should list the filtered-out files/dirs recursively, or only the files/dirs directly under the specified path.

For the example below, when calling spark.read.csv("foo"), which of the following should we log at debug level?
Currently we log No. 1.

  1. _a.csv, _b.csv and _bar
  2. _a.csv and _bar
  3. _a.csv

For your information, I noticed the following behavior:
bulkListLeafFiles lists _a.csv and _b.csv (listed recursively), whereas
spark.read.csv("foo").show reads only a.csv (only files directly under the specified path).

foo/
 ├ a.csv
 ├ _a.csv
 ├ bar/
 │ ├ b.csv
 │ └ _b.csv
 ├ _bar/
 │ ├ c.csv
 │ ├ _c.csv

@srowen
Member

srowen commented Dec 28, 2018

I don't think any of these methods like CSV recurse into subdirectories; you can supply globs that specify files across subdirs. I don't think the path matters here, just the filenames that match. That's what needs to be checked after listing all matching files.

@KeiichiHirobe
Author

I get the point.
Looking at #23288 (comment),
I had misunderstood that we should check recursively when a directory is specified.

I'll push later.

@KeiichiHirobe
Author

I've pushed now.
Could you please check my commit?

}

test("SPARK-26339 Debug statement if some of specified paths are filtered out") {
class TestAppender extends AppenderSkeleton {
Member

I wouldn't bother with this complexity to test if the debug log was printed; it's not important compared to the additional binding to log4j.

Author

I fixed it.

try {
val cars = spark
.read
.format("csv")
Member

Nit: you can use .csv instead of format and load
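
That is, something like spark.read.csv(path) rather than spark.read.format("csv").load(path).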

Author

I fixed it.

}.toSeq

if (checkFilesExist) {
val (filtered, filteredOut) = allGlobPath.partition { path =>
Member

Nit: I'd call filtered as filteredIn to avoid ambiguity. It might also be very slightly cleaner to avoid the ! in the expression and flip these two values.

Author

I fixed it.

if (filteredOut.nonEmpty) {
if (filtered.isEmpty) {
throw new AnalysisException(
"All path were ignored. The following path were ignored:\n" +
Member

path -> paths. Also, it seems clearer to say: "All paths were ignored:\n" and below, "Some paths were ignored:\n"

Author

I fixed it.

s"${filteredOut.mkString("\n ")}")
} else {
logDebug(
"The following path were ignored:\n" +
Member

Nit: for performance, make this one interpolated string. If the line is too long, make the variable filteredOut something shorter, like out.

Author

I fixed it.

globPath
}.toSeq

if (checkFilesExist) {
Member

Do you need to remove the check and exception a few lines above then? It would fail if any path didn't have some files. (Also feel free to fix the indentation from line 549-558 above)

Author

I am not sure, but the irregular indentation seems to be due to GitHub's preview CSS.

Author

No need, I fixed it.

Member

What about removing that check entirely?

Author

I have already removed that check at
239cfa4#diff-7a6cb188d2ae31eb3347b5629a679cecR563

Or are you referring to checkFilesExist at line 557 and suggesting removing the checkFilesExist argument?

Member

Yes, I mean line 557. I guess we can keep that because, overall, we are trying to throw AnalysisException in more cases, not fewer. Before, if one of several glob paths matched no files at all (underscore or not), it would throw. That behavior we can keep, I guess, or at least it's a separate question.

Disregard this; I think it is OK.

@srowen
Member

srowen commented Dec 31, 2018

Merged to master

@srowen srowen closed this in c0b9db1 Dec 31, 2018
@HyukjinKwon
Member

HyukjinKwon commented Jan 1, 2019

@srowen, this didn't run the tests! It looks like some tests are broken:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100599/testReport/org.apache.spark.sql/SQLQuerySuite/SPARK_19059__read_file_based_table_whose_name_starts_with_underscore/

java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: All paths were ignored: file:/home/jenkins/workspace/SparkPullRequestBuilder/sql/core/spark-warehouse/_tbl;

Reverting this.

@HyukjinKwon
Member

@KeiichiHirobe, mind opening a PR again, please? I also missed that the test didn't actually run. It looks like the current change breaks another regression test. Can you take a look and fix it as well?

@srowen
Member

srowen commented Jan 1, 2019

Ack, darn, thank you. I was looking at a bunch of open PRs and probably looked at the wrong one to see if tests had run.

KeiichiHirobe pushed a commit to KeiichiHirobe/spark that referenced this pull request Jan 4, 2019
The checkFilesExist check was removed at 239cfa4 after discussing apache#23288 (comment).
But that change turned out to be wrong, because the check should be skipped only when the argument checkFilesExist is false.
Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555
failed when we removed the check, but now succeeds after this commit.
@KeiichiHirobe
Author

@HyukjinKwon @srowen
According to this, we cannot reopen a pull request, so I created a new pull request, #23446. Could you please review my commit at #23446?

srowen pushed a commit that referenced this pull request Jan 6, 2019
…art with underscore

## What changes were proposed in this pull request?
My pull request #23288 was resolved and merged to master, but it turned out later that my change broke another regression test. Because we cannot reopen a pull request, I have created a new pull request here.
Commit 92934b4 is the only change after pull request #23288.
The `checkFilesExist` check was removed at 239cfa4 after discussing #23288 (comment).
But that change turned out to be wrong, because the check should be skipped only when the argument `checkFilesExist` is false.

Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555
failed when we removed the check, but now succeeds after commit 92934b4.

## How was this patch tested?
Both of the tests below passed.
```
testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite
```

Closes #23446 from KeiichiHirobe/SPARK-26339.

Authored-by: Hirobe Keiichi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019