[SPARK-26339][SQL] Throws better exception when reading files that start with underscore #23288
Conversation
Can one of the admins verify this patch?
```
  // Don't need to check once again if files exist in streaming mode
- if (checkFilesExist && !fs.exists(globPath.head)) {
+ if (checkFilesExist &&
+     (!fs.exists(globPath.head) || InMemoryFileIndex.shouldFilterOut(globPath.head.getName))) {
```
I'm probably misunderstanding, but doesn't this still cause it to throw a 'Path does not exist' exception?
InMemoryFileIndex.shouldFilterOut returns true if its argument starts with an underscore, so a 'Path does not exist' exception is thrown. I've checked, and the exception below was thrown:
```
org.apache.spark.sql.AnalysisException: Path does not exist: file:_test.csv;
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:558)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:244)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
```
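For context, a minimal sketch of the rule `InMemoryFileIndex.shouldFilterOut` applies (simplified; the actual Spark method also handles dot-files and special-cases metadata files):

```
// Simplified sketch, not the exact Spark source: names starting with
// "_" or "." are treated as hidden and filtered out of file listings.
def shouldFilterOut(pathName: String): Boolean =
  pathName.startsWith("_") || pathName.startsWith(".")
```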
I see, I didn't read carefully. This is the new desired behavior. I agree it would be better to not end up with an odd CSV parsing message. I wonder if we can clarify the message further with a different exception for the new case. The path does exist; it's just ignored.
```
if (checkFilesExist) {
  val firstPath = globPath.head
  if (!fs.exists(firstPath)) {
    // ... Path does not exist
  } else if (shouldFilterOut...) {
    // ... Path exists but is ignored
  }
}
```
Thank you for understanding my proposal.
Your suggestion looks better; I'll push later.
@srowen I pushed now. Could you please check my commit?
```
if (!fs.exists(firstPath)) {
  throw new AnalysisException(s"Path does not exist: ${firstPath}")
} else if (InMemoryFileIndex.shouldFilterOut(firstPath.getName)) {
  throw new AnalysisException(s"Path exists but is ignored: ${firstPath}")
```
One thing I'm not sure about, though: it's going to throw an exception for, for instance, `spark.read.text("_text.txt").show()` instead of returning an empty DataFrame, which is kind of a behaviour change.
Also, it looks like it's not going to check children.
```
- if (checkFilesExist && !fs.exists(globPath.head)) {
-   throw new AnalysisException(s"Path does not exist: ${globPath.head}")
+ if (checkFilesExist) {
+   val firstPath = globPath.head
```
Also, does it make sense to check only the first file? It looks like multiple files could be detected.
I think one thing we could consider is to leave a trace or debug log showing which files are ignored when the files are listed.
srowen left a comment:
Yeah, we may need a different approach here. Really we want to check all the files in the glob, and maybe log a debug statement about ones that are ignored, and throw an exception only if all of the files are filtered out. Otherwise this fails if the first file happens to have an underscore, but others don't, and there is a convention that _files are just ignored in Hadoop/Spark/etc.
That would cover the original case where the user specified one file that is ignored.
I don't even know if it's a behavior change in many cases, because several things will fail later anyway (like CSV schema inference here) if there are no files. If the user specifies a path that has only underscore files, I could see the argument that it should just produce an empty dataframe, but that could be surprising as well if there are data files there and it's just happening because of the underscores.
If someone felt strongly about not changing the behavior (elsewhere), then we could just log a warning instead, when all files are filtered out.
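A minimal sketch of that approach (identifiers such as `allPaths` are illustrative; the actual patch further down is authoritative):

```
// Split the globbed paths into ignored (hidden) and kept ones.
val (ignored, kept) = allPaths.partition(p => InMemoryFileIndex.shouldFilterOut(p.getName))
if (kept.isEmpty && ignored.nonEmpty) {
  // Everything matched was hidden: fail loudly instead of parsing nothing.
  throw new AnalysisException(s"All paths were ignored:\n  ${ignored.mkString("\n  ")}")
} else if (ignored.nonEmpty) {
  // Only some matches were hidden: proceed, but leave a debug trace.
  logDebug(s"The following paths were ignored:\n  ${ignored.mkString("\n  ")}")
}
```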
Good catch, could you please add test cases that throw this exception for a single file and for multiple files?

I agree with srowen's idea.

I think we should implement something along the lines of my comment above: #23288 (review)

Yup. I think so too.
I implemented a new behavior and added tests. In … If so, maybe we should implement a new method in … And please note that I considered this for not only …
Yea, listing files is itself non-trivial in some cases (in particular when you use, for instance, S3), and extra listing should be avoided. It looks like the change is a bit invasive and does another file listing. Can we simply do the check when the existing file listing happens, rather than doing another file listing?
Thank you for your reply. Let me make sure I understand what you mean. For the example below, when …

I noticed the behavior below, for your information: foo/ …
I don't think any of these methods, like CSV, recurse into subdirectories; you can supply globs that specify files across subdirs. I don't think the path matters here, just the filenames that match. That's what needs to be checked after listing all matching files.
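For example, a glob such as the one below (paths are hypothetical) matches files across subdirectories without recursive listing; the underscore check then applies to each matched filename:

```
// Hypothetical layout: the glob expands to concrete files under each
// date directory, and each matched filename is checked individually.
spark.read.csv("/data/2018-*/part-*.csv")
```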
I get the point. I'll push later.

I pushed now.
```
}

test("SPARK-26339 Debug statement if some of specified paths are filtered out") {
  class TestAppender extends AppenderSkeleton {
```
I wouldn't bother with this complexity to test if the debug log was printed; it's not important compared to the additional binding to log4j.
I fixed it.
```
try {
  val cars = spark
    .read
    .format("csv")
```
Nit: you can use `.csv` instead of `format` and `load`.
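For example, the two forms are equivalent (the path is illustrative):

```
// Longhand:
spark.read.format("csv").load("cars.csv")
// Shortcut:
spark.read.csv("cars.csv")
```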
I fixed it.
```
}.toSeq

if (checkFilesExist) {
  val (filtered, filteredOut) = allGlobPath.partition { path =>
```
Nit: I'd rename `filtered` to `filteredIn` to avoid ambiguity. It might also be very slightly cleaner to avoid the `!` in the expression and flip these two values.
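A sketch of the suggested rename and flip:

```
// Putting the filtered-out side first avoids negating the predicate.
val (filteredOut, filteredIn) = allGlobPath.partition { path =>
  InMemoryFileIndex.shouldFilterOut(path.getName)
}
```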
I fixed it.
```
if (filteredOut.nonEmpty) {
  if (filtered.isEmpty) {
    throw new AnalysisException(
      "All path were ignored. The following path were ignored:\n" +
```
path -> paths. Also, it seems clearer to say: "All paths were ignored:\n" and below, "Some paths were ignored:\n"
I fixed it.
| s"${filteredOut.mkString("\n ")}") | ||
| } else { | ||
| logDebug( | ||
| "The following path were ignored:\n" + |
Nit: for performance, make this one interpolated string. If the line is too long, make the variable `filteredOut` something shorter, like `out`.
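A sketch of the suggestion:

```
// A single interpolated string; `out` is the shorter name for filteredOut.
logDebug(s"The following paths were ignored:\n  ${out.mkString("\n  ")}")
```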
I fixed it.
```
  globPath
}.toSeq

if (checkFilesExist) {
```
Do you need to remove the check and exception a few lines above, then? It would fail if any path didn't have some files. (Also, feel free to fix the indentation on lines 549-558 above.)
I am not sure, but the irregular indentation seems to be due to GitHub preview CSS.
No need, I fixed it.
What about removing that check entirely?
I have already removed that check at
239cfa4#diff-7a6cb188d2ae31eb3347b5629a679cecR563
Or are you referring to `checkFilesExist` at line 557 and suggesting removing the `checkFilesExist` argument?
Yes, I mean line 557. I guess we can keep that because, overall, we are trying to throw AnalysisException in more cases, not fewer. Before, if one of several glob paths matched no files at all (underscore or not), it would throw. OK, that behavior we can keep, I guess; or at least that's a separate question.
Disregard this; I think it is OK.
Merged to master

@srowen, this didn't run the test! It looks like some tests are broken. Reverting this.

@KeiichiHirobe, mind opening a PR again, please? I also missed that the test didn't actually run. It looks like the current change breaks another regression test. Can you take a look and fix it together?
Ack, darn, thank you. I was looking at a bunch of open PRs and probably looked at the wrong one to see if tests had run. |
`checkFilesExist` was avoided at 239cfa4 after discussing apache#23288 (comment). But that change turned out to be wrong, because we should not check when the argument checkFilesExist is false. Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555 failed when we avoided the check, but now passes after this commit.

@HyukjinKwon @srowen
[SPARK-26339][SQL] Throws better exception when reading files that start with underscore

## What changes were proposed in this pull request?

My pull request #23288 was resolved and merged to master, but it turned out later that my change breaks another regression test. Because we cannot reopen a pull request, I created a new pull request here. Commit 92934b4 is the only change after pull request #23288.

`checkFilesExist` was avoided at 239cfa4 after discussing #23288 (comment). But that change turned out to be wrong, because we should not check when the argument checkFilesExist is false. Test https://github.com/apache/spark/blob/27e42c1de502da80fa3e22bb69de47fb00158174/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L2555 failed when we avoided the check, but now passes after commit 92934b4.

## How was this patch tested?

Both of the tests below passed.

```
testOnly org.apache.spark.sql.execution.datasources.csv.CSVSuite
testOnly org.apache.spark.sql.SQLQuerySuite
```

Closes #23446 from KeiichiHirobe/SPARK-26339.

Authored-by: Hirobe Keiichi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
[SPARK-26339][SQL] Throws better exception when reading files that start with underscore
## What changes were proposed in this pull request?
As described in SPARK-26339, spark.read behavior is very confusing when reading files that start with an underscore; fix this by throwing an exception whose message is "Path does not exist".
## How was this patch tested?
Manual tests.
Both of the code snippets below throw an exception whose message is "Path does not exist".
```
spark.read.csv("/home/forcia/work/spark/_test.csv")
spark.read.schema("test STRING, number INT").csv("/home/forcia/work/spark/_test.csv")
```
Closes apache#23288 from KeiichiHirobe/SPARK-26339.
Authored-by: Hirobe Keiichi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>