-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14557][SQL] Reading textfile (created though CTAS) doesn't work #12356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…k when pathFilter is enabled. 1) A bug in HadoopFileReader. Resolved by passing the directory instead of a list of files in case of pathFilter also, since it gets triggerred in FileInputFormat. This also saves multiple filterings in the codePath. 2) Not using the applyFilterIfApplicable
|
ok to test |
|
Test build #55741 has finished for PR 12356 at commit
|
|
I think we can eliminate applyFilterIfNeeded method as well. |
|
Test build #55796 has finished for PR 12356 at commit
|
|
Can any of the admin verify the above fix? |
|
Resolved the merge conflicts for easy merging |
|
Test build #56660 has finished for PR 12356 at commit
|
|
cc @marmbrus |
|
Resolved the merge conflicts for easy merging |
|
Test build #57504 has finished for PR 12356 at commit
|
|
Is it possible to write unit tests for this? |
|
ok to test |
|
Test build #59146 has finished for PR 12356 at commit
|
|
Sure. Let me add the CTAS query in the test suite |
|
@kasjain Could you add a test case? Does it still fail in the latest master? |
|
We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks! |
## What changes were proposed in this pull request? This PR proposes to close stale PRs, mostly the same instances with apache#18017 I believe the author in apache#14807 removed his account. Closes apache#7075 Closes apache#8927 Closes apache#9202 Closes apache#9366 Closes apache#10861 Closes apache#11420 Closes apache#12356 Closes apache#13028 Closes apache#13506 Closes apache#14191 Closes apache#14198 Closes apache#14330 Closes apache#14807 Closes apache#15839 Closes apache#16225 Closes apache#16685 Closes apache#16692 Closes apache#16995 Closes apache#17181 Closes apache#17211 Closes apache#17235 Closes apache#17237 Closes apache#17248 Closes apache#17341 Closes apache#17708 Closes apache#17716 Closes apache#17721 Closes apache#17937 Added: Closes apache#14739 Closes apache#17139 Closes apache#17445 Closes apache#18042 Closes apache#18359 Added: Closes apache#16450 Closes apache#16525 Closes apache#17738 Added: Closes apache#16458 Closes apache#16508 Closes apache#17714 Added: Closes apache#17830 Closes apache#14742 ## How was this patch tested? N/A Author: hyukjinkwon <[email protected]> Closes apache#18417 from HyukjinKwon/close-stale-pr.
What changes were proposed in this pull request?
These changes fix the below broken functionality and does a small performance improvement.
Reading the CSV table (created through CTAS query) doesn't work when the pathFilter is provided.
A bug in HadoopFileReader. Currently, when the pathFilter is provided, it passes the list of filtered files to the "setInputPaths" which wrongly sets the string of incorrectly escaped comma separated files in an array-sequence of size one. This should have been a sequence of size equal to the number of files obtained after filtering. Hence the exception mentioned in the bug.
------ FileInputFormat.setInputPaths(jobConf, Seq[Path](new Path%28path%29): _*)
Secondly, in this flow, filtering is triggered twice for each file. Once in hadoopTableReader.applyFilterIfApplicable and then again in FileInputFormat.singleThreadedListStatus. This is costly and redundant.
To solve both the issues above, we can just pass the directory path itself even when the pathFilter is enabled.
How was this patch tested?
Integration tests, manual tests