Skip to content

Conversation

@kasjain
Copy link
Contributor

@kasjain kasjain commented Apr 13, 2016

What changes were proposed in this pull request?

These changes fix the below broken functionality and does a small performance improvement.

  1. Reading the CSV table (created through CTAS query) doesn't work when the pathFilter is provided.
    A bug in HadoopFileReader. Currently, when the pathFilter is provided, it passes the list of filtered files to the "setInputPaths" which wrongly sets the string of incorrectly escaped comma separated files in an array-sequence of size one. This should have been a sequence of size equal to the number of files obtained after filtering. Hence the exception mentioned in the bug.
    ------ FileInputFormat.setInputPaths(jobConf, Seq[Path](new Path%28path%29): _*)

  2. Secondly, in this flow, filtering is triggered twice for each file. Once in hadoopTableReader.applyFilterIfApplicable and then again in FileInputFormat.singleThreadedListStatus. This is costly and redundant.

To solve both the issues above, we can just pass the directory path itself even when the pathFilter is enabled.

How was this patch tested?

Integration tests, manual tests

…k when

pathFilter is enabled.
1) A bug in HadoopFileReader. Resolved by passing the directory instead
of a list of files in case of pathFilter also, since it gets triggerred in
FileInputFormat. This also saves multiple filterings in the codePath.
2) Not using the applyFilterIfApplicable
@andrewor14
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Apr 13, 2016

Test build #55741 has finished for PR 12356 at commit 48a598e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Copy link
Contributor

@liancheng @yhuai?

@saucam
Copy link

saucam commented Apr 14, 2016

I think we can eliminate applyFilterIfNeeded method as well.

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55796 has finished for PR 12356 at commit 6bd529c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kasjain
Copy link
Contributor Author

kasjain commented Apr 18, 2016

Can any of the admin verify the above fix?

@kasjain
Copy link
Contributor Author

kasjain commented Apr 22, 2016

Resolved the merge conflicts for easy merging

@SparkQA
Copy link

SparkQA commented Apr 22, 2016

Test build #56660 has finished for PR 12356 at commit ca9a160.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 30, 2016

cc @marmbrus

@kasjain
Copy link
Contributor Author

kasjain commented May 2, 2016

Resolved the merge conflicts for easy merging

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57504 has finished for PR 12356 at commit 5ba453b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Is it possible to write unit tests for this?

@marmbrus
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented May 23, 2016

Test build #59146 has finished for PR 12356 at commit 5ba453b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kasjain
Copy link
Contributor Author

kasjain commented May 24, 2016

Sure. Let me add the CTAS query in the test suite

@gatorsmile
Copy link
Member

@kasjain Could you add a test case? Does it still fail in the latest master?

@gatorsmile
Copy link
Member

We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks!

@asfgit asfgit closed this in b32bd00 Jun 27, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances with apache#18017

I believe the author in apache#14807 removed his account.

Closes apache#7075
Closes apache#8927
Closes apache#9202
Closes apache#9366
Closes apache#10861
Closes apache#11420
Closes apache#12356
Closes apache#13028
Closes apache#13506
Closes apache#14191
Closes apache#14198
Closes apache#14330
Closes apache#14807
Closes apache#15839
Closes apache#16225
Closes apache#16685
Closes apache#16692
Closes apache#16995
Closes apache#17181
Closes apache#17211
Closes apache#17235
Closes apache#17237
Closes apache#17248
Closes apache#17341
Closes apache#17708
Closes apache#17716
Closes apache#17721
Closes apache#17937

Added:
Closes apache#14739
Closes apache#17139
Closes apache#17445
Closes apache#18042
Closes apache#18359

Added:
Closes apache#16450
Closes apache#16525
Closes apache#17738

Added:
Closes apache#16458
Closes apache#16508
Closes apache#17714

Added:
Closes apache#17830
Closes apache#14742

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18417 from HyukjinKwon/close-stale-pr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants