[SPARK-48431][SQL] Do not forward predicates on collated columns to file readers #46760
Conversation
Reviewed diff line:

    case p if p.references.exists(ref => SchemaUtils.hasNonUTF8BinaryCollation(ref.dataType)) =>
should this be case p if !p.isInstanceOf[expressions.IsNotNull] && !p.isInstanceOf[expressions.IsNull] ...
hmmm, what if it's Not(Not(IsNotNull(...)))? I feel checking references at this layer is risky. Shall we do it in translateLeafNodeFilter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking for the IsNull/IsNotNull types is not enough: it would not cover something like IsNotNull(Min(<collated_col>, x)). So it is important that the IsNull / IsNotNull is wrapped directly around an attribute reference.
We can actually only do this predicate widening at the top level, particularly because of the Nots. If we have, for example, Not(EqualTo(<collated_col>, x)) and do the transformation in translateLeafNodeFilter, we end up with Not(AlwaysTrue), which is incorrect.
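To make the ordering concern concrete, here is a minimal Scala sketch of widening at the top level; `widenCollatedPredicate` and `hasCollatedReference` are illustrative names, not the actual Spark code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Hypothetical top-level rewrite: widen the whole predicate, not its leaves.
def widenCollatedPredicate(
    p: Expression,
    hasCollatedReference: Expression => Boolean): Expression = {
  if (hasCollatedReference(p)) {
    // Returning TRUE can only let extra rows through, which is safe as long
    // as the original predicate is evaluated again by Spark afterwards.
    Literal.TrueLiteral
  } else {
    p
  }
}

// Doing the same rewrite per leaf inside translateLeafNodeFilter would turn
// Not(EqualTo(<collated_col>, x)) into Not(AlwaysTrue), i.e. "match nothing".
```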
Reviewed diff lines:

    // The filter cannot be pushed and we widen it to be AlwaysTrue().
    val filter = translateLeafNodeFilter(Literal.TrueLiteral,
      PushableColumn(nestedPredicatePushdownEnabled))
    if (filter.isDefined && translatedFilterToExpr.isDefined) {
This piece of code is repeated 4 times within translateFilterWithMapping, can we create an inner method to avoid code duplication?
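One possible shape for such an inner helper, sketched from the snippet above; the name, parameter list, and the bookkeeping into `translatedFilterToExpr` are assumptions rather than the merged code, and it presumes being inside DataSourceStrategy where `translateLeafNodeFilter` and `PushableColumn` are in scope:

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.sources

// Hypothetical helper that widens an untranslatable predicate to AlwaysTrue().
def widenToAlwaysTrue(
    predicate: Expression,
    translatedFilterToExpr: Option[mutable.HashMap[sources.Filter, Expression]],
    nestedPredicatePushdownEnabled: Boolean): Option[sources.Filter] = {
  // The filter cannot be pushed and we widen it to be AlwaysTrue().
  val filter = translateLeafNodeFilter(
    Literal.TrueLiteral, PushableColumn(nestedPredicatePushdownEnabled))
  if (filter.isDefined && translatedFilterToExpr.isDefined) {
    // Keep the mapping back to the original predicate so that Spark can still
    // evaluate it after the scan.
    translatedFilterToExpr.get.update(filter.get, predicate)
  }
  filter
}
```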
thanks, merging to master!
Follow-up #47059 (commit message):

What changes were proposed in this pull request?
Adding new classes of filters which are collation aware.

Why are the changes needed?
#46760 added the logic of predicate widening for collated column references, but this completely changes the filters, and if the original expression did not get evaluated by Spark later we could end up with wrong results. Also, data sources would never be able to actually support these filters; they would just see them as AlwaysTrue.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
New UTs.

Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47059 from stefankandic/fixPredicateWidening.
Authored-by: Stefan Kandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
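Purely as an illustration of the collation-aware filter idea in that follow-up (all names here are hypothetical; `SourceFilter` stands in for org.apache.spark.sql.sources.Filter and `CollationAwareEqualTo` is not the actual class added by #47059):

```scala
// Hypothetical sketch of a filter that carries its collation, so that a data
// source can decide whether it can evaluate it instead of silently receiving
// AlwaysTrue; none of these names come from the Spark codebase.
sealed trait SourceFilter {
  def references: Array[String]
}

case class CollationAwareEqualTo(
    attribute: String,
    value: Any,
    collation: String) extends SourceFilter {
  override def references: Array[String] = Array(attribute)
}
```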
What changes were proposed in this pull request?
SPARK-47657 allows pushing filters on collated columns to file sources that support it. If such filters are pushed to a file source, that file source must not forward them to the actual file readers (i.e. Parquet or CSV readers), because there is no guarantee that those readers support collations.
In this PR we widen filters on collated columns to AlwaysTrue when translating filters for file sources.
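As a hedged illustration of the intended effect (a simplification, assuming `c` is a string column with a non-UTF8_BINARY collation and `d` is a regular INT column; the filter classes are the real org.apache.spark.sql.sources ones):

```scala
import org.apache.spark.sql.sources.{AlwaysTrue, EqualTo, Filter}

// Catalyst predicate            ->  filter handed to the file source
// c = 'abc'   (collated column) ->  AlwaysTrue()      // widened, nothing collation-sensitive reaches the reader
// d = 1       (regular column)  ->  EqualTo("d", 1)   // translated as before
val widened: Filter = AlwaysTrue()
val untouched: Filter = EqualTo("d", 1)
```

The widening is only safe as long as Spark re-evaluates the original collated predicate on the rows returned by the source, which is the concern the follow-up #47059 addresses.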
Why are the changes needed?
Without this, no file source can safely implement filter pushdown for collated columns.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests. No component tests are possible because there is no file source with filter pushdown yet.
Was this patch authored or co-authored using generative AI tooling?
No