Skip to content

Conversation

@olaky
Copy link
Contributor

@olaky olaky commented May 27, 2024

What changes were proposed in this pull request?

SPARK-47657 allows to push filters on collated columns to file sources that support it. If such filters are pushed to file sources, those file sources must not push those filters to the actual file readers (i.e. parquet or csv readers), because there is no guarantee that those support collations.

In this PR we are widening filters on collations to be AlwaysTrue when we translate filters for file sources.

Why are the changes needed?

Without this, no file source can implement filter pushdown

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests. No component tests are possible because there is no file source with filter pushdown yet.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 27, 2024
@olaky
Copy link
Contributor Author

olaky commented May 27, 2024

cc @stefankandic @cloud-fan

@HyukjinKwon HyukjinKwon changed the title [SPARK-48431] Do not forward predicates on collated columns to file readers [SPARK-48431][SQL] Do not forward predicates on collated columns to file readers May 27, 2024
}
filter

case p if p.references.exists(ref => SchemaUtils.hasNonUTF8BinaryCollation(ref.dataType)) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be case p if !p.isInstanceOf[expressions.IsNotNull] && !p.isInstanceOf[expressions.IsNull] ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, what if it's Not(Not(IsNotNull(...)))? I feel checking references at this layer is risky. Shall we do it in translateLeafNodeFilter?

Copy link
Contributor Author

@olaky olaky May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking for the IsNull/IsNotNull types is not enough. It would not cover IsNotNull(Min(<collated_col>, x)). So it is important that IsNotNull / IsNotNull is directly wrapped around an attribute reference.
We can actually only do this predicate widening at the top level, particularly because of the Nots. If we for example have Not(EqualTo(<collated_col>, x)) we would end up with Not(AlwaysTrue) if we do the transformation in translateLeafNodeFilter, and that is incorrect

@olaky olaky requested a review from cloud-fan May 28, 2024 11:15
// The filter cannot be pushed and we widen it to be AlwaysTrue().
val filter = translateLeafNodeFilter(Literal.TrueLiteral,
PushableColumn(nestedPredicatePushdownEnabled))
if (filter.isDefined && translatedFilterToExpr.isDefined) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This piece of code is repeated 4 times within translateFilterWithMapping, can we create an inner method to avoid code duplication?

@olaky olaky requested a review from cloud-fan May 29, 2024 06:47
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a3b8420 May 29, 2024
cloud-fan pushed a commit that referenced this pull request Jul 1, 2024
### What changes were proposed in this pull request?

Adding a new classes of filters which are collation aware.

### Why are the changes needed?

#46760 Added the logic of predicate widening for collated column references, but this would completely change the filters and if the original expression did not get evaluated by spark later we could end up with wrong results. Also, data sources would never be able to actually support these filters and they would just see them as AlwaysTrue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47059 from stefankandic/fixPredicateWidening.

Authored-by: Stefan Kandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants