[SPARK-48431][SQL] Do not forward predicates on collated columns to file readers #46760
Conversation
Reviewed diff line:

    case p if p.references.exists(ref => SchemaUtils.hasNonUTF8BinaryCollation(ref.dataType)) =>
should this be case p if !p.isInstanceOf[expressions.IsNotNull] && !p.isInstanceOf[expressions.IsNull] ...
hmmm, what if it's Not(Not(IsNotNull(...)))? I feel checking references at this layer is risky. Shall we do it in translateLeafNodeFilter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking for the IsNull/IsNotNull types is not enough: it would not cover something like IsNotNull(Min(<collated_col>, x)). So it is important that the IsNull / IsNotNull is wrapped directly around an attribute reference.
We can actually only do this predicate widening at the top level, particularly because of the Nots. If we have, for example, Not(EqualTo(<collated_col>, x)) and do the transformation in translateLeafNodeFilter, we end up with Not(AlwaysTrue), which is incorrect.
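To make the ordering concern concrete, here is a minimal Scala sketch of widening at the top level; `widenCollatedPredicate` and `hasCollatedReference` are illustrative names, not the actual Spark code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Hypothetical top-level rewrite: widen the whole predicate, not its leaves.
def widenCollatedPredicate(
    p: Expression,
    hasCollatedReference: Expression => Boolean): Expression = {
  if (hasCollatedReference(p)) {
    // Returning TRUE can only let extra rows through, which is safe as long
    // as the original predicate is evaluated again by Spark afterwards.
    Literal.TrueLiteral
  } else {
    p
  }
}

// Doing the same rewrite per leaf inside translateLeafNodeFilter would turn
// Not(EqualTo(<collated_col>, x)) into Not(AlwaysTrue), i.e. "match nothing".
```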
Reviewed diff lines:

    // The filter cannot be pushed and we widen it to be AlwaysTrue().
    val filter = translateLeafNodeFilter(Literal.TrueLiteral,
      PushableColumn(nestedPredicatePushdownEnabled))
    if (filter.isDefined && translatedFilterToExpr.isDefined) {
This piece of code is repeated 4 times within translateFilterWithMapping, can we create an inner method to avoid code duplication?
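One possible shape for such an inner helper, sketched from the snippet above; the name, parameter list, and the bookkeeping into `translatedFilterToExpr` are assumptions rather than the merged code, and it presumes being inside DataSourceStrategy where `translateLeafNodeFilter` and `PushableColumn` are in scope:

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.sources

// Hypothetical helper that widens an untranslatable predicate to AlwaysTrue().
def widenToAlwaysTrue(
    predicate: Expression,
    translatedFilterToExpr: Option[mutable.HashMap[sources.Filter, Expression]],
    nestedPredicatePushdownEnabled: Boolean): Option[sources.Filter] = {
  // The filter cannot be pushed and we widen it to be AlwaysTrue().
  val filter = translateLeafNodeFilter(
    Literal.TrueLiteral, PushableColumn(nestedPredicatePushdownEnabled))
  if (filter.isDefined && translatedFilterToExpr.isDefined) {
    // Keep the mapping back to the original predicate so that Spark can still
    // evaluate it after the scan.
    translatedFilterToExpr.get.update(filter.get, predicate)
  }
  filter
}
```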
thanks, merging to master!
Follow-up #47059 (commit message):

What changes were proposed in this pull request?
Adding new classes of filters which are collation aware.

Why are the changes needed?
#46760 added the logic of predicate widening for collated column references, but this completely changes the filters, and if the original expression did not get evaluated by Spark later we could end up with wrong results. Also, data sources would never be able to actually support these filters; they would just see them as AlwaysTrue.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
New UTs.

Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47059 from stefankandic/fixPredicateWidening.
Authored-by: Stefan Kandic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
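Purely as an illustration of the collation-aware filter idea in that follow-up (all names here are hypothetical; `SourceFilter` stands in for org.apache.spark.sql.sources.Filter and `CollationAwareEqualTo` is not the actual class added by #47059):

```scala
// Hypothetical sketch of a filter that carries its collation, so that a data
// source can decide whether it can evaluate it instead of silently receiving
// AlwaysTrue; none of these names come from the Spark codebase.
sealed trait SourceFilter {
  def references: Array[String]
}

case class CollationAwareEqualTo(
    attribute: String,
    value: Any,
    collation: String) extends SourceFilter {
  override def references: Array[String] = Array(attribute)
}
```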
What changes were proposed in this pull request?
SPARK-47657 allows pushing filters on collated columns to file sources that support it. If such filters are pushed to a file source, that file source must not forward them to the actual file readers (i.e. Parquet or CSV readers), because there is no guarantee that those readers support collations.
In this PR we widen filters on collated columns to AlwaysTrue when translating filters for file sources.
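As a hedged illustration of the intended effect (a simplification, assuming `c` is a string column with a non-UTF8_BINARY collation and `d` is a regular INT column; the filter classes are the real org.apache.spark.sql.sources ones):

```scala
import org.apache.spark.sql.sources.{AlwaysTrue, EqualTo, Filter}

// Catalyst predicate            ->  filter handed to the file source
// c = 'abc'   (collated column) ->  AlwaysTrue()      // widened, nothing collation-sensitive reaches the reader
// d = 1       (regular column)  ->  EqualTo("d", 1)   // translated as before
val widened: Filter = AlwaysTrue()
val untouched: Filter = EqualTo("d", 1)
```

The widening is only safe as long as Spark re-evaluates the original collated predicate on the rows returned by the source, which is the concern the follow-up #47059 addresses.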
Why are the changes needed?
Without this, no file source can safely implement filter pushdown for collated columns.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests. No component tests are possible because there is no file source with filter pushdown yet.
Was this patch authored or co-authored using generative AI tooling?
No