[SPARK-48697][SQL] Add collation aware string filters #47059

stefankandic · 2024-06-21T16:33:53Z

What changes were proposed in this pull request?

Adds a new class of filters that are collation aware.

Why are the changes needed?

#46760 added predicate widening for collated column references. However, widening replaces the original filter entirely, so if Spark did not evaluate the original expression later, we could end up with wrong results. In addition, data sources would never be able to actually support these filters, because they would only ever see them as AlwaysTrue.
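
To illustrate the concern above, here is a minimal, self-contained sketch. It is a simplified model and not Spark's actual filter classes: the `Filter` hierarchy, the `scan` helper, and the sample rows are all hypothetical, and the `UTF8_LCASE` collation is approximated by lowercasing both sides.

```scala
import java.util.Locale

// Hypothetical, simplified model of data source filters (not Spark's real classes).
object WideningSketch {
  sealed trait Filter
  case object AlwaysTrue extends Filter
  final case class EqualTo(attribute: String, value: String) extends Filter
  // Assumption: a collation-aware filter carries the collation name alongside the value.
  final case class CollatedEqualTo(attribute: String, value: String, collation: String)
      extends Filter

  type Row = Map[String, String]

  // How a data source in this model would evaluate each filter.
  def matches(filter: Filter, row: Row): Boolean = filter match {
    case AlwaysTrue => true
    case EqualTo(attr, value) => row.get(attr).contains(value)
    // Simplification: treat UTF8_LCASE as "compare after lowercasing".
    case CollatedEqualTo(attr, value, "UTF8_LCASE") =>
      row.get(attr).exists(_.toLowerCase(Locale.ROOT) == value.toLowerCase(Locale.ROOT))
    case CollatedEqualTo(attr, value, _) => row.get(attr).contains(value)
  }

  def scan(rows: Seq[Row], filter: Filter): Seq[Row] = rows.filter(matches(filter, _))

  val rows: Seq[Row] =
    Seq(Map("name" -> "Spark"), Map("name" -> "spark"), Map("name" -> "Flink"))

  def main(args: Array[String]): Unit = {
    // Widened filter: every row survives, so the result is only correct
    // if the original predicate is evaluated again after the scan.
    println(scan(rows, AlwaysTrue).size)
    // Plain binary equality silently drops the case-insensitive matches.
    println(scan(rows, EqualTo("name", "SPARK")).size)
    // A collation-aware filter lets the source see what is actually being asked.
    println(scan(rows, CollatedEqualTo("name", "SPARK", "UTF8_LCASE")).size)
  }
}
```

In this model the widened filter keeps all three rows, binary equality keeps none, and the collation-aware filter keeps the two rows that match case-insensitively. A source that does not understand the collation can still decline the filter, rather than never seeing it at all.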

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala

cloud-fan · 2024-06-24T15:05:39Z

I think there is some misunderstanding here. Filter pushdown has a few steps:

1. Spark translates catalyst filters to data source filters. The result can be a semantic subset, as some catalyst filters do not have corresponding data source filters.
2. Spark pushes the data source filters down to the data source implementation.
3. The data source implementation tells Spark which filters need to be evaluated again on the Spark side. See DS v2 SupportsPushDownFilters.pushFilters, which returns the filters that Spark still has to evaluate.

I don't get why we need TranslatedFilter, as the problem is not from the translation layer.
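
The pushdown steps listed above can be sketched with a small, self-contained model. This is only an illustration under stated assumptions: `Predicate`, `SourceFilter`, `translate`, and `BinaryOnlySource` are hypothetical names, and only the general shape of `pushFilters` (returning the filters Spark must evaluate again) follows the description in the comment.

```scala
// Hypothetical, simplified model of the filter pushdown steps (not Spark's real API).
object PushdownSketch {
  // Catalyst-side predicates in this model.
  sealed trait Predicate
  final case class Equals(column: String, value: String) extends Predicate
  final case class UdfCall(name: String) extends Predicate // has no source-side equivalent

  // Data source filters in this model.
  sealed trait SourceFilter
  final case class EqualTo(attribute: String, value: String) extends SourceFilter
  final case class CollatedEqualTo(attribute: String, value: String, collation: String)
      extends SourceFilter

  // Step 1: translation. Predicates without a source-side counterpart are dropped,
  // so the translated filters can be a semantic subset of the original predicates.
  def translate(p: Predicate, collations: Map[String, String]): Option[SourceFilter] =
    p match {
      case Equals(column, value) =>
        collations.get(column) match {
          case Some(collation) => Some(CollatedEqualTo(column, value, collation))
          case None => Some(EqualTo(column, value))
        }
      case UdfCall(_) => None
    }

  // Steps 2 and 3: the source receives the filters and returns the ones
  // that must be evaluated again on the Spark side.
  final class BinaryOnlySource {
    private var pushed: Vector[SourceFilter] = Vector.empty

    def pushFilters(filters: Seq[SourceFilter]): Seq[SourceFilter] = {
      val (supported, unsupported) = filters.partition {
        case _: EqualTo => true
        case _: CollatedEqualTo => false // this source cannot evaluate collations
      }
      pushed = supported.toVector
      unsupported
    }

    def pushedFilters: Seq[SourceFilter] = pushed
  }

  def main(args: Array[String]): Unit = {
    val predicates = Seq(Equals("id", "7"), Equals("name", "SPARK"), UdfCall("f"))
    val translated = predicates.flatMap(translate(_, Map("name" -> "UTF8_LCASE")))
    val source = new BinaryOnlySource
    println(source.pushFilters(translated)) // filters left for Spark to re-evaluate
    println(source.pushedFilters)           // filters the source handles itself
  }
}
```

In this model the source keeps the binary `EqualTo` on `id` and hands the collated filter on `name` back for re-evaluation, which is the behaviour step 3 relies on for correctness.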

stefankandic · 2024-06-27T12:44:22Z

@cloud-fan I made some changes per our discussion, let me know what you think

sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala

...ore/src/test/scala/org/apache/spark/sql/collation/CollatedFilterPushDownToParquetSuite.scala

cloud-fan · 2024-07-01T15:06:54Z

thanks, merging to master!

github-actions bot added the SQL label Jun 21, 2024

stefankandic changed the title ~~[WIP] Fix collation predicate widening~~ [SPARK-48697] Fix collation predicate widening Jun 24, 2024

stefankandic marked this pull request as ready for review June 24, 2024 11:54

uros-db reviewed Jun 24, 2024

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala Outdated

stefankandic added 10 commits June 25, 2024 15:26

initial working version

b698def

method rename

45b87ef

fix import error

0a2dbc2

add replace part to create table

aabb1c1

fix syntax error in sql

11b97c4

clean up tests a bit

b859408

remove translated filter

410a26c

somewhat working with new ds1 filters

b958fc0

clean up tests

f3aeb0b

add docstring for new filters

a4458b6

stefankandic force-pushed the fixPredicateWidening branch from f6c6bd7 to a4458b6 Compare June 27, 2024 12:03

stefankandic added 4 commits June 27, 2024 14:06

delete unused method

becd8dc

add more tests

72c53f0

small refactor

e4a29b9

fix scalastyle

3142121

stefankandic changed the title ~~[SPARK-48697] Fix collation predicate widening~~ [SPARK-48697] Properly translate filters on collated columns Jun 27, 2024

stefankandic changed the title ~~[SPARK-48697] Properly translate filters on collated columns~~ [SPARK-48697] Add collation aware string filters Jun 27, 2024

delete old tests

0d3b2f9

stefankandic changed the title ~~[SPARK-48697] Add collation aware string filters~~ [SPARK-48697][SQL] Add collation aware string filters Jun 27, 2024

cloud-fan reviewed Jun 28, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala

cloud-fan reviewed Jun 28, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala

cloud-fan reviewed Jun 28, 2024

...ore/src/test/scala/org/apache/spark/sql/collation/CollatedFilterPushDownToParquetSuite.scala Outdated

cloud-fan approved these changes Jun 28, 2024

add evolving to all apis

9284555

cloud-fan approved these changes Jul 1, 2024

cloud-fan closed this in 703b076 Jul 1, 2024


3 participants
