Commit b9cfad8
[SPARK-36034][SQL] Rebase datetime in pushed down filters to parquet
### What changes were proposed in this pull request?
In the PR, I propose to propagate either the SQL config `spark.sql.parquet.datetimeRebaseModeInRead` or/and Parquet option `datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses the settings in conversions of dates/timestamps instances from datasource filters to values pushed via `FilterApi` to the `parquet-column` lib.
Before the changes, date/timestamp values expressed as days/microseconds/milliseconds are interpreted as offsets in Proleptic Gregorian calendar, and pushed to the parquet library as is. That works fine if timestamp/dates values in parquet files were saved in the `CORRECTED` mode but in the `LEGACY` mode, filter's values could not match to actual values.
After the changes, timestamp/dates values of filters pushed down to parquet libs such as `FilterApi.eq(col1, -719162)` are rebased according the rebase settings. For the example, if the rebase mode is `CORRECTED`, **-719162** is pushed down as is but if the current rebase mode is `LEGACY`, the number of days is rebased to **-719164**. For more context, the PR description apache#28067 shows the diffs between two calendars.
### Why are the changes needed?
The changes fix the bug portrayed by the following example from SPARK-36034:
```scala
In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
>>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
+----+
|date|
+----+
+----+
```
The result must have the date value `0001-01-01`.
### Does this PR introduce _any_ user-facing change?
In some sense, yes. Query results can be different in some cases. For the example above:
```scala
scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show(false)
+----------+
|date |
+----------+
|0001-01-01|
+----------+
```
### How was this patch tested?
By running the modified test suite `ParquetFilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV1FilterSuite"
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```
Closes apache#33347 from MaxGekk/fix-parquet-ts-filter-pushdown.
Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit b09b7f7)1 parent 8370ad2 commit b9cfad8
File tree
5 files changed
+369
-94
lines changed- sql/core/src
- main/scala/org/apache/spark/sql/execution/datasources
- parquet
- v2/parquet
- test/scala/org/apache/spark/sql/execution/datasources/parquet
5 files changed
+369
-94
lines changedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Lines changed: 12 additions & 5 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
268 | 268 | | |
269 | 269 | | |
270 | 270 | | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
271 | 274 | | |
272 | 275 | | |
273 | 276 | | |
274 | | - | |
275 | | - | |
| 277 | + | |
| 278 | + | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
276 | 286 | | |
277 | 287 | | |
278 | 288 | | |
| |||
298 | 308 | | |
299 | 309 | | |
300 | 310 | | |
301 | | - | |
302 | | - | |
303 | | - | |
304 | 311 | | |
305 | 312 | | |
306 | 313 | | |
| |||
Lines changed: 22 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
| 37 | + | |
| 38 | + | |
37 | 39 | | |
38 | 40 | | |
39 | 41 | | |
| |||
47 | 49 | | |
48 | 50 | | |
49 | 51 | | |
50 | | - | |
| 52 | + | |
| 53 | + | |
51 | 54 | | |
52 | 55 | | |
53 | 56 | | |
| |||
123 | 126 | | |
124 | 127 | | |
125 | 128 | | |
126 | | - | |
127 | | - | |
128 | | - | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
129 | 138 | | |
130 | 139 | | |
131 | | - | |
132 | | - | |
133 | | - | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
134 | 149 | | |
135 | 150 | | |
136 | 151 | | |
| |||
Lines changed: 12 additions & 5 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
133 | 133 | | |
134 | 134 | | |
135 | 135 | | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
136 | 139 | | |
137 | 140 | | |
138 | 141 | | |
139 | | - | |
140 | | - | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
141 | 151 | | |
142 | 152 | | |
143 | 153 | | |
| |||
170 | 180 | | |
171 | 181 | | |
172 | 182 | | |
173 | | - | |
174 | | - | |
175 | | - | |
176 | 183 | | |
177 | 184 | | |
178 | 185 | | |
| |||
Lines changed: 13 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
| |||
50 | 51 | | |
51 | 52 | | |
52 | 53 | | |
53 | | - | |
54 | | - | |
55 | | - | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
56 | 66 | | |
57 | 67 | | |
58 | 68 | | |
| |||
0 commit comments