[SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro #27953
Conversation
…-datetime

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
cc @cloud-fan

Test build #120003 has finished for PR 27953 at commit

…-datetime

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@cloud-fan I cannot test this on Spark 2.4.5:

```scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

val timestampSchema = s"""
  {
    "namespace": "logical",
    "type": "record",
    "name": "test",
    "fields": [
      {"name": "ts", "type": {"type": "long","logicalType": "timestamp-millis"}}
    ]
  }
"""

// Exiting paste mode, now interpreting.

scala> val df3 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df3: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")
20/03/19 03:00:37 ERROR Utils: Aborting task
org.apache.avro.AvroRuntimeException: Not a union: {"type":"long","logicalType":"timestamp-millis"}
```

The same works on the master.
@gengliangwang Is it possible to save timestamps as `timestamp-millis`?
@cloud-fan @HyukjinKwon Please review the PR.
```scala
val tsStr = "1001-01-01 01:02:03.123456"
val rebased = "1001-01-01 01:02:03.123"
val nonRebased = "1001-01-07 01:09:05.123"
val timestampSchema = """
```
nit: can we use a Scala multi-line string?
"""
|abc
|xyz
""".stripMargin
done
It's probably because the actual column is nullable (after the cast), but the specified schema is not. Maybe we've fixed something in 3.0.
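For illustration, a nullable column maps to an Avro union that includes `null`. A schema of the following shape (the same form used in the PR description below; `nullableTimestampSchema` is just an example name) avoids the `Not a union` error:

```scala
// Wrapping the logical type in a union with "null" (plus a null default)
// matches the nullable `ts` column produced by the cast.
val nullableTimestampSchema: String = """
  |{
  |  "namespace": "logical",
  |  "type": "record",
  |  "name": "test",
  |  "fields": [
  |    {"name": "ts",
  |     "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}],
  |     "default": null}
  |  ]
  |}
  |""".stripMargin
```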
Test build #120045 has finished for PR 27953 at commit

Test build #120048 has finished for PR 27953 at commit

@cloud-fan You are right. I changed the schema while writing with Spark 2.4 and everything is OK.

Test build #120067 has finished for PR 27953 at commit
thanks, merging to master/3.0!
### What changes were proposed in this pull request?

The PR addresses the issue of compatibility with Spark 2.4 and earlier versions in reading/writing dates and timestamps via the **Avro** datasource. Previous releases are based on a hybrid calendar (Julian + Gregorian). Since Spark 3.0, the Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15, when the hybrid calendar switches between the Julian and Gregorian calendars. The same local date in different calendars is converted to a different number of days since the epoch 1970-01-01. For example, the date 1001-01-01 is converted to:
- -719164 in the Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in the Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

The PR proposes rebasing from/to the Proleptic Gregorian calendar to/from the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default, meaning the rebasing is not performed by default.

The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR #27915 ([SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet) for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the same PR #27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis` and `timestamp-micros` logical types.

### Why are the changes needed?

For backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous versions and get the same result. Also, after the changes, users can enable the rebasing in write and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.

### Does this PR introduce any user-facing change?

Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+
```

### How was this patch tested?

1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The tests read back Avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]

scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")

scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> :paste
// Entering paste mode (ctrl-D to finish)

val timestampSchema = s"""
  |  {
  |    "namespace": "logical",
  |    "type": "record",
  |    "name": "test",
  |    "fields": [
  |      {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
  |    ]
  |  }
  |""".stripMargin

// Exiting paste mode, now interpreting.

scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")
```
2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased dates/timestamps, read them back with rebasing enabled/disabled, and compare the results:
- `rebasing microseconds timestamps in write`
- `rebasing milliseconds timestamps in write`
- `rebasing dates in write`

Closes #27953 from MaxGekk/rebase-avro-datetime.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4766a36)
Signed-off-by: Wenchen Fan <[email protected]>
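As a side note, here is a minimal sketch of why the day numbers diverge (my own illustration using only standard JDK APIs, not the PR's code): the legacy hybrid calendar and the proleptic Gregorian calendar assign different epoch-day values to the same pre-1582 local date.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Proleptic Gregorian day count since 1970-01-01, as Spark 3.0 computes it.
val gregorianDays: Long = LocalDate.of(1001, 1, 1).toEpochDay

// The same local date through the legacy hybrid Julian+Gregorian calendar,
// roughly how Spark 2.4 derived the stored day number.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1001, Calendar.JANUARY, 1)
val hybridDays: Long = Math.floorDiv(cal.getTimeInMillis, 86400000L)

// Without rebasing, the hybrid day number is reinterpreted as proleptic
// Gregorian, which shifts the local date (cf. 1001-01-01 vs 1001-01-07 above).
println(s"hybrid=$hybridDays, gregorian=$gregorianDays")
println(LocalDate.ofEpochDay(hybridDays))
```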
```scala
spark.read.format("avro").load(url.toString)
}

test("SPARK-31183: compatibility with Spark 2.4 in reading dates/timestamps") {
```
Missed one thing: I think the test is not very related to logical types and probably should be put in `AvroSuite`.
@MaxGekk can you move the test in your next PR?
Do you mean only this test?
All the new tests added here. They are more about compatibility, not logical types.
```scala
// (the `null` case), output the timestamp value as with millisecond precision.
case null | _: TimestampMillis => (getter, ordinal) =>
  val micros = getter.getLong(ordinal)
  val rebasedMicros = if (rebaseDateTime) {
```
One more thing: why don't we return a function rather than checking `rebaseDateTime` every time?
- I assumed timestamps in milliseconds are a rare case. By default, Spark writes microseconds.
- Checking the boolean flag shouldn't have significant overhead.
- If the function is hot, the JVM should optimize it.

I can move the flag checking out of the function body in a follow-up PR, or in the same one; see #27953 (comment)
I think it's easy to switch with almost no additional complexity. Seems fine to change it rather than relying on other optimizations like JIT, or leaving a bad example.
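A minimal sketch of the suggested shape (hypothetical names, not the PR's actual code): branch on the flag once while constructing the writer, so the per-record closure never re-checks it.

```scala
// Hypothetical helper: pick the conversion function up front instead of
// testing the `rebase` flag on every record.
def makeMicrosWriter(rebase: Boolean, rebaseMicros: Long => Long): Long => Long =
  if (rebase) rebaseMicros else identity

// Usage sketch with a toy rebase function; the real one would be
// DateTimeUtils.rebaseGregorianToJulianMicros.
val writer = makeMicrosWriter(rebase = true, micros => micros - 42L)
println(writer(1000L)) // 958
```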
…ck the rebase flag out of function bodies

### What changes were proposed in this pull request?

1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`.
2. Checking of the `rebaseDateTime` flag is moved out of function bodies.

### Why are the changes needed?

1. The tests are moved because they are not directly related to logical types.
2. Checking the flag outside of function bodies should improve performance.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By running Avro tests via the command `build/sbt avro/test`

Closes #27964 from MaxGekk/rebase-avro-datetime-followup.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…d check the rebase flag out of function bodies

### What changes were proposed in this pull request?

1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`.
2. Checking of the `rebaseDateTime` flag is moved out of function bodies.

This is a backport of #27964.

### Why are the changes needed?

1. The tests are moved because they are not directly related to logical types.
2. Checking the flag outside of function bodies should improve performance.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

By running Avro tests via the command `build/sbt avro/test`

Closes #27977 from MaxGekk/rebase-avro-datetime-followup-3.0.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…asource

### What changes were proposed in this pull request?

In the PR, I propose to add a new benchmark, `DateTimeRebaseBenchmark`, which measures the performance of rebasing dates/timestamps from/to the hybrid calendar (Julian + Gregorian) to/from the Proleptic Gregorian calendar:
1. In write, it saves dates and timestamps before and after the year 1582 separately, w/ and w/o rebasing.
2. In read, it loads the previously saved parquet files by the vectorized reader and by the regular reader.

Here is the summary of the benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?

To know the impact of the date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Ran the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>