[SPARK-30808][SQL] Enable Java 8 time API in in Thriftserver SQL CLI #28705

MaxGekk · 2020-06-02T09:20:11Z

What changes were proposed in this pull request?

Set spark.sql.datetime.java8API.enabled to true in hiveResultString(), and restore it back at the end of the call.

This is the cherry-pick of #27552 from which I reverted the changes in SparkExecuteStatementOperation.scala because @juliuszsompolski PR #28671 solves the issue of converting Java 8 types to strings.

Why are the changes needed?

Date and timestamp string literals are parsed by using Java 8 time API and Spark's session time zone. Before the changes, date/timestamp values were collected as legacy types java.sql.Date/java.sql.Timestamp, and the value of such types didn't respect the config spark.sql.session.timeZone. To have consistent view, users had to keep JVM time zone and Spark's session time zone in sync.
After the changes, formatting of date values doesn't depend on JVM time zone.
While returning dates/timestamps of Java 8 type, we can avoid dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian), and the issues related to calendar switching.
Properly handle negative years (BCE).
Consistent conversion of date/timestamp strings to/from internal Catalyst types in both direction to and from Spark.
The changes make possible the optimisation made in [SPARK-31878][SQL] Create date formatter only once in HiveResult #28687 because the date formatter becomes independent from JVM time zone settings.

Does this PR introduce any user-facing change?

Yes. Before:

spark-sql> select make_date(-44, 3, 15);
0045-03-15

After:

spark-sql> select make_date(-44, 3, 15);
-0044-03-15

How was this patch tested?

By running hive-thiftserver tests. In particular:

./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"

### What changes were proposed in this pull request? - Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call. - Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting. ### Why are the changes needed? Because of textual representation of timestamps/dates before 1582 year is incorrect: ```shell $ export TZ="America/Los_Angeles" $ ./bin/spark-sql -S ``` ```sql spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); 1001-01-01 00:07:02 ``` It must be 1001-01-01 00:**00:00**. ### Does this PR introduce any user-facing change? Yes. After the changes: ```shell $ export TZ="America/Los_Angeles" $ ./bin/spark-sql -S ``` ```sql spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20'); 1001-01-01 00:00:00 ``` ### How was this patch tested? By running hive-thiftserver tests. In particular: ``` ./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite" ``` Closes apache#27552 from MaxGekk/hive-thriftserver-java8-time-api. Authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala # sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

SparkQA · 2020-06-02T14:13:45Z

Test build #123428 has finished for PR 28705 at commit 4ea85ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-06-02T17:34:39Z

sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala

-    val executedPlan2 = df.selectExpr("array(b)").queryExecution.executedPlan
-    val result2 = HiveResult.hiveResultString(executedPlan2)
-    assert(result2 == dates.map(x => s"[$x]"))
+    withOutstandingZoneIds {


After enabling Java 8 API, the result is always the same independently from JVM and Spark's session time zones. Before it wasn't possible.

So, when Java 8 API is off, the test will fail because Java 7 based date/timestamp conversions depend on JVM time zone on the executors side and HiveResult side. If they are not in sync each other and with Spark session time zone, the result can be wrong.

MaxGekk · 2020-06-02T17:56:43Z

@cloud-fan @HyukjinKwon @juliuszsompolski Please, review this PR.

MaxGekk · 2020-06-02T19:18:08Z

sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala

+      jvmZoneId <- outstandingZoneIds
+      sessionZoneId <- outstandingZoneIds


This should test the cases when jvm default time zone and session time zone are not in sync.

juliuszsompolski · 2020-06-02T19:28:52Z

Could you make the PR title more precise by changing "in Thrift server" to "in Thriftserver SQL CLI"
This change actually affects only SQL CLI, not Thriftserver used as ODBC/JDBC server.

juliuszsompolski · 2020-06-02T19:48:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

+        val sessionWithJava8DatetimeEnabled = {
+          val cloned = ds.sparkSession.cloneSession()
+          cloned.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
+          cloned
+        }
+        sessionWithJava8DatetimeEnabled.withActive {
+          // We cannot collect the original dataset because its encoders could be created
+          // with disabled Java 8 date-time API.
+          val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.logicalPlan)
+            .queryExecution
+            .executedPlan
+            .executeCollectPublic().map(_.toSeq).toSeq
+          // We need the types so we can output struct field names
+          val types = executedPlan.output.map(_.dataType)
+          // Reformat to match hive tab delimited output.
+          result.map(_.zip(types).map(e => toHiveString(e)))
+            .map(_.mkString("\t"))
+        }
+    }


SparkSQLCLIDriver is the only non-test user of this function, and if we want the CLI to always use the new Java 8 date-time APIs, I think we could better explicitly set it there, rather than cloning the session, and cloning the Dataset here to do it.

+1. This also reminds me of #28671. Is it possible to always enable java 8 time API in the thrifter server?

This is what I did originally in previous PR 916838a but somehow we came up to the conclusion of cloning session and set Java 8 Api in hiveResultString() locally.

SparkQA · 2020-06-02T22:52:02Z

Test build #123445 has finished for PR 28705 at commit ee69a1b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

juliuszsompolski · 2020-06-03T10:49:21Z

@cloud-fan

Is it possible to always enable java 8 time API in the thrifter server?

We could, the cleanest way would be to add the set config to the session config in withLocalProperties in #28671.
But the session timezone issues there are resolved by that PR already anyway - the Timestamp object collected from the query gets converted to string using HiveResult.toHiveString anyway. No matter if it's a Timestamp, or an Instant, it will result in the same string, working correctly with the session timezone....

juliuszsompolski · 2020-06-03T10:53:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala

-      // We need the types so we can output struct field names
-      val types = executedPlan.output.map(_.dataType)
-      // Reformat to match hive tab delimited output.
-      result.map(_.zip(types).map(e => toHiveString(e)))


@MaxGekk, thinking about what I just wrote to @cloud-fan, would the toHiveString here already handle conversion using the correct session timezone, not the JVM timezone?
Or is there some other case that doesn't work? E.g. about hybrid vs. proleptic calendar?

If that is the case, then we should also set the DATETIME_JAVA8API_ENABLED in the withLocalProperties around Thriftserver JDBC/ODBC operations, to make it work correctly also there.

The problem is in dateFormatter. Currently, its legacy formatter which is used for java.sql.Date doesn't respect to the Spark session time zone and depends on JVM time zone. It works fine for Java 8 LocalDate and respect the session time zone.

I tried to fix the issue in #28709 but the fix brings more troubles than fixes.

I do believe the proper solution is to switch to Java 8 Api.

Do date formatting depend on timezone for output formatting?
I thought that timezone is only needed for date parsing, for special cases such as 'now' or 'today' or 'yesterday'?
Or is it the hybrid/proleptic calendar output formatting depend on the timezone?

@juliuszsompolski See the comment https://github.com/apache/spark/pull/28687/files#diff-303a0a0e4383242d22307edcdd82e1f1R75-R82 in the PR #28687

Following that example:

$ export TZ="Europe/Moscow" $ ./bin/spark-sql -S spark-sql> set spark.sql.session.timeZone=America/Los_Angeles; spark.sql.session.timeZone America/Los_Angeles spark-sql> select date '2020-06-03'; 2020-06-02 spark-sql> select make_date(2020, 6, 3); 2020-06-02

Could you explain why does the make_date(2020, 6, 3) -> 2020-06-02 happens?
Does make_date create a date of midnight 2020-6-3 in Moscow TZ, and it gets returned in America/Los_Angeles, where it is still 2020-6-2?
Could you explain step by step with examples what type and what timezone is used during parsing, during collecting, and for the string display before and after the changes?

The date literal '2020-06-03' (and make_date(2020, 6, 3)) is converted to the number of days since the epoch '1970-01-01'. The result is 18416, and it doesn't depend on time zone. You get the same via Java 8 API:

scala> println(LocalDate.of(2020, 6, 3).toEpochDay) 18416

The number is stored as date value internally in Spark.

To print it out, we should collect it and convert to string. The following steps are for Java 8 OFF:

The days are converted to java.sql.Date by toJavaDate() which is called from

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala

Lines 306 to 309 in b917a65

override def toScala(catalystValue: Any): Date =

if (catalystValue == null) null else DateTimeUtils.toJavaDate(catalystValue.asInstanceOf[Int])

override def toScalaImpl(row: InternalRow, column: Int): Date =

DateTimeUtils.toJavaDate(row.getInt(column))

.

toJavaDate() has to create an instance of java.sql.Date from milliseconds since the epoch 1970-01-01 00:00:00Z in UTC time zone. It converts the days 18416 to milliseconds via 18416 * 86400000 and gets 1591142400000.

1591142400000 is interpreted as local milliseconds in the JVM time zone Europe/Moscow which has wall clock offset of 10800000 millis or 3 hours. So, 1591142400000 is shifted by 10800000 to get "UTC timestamp". The result is 1591131600000 which is:

2020-06-02T21:00:00 in UTC

2020-06-03T00:00:00 in Europe/Moscow

2020-06-02T14:00:00 in America/Los_Angeles

new Date(1591131600000) is collected and formatted in toHiveString by using the legacy date formatter. Currently, the legacy date formatter ignores Spark session time zone America/Los_Angeles and uses JVM time zone Europe/Moscow. In this way, it converts new Date(1591131600000) = 2020-06-03T00:00:00 in Europe/Moscow to 2020-06-03. Looks fine but after this PR [SPARK-31901][SQL] Use the session time zone in legacy date formatters #28709, it takes America/Los_Angeles and performs the conversion 2020-06-02T14:00:00 America/Los_Angeles to 2020-06-02

So, the problem is in toJavaDate() which still uses the default JVM time zone.

And one more nuance, the legacy type java.sql.Date is not local date as Java 8 type java.time.LocalDate is. It is actually a timestamp in UTC linked to the JVM time zone. Using it as a local date is not good idea at all but this is Spark's legacy code.

So everything is OK now? Enabling jave 8 time API is only for better performance and support negative year?

BTW I doubt if we can support negative year in thriftserver. Even if the server-side can generate the datetime string correctly. The client-side parse the string using Timestamp.of which doesn't support negative year.

Everything is ok when JVM time zone on executors (where toJavaDate is called) is equal to JVM time zone on the driver (where HiveResult is initialized). And both JVM time zones are equal to Spark's session time zone.

…server-java8-time-api-2 # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala # sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala

SparkQA · 2020-06-04T01:52:35Z

Test build #123500 has finished for PR 28705 at commit 74dfe4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CheckOverflowInSum(

juliuszsompolski · 2020-06-04T11:47:47Z

Thanks for the explanations @MaxGekk . I agree it's a good fix to make it use the Java8 APIs all across the Thriftserver.
I would however do it in SparkSQLDriver like you did in 916838a instead of changing toHiveResult, and I would also add it to the withLocalProperties wrapper in SparkOperation, that will apply it to the Thriftserver JDBC/ODBC operations.
WDYT @cloud-fan ?

cloud-fan · 2020-06-04T12:27:45Z

I agree. Both SQL CLI and thriftserver have their own protocols, and Spark won't leak datetime objects to end-users. Turning on java 8 time API is only for internal consistency.

MaxGekk · 2020-06-04T17:48:43Z

Here is the PR #28729 where I enabled Java 8 time API as @juliuszsompolski proposed.

MaxGekk · 2020-06-05T16:29:41Z

I am closing this PR because #28729 has enabled Java 8 time API in Thrift server already

cloud-fan · 2020-06-05T16:35:55Z

how about the SQL CLI (SQL shell)?

MaxGekk · 2020-06-05T16:47:41Z

@cloud-fan The PR #28729 enabled Java 8 time API in CLI as well, see the test https://github.com/apache/spark/pull/28729/files#diff-f3b00321aca176ae6c1aa38ba034141eR555-R559

MaxGekk · 2020-06-05T16:56:47Z

FYI, we run ThriftServerQueryTestSuite - SQL tests in sql/core/src/test/resources/sql-tests under Java 8 time API ON but SQLQueryTestSuite with Java 8 time API OFF. So, we run the same tests for both values of spark.sql.datetime.java8API.enabled.

juliuszsompolski · 2020-06-05T17:03:04Z

FYI, we run ThriftServerQueryTestSuite - SQL tests in sql/core/src/test/resources/sql-tests under Java 8 time API ON but SQLQueryTestSuite with Java 8 time API OFF. So, we run the same tests for both values of spark.sql.datetime.java8API.enabled.

There is another difference in that SQLQueryTestSuite runs with Spark Dataframe API, so you collect the results directly using the Spark APIs, while ThriftServerQueryTestSuite runs it via JDBC connection and via Hive JDBC driver, so Date/Timestamps are being printed to string, and parsed back from String into java.sql.Date / java.sql.Timestamp by the Hive JDBC driver.
So there is more differences than one running with Java 8 API ON, and the other with OFF.

MaxGekk added 2 commits June 2, 2020 11:51

Re-gen date.sql.out

4ea85ad

probot-autolabeler bot added the SQL label Jun 2, 2020

Test with different jvm and session zone ids

ee69a1b

MaxGekk commented Jun 2, 2020

View reviewed changes

MaxGekk changed the title ~~[WIP][SPARK-30808][SQL] Enable Java 8 time API in Thrift server~~ [SPARK-30808][SQL] Enable Java 8 time API in Thrift server Jun 2, 2020

MaxGekk commented Jun 2, 2020

View reviewed changes

juliuszsompolski reviewed Jun 2, 2020

View reviewed changes

juliuszsompolski reviewed Jun 3, 2020

View reviewed changes

MaxGekk changed the title ~~[SPARK-30808][SQL] Enable Java 8 time API in Thrift server~~ [SPARK-30808][SQL] Enable Java 8 time API in in Thriftserver SQL CLI Jun 3, 2020

Merge remote-tracking branch 'remotes/origin/master' into hive-thrift…

74dfe4d

…server-java8-time-api-2 # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala # sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala

MaxGekk closed this Jun 5, 2020

MaxGekk deleted the hive-thriftserver-java8-time-api-2 branch June 5, 2020 19:45

		jvmZoneId <- outstandingZoneIds
		sessionZoneId <- outstandingZoneIds

	override def toScala(catalystValue: Any): Date =
	if (catalystValue == null) null else DateTimeUtils.toJavaDate(catalystValue.asInstanceOf[Int])
	override def toScalaImpl(row: InternalRow, column: Int): Date =
	DateTimeUtils.toJavaDate(row.getInt(column))

[SPARK-30808][SQL] Enable Java 8 time API in in Thriftserver SQL CLI #28705

[SPARK-30808][SQL] Enable Java 8 time API in in Thriftserver SQL CLI #28705

Uh oh!

Conversation

MaxGekk commented Jun 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Jun 2, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaxGekk commented Jun 2, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliuszsompolski commented Jun 2, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 2, 2020

Uh oh!

juliuszsompolski commented Jun 3, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jun 4, 2020

Uh oh!

juliuszsompolski commented Jun 4, 2020

Uh oh!

cloud-fan commented Jun 4, 2020

Uh oh!

MaxGekk commented Jun 4, 2020

Uh oh!

MaxGekk commented Jun 5, 2020

Uh oh!

cloud-fan commented Jun 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaxGekk commented Jun 5, 2020

Uh oh!

MaxGekk commented Jun 5, 2020

Uh oh!

juliuszsompolski commented Jun 5, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

MaxGekk commented Jun 2, 2020 •

edited

Loading

cloud-fan commented Jun 5, 2020 •

edited

Loading