[SPARK-30808][SQL] Enable Java 8 time API in Thriftserver SQL CLI #28705
Conversation
### What changes were proposed in this pull request?
- Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call.
- Convert collected `java.time.Instant` & `java.time.LocalDate` to `java.sql.Timestamp` and `java.sql.Date` for correct formatting (a sketch follows below).
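For illustration, here is a minimal sketch of what such a conversion looks like. This is not the PR's actual code; the helper name `toLegacy` is made up for this example:
```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}

// Map collected Java 8 date-time values back to the legacy java.sql types
// so that the existing Hive-style formatters can render them.
def toLegacy(value: Any): Any = value match {
  case i: Instant   => Timestamp.from(i)  // same instant on the timeline
  case d: LocalDate => Date.valueOf(d)    // interpreted in the default JVM zone
  case other        => other
}
```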
### Why are the changes needed?
Because the textual representation of timestamps/dates before the year 1582 is incorrect:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
It must be 1001-01-01 00:**00:00**.
### Does this PR introduce any user-facing change?
Yes. After the changes:
```shell
$ export TZ="America/Los_Angeles"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```
### How was this patch tested?
By running hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```
Closes apache#27552 from MaxGekk/hive-thriftserver-java8-time-api.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
# Conflicts:
# sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
# sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Test build #123428 has finished for PR 28705 at commit
```scala
val executedPlan2 = df.selectExpr("array(b)").queryExecution.executedPlan
val result2 = HiveResult.hiveResultString(executedPlan2)
assert(result2 == dates.map(x => s"[$x]"))
withOutstandingZoneIds {
```
After enabling the Java 8 API, the result is always the same, independently of the JVM and Spark session time zones. Before, that wasn't possible.
So, when the Java 8 API is off, the test will fail because the Java 7 based date/timestamp conversions depend on the JVM time zone on the executors side and on the HiveResult side. If they are not in sync with each other and with the Spark session time zone, the result can be wrong.
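A small standalone demonstration of that dependency (the helper `render` is written just for this illustration):
```scala
import java.sql.Date
import java.util.TimeZone

// Render the same day count under different JVM default time zones.
def render(days: Long, tz: String): String = {
  val saved = TimeZone.getDefault
  try {
    TimeZone.setDefault(TimeZone.getTimeZone(tz))
    // java.sql.Date.toString interprets the epoch millis in the default zone.
    new Date(days * 86400000L).toString
  } finally TimeZone.setDefault(saved)
}

println(render(18416L, "Europe/Moscow"))        // 2020-06-03
println(render(18416L, "America/Los_Angeles"))  // 2020-06-02
```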
@cloud-fan @HyukjinKwon @juliuszsompolski Please review this PR.
```scala
jvmZoneId <- outstandingZoneIds
sessionZoneId <- outstandingZoneIds
```
This should test the cases when the JVM default time zone and the session time zone are not in sync.
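A self-contained sketch of that iteration (the zone list here is an assumption; Spark's actual `outstandingZoneIds` may differ):
```scala
import java.time.ZoneId

val outstandingZoneIds: Seq[ZoneId] =
  Seq("UTC", "Europe/Moscow", "America/Los_Angeles").map(ZoneId.of(_))

// Exercise every (JVM zone, session zone) pair, including mismatched ones.
for {
  jvmZoneId <- outstandingZoneIds
  sessionZoneId <- outstandingZoneIds
} println(s"check: JVM zone = $jvmZoneId, session zone = $sessionZoneId")
```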
Could you make the PR title more precise by changing "in Thrift server" to "in Thriftserver SQL CLI"?
```scala
    val sessionWithJava8DatetimeEnabled = {
      val cloned = ds.sparkSession.cloneSession()
      cloned.conf.set(SQLConf.DATETIME_JAVA8API_ENABLED.key, true)
      cloned
    }
    sessionWithJava8DatetimeEnabled.withActive {
      // We cannot collect the original dataset because its encoders could be created
      // with disabled Java 8 date-time API.
      val result: Seq[Seq[Any]] = Dataset.ofRows(ds.sparkSession, ds.logicalPlan)
        .queryExecution
        .executedPlan
        .executeCollectPublic().map(_.toSeq).toSeq
      // We need the types so we can output struct field names
      val types = executedPlan.output.map(_.dataType)
      // Reformat to match hive tab delimited output.
      result.map(_.zip(types).map(e => toHiveString(e)))
        .map(_.mkString("\t"))
    }
  }
```
SparkSQLCLIDriver is the only non-test user of this function, and if we want the CLI to always use the new Java 8 date-time APIs, I think it would be better to explicitly set it there, rather than cloning the session and cloning the Dataset here to do it.
+1. This also reminds me of #28671. Is it possible to always enable the Java 8 time API in the Thrift server?
This is what I did originally in the previous PR (916838a), but somehow we came to the conclusion of cloning the session and setting the Java 8 API in hiveResultString() locally.
Test build #123445 has finished for PR 28705 at commit
We could; the cleanest way would be to add the config to the session config in withLocalProperties in #28671.
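A hedged sketch of that idea (the helper name and placement are approximations, not the actual Spark code):
```scala
import org.apache.spark.sql.SparkSession

// Enable the Java 8 date-time API for the duration of an operation and
// restore the previous value afterwards.
def withJava8DatetimeApi[T](session: SparkSession)(body: => T): T = {
  val key = "spark.sql.datetime.java8API.enabled"
  val old = session.conf.getOption(key)
  session.conf.set(key, "true")
  try body
  finally old match {
    case Some(v) => session.conf.set(key, v)
    case None    => session.conf.unset(key)
  }
}
```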
```scala
// We need the types so we can output struct field names
val types = executedPlan.output.map(_.dataType)
// Reformat to match hive tab delimited output.
result.map(_.zip(types).map(e => toHiveString(e)))
```
@MaxGekk, thinking about what I just wrote to @cloud-fan, would the toHiveString here already handle conversion using the correct session timezone, not the JVM timezone?
Or is there some other case that doesn't work? E.g. about hybrid vs. proleptic calendar?
If that is the case, then we should also set DATETIME_JAVA8API_ENABLED in the withLocalProperties around Thriftserver JDBC/ODBC operations, to make it work correctly there as well.
The problem is in the dateFormatter. Currently, its legacy formatter, which is used for java.sql.Date, doesn't respect the Spark session time zone and depends on the JVM time zone. It works fine for the Java 8 LocalDate and respects the session time zone.
I tried to fix the issue in #28709 but the fix brings more troubles than fixes.
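For contrast, a tiny demonstration that `java.time.LocalDate` formatting never consults a time zone:
```scala
import java.time.LocalDate
import java.util.TimeZone

// LocalDate has no zone component, so its string form is stable.
TimeZone.setDefault(TimeZone.getTimeZone("Europe/Moscow"))
println(LocalDate.of(2020, 6, 3))  // 2020-06-03
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
println(LocalDate.of(2020, 6, 3))  // still 2020-06-03
```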
I do believe the proper solution is to switch to the Java 8 API.
Does date formatting depend on the timezone for output formatting?
I thought the timezone is only needed for date parsing, for special cases such as 'now', 'today', or 'yesterday'.
Or does the hybrid/proleptic calendar output formatting depend on the timezone?
Following that example:
```shell
$ export TZ="Europe/Moscow"
$ ./bin/spark-sql -S
```
```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select date '2020-06-03';
2020-06-02
spark-sql> select make_date(2020, 6, 3);
2020-06-02
```
Could you explain why make_date(2020, 6, 3) -> 2020-06-02 happens?
Does make_date create a date of midnight 2020-06-03 in the Moscow TZ, which then gets returned in America/Los_Angeles, where it is still 2020-06-02?
Could you explain step by step with examples what type and what timezone is used during parsing, during collecting, and for the string display before and after the changes?
- The date literal `'2020-06-03'` (and `make_date(2020, 6, 3)`) is converted to the number of days since the epoch `1970-01-01`. The result is 18416, and it doesn't depend on the time zone. You get the same via the Java 8 API:
  ```scala
  scala> println(LocalDate.of(2020, 6, 3).toEpochDay)
  18416
  ```
  The number is stored as the date value internally in Spark.
- To print it out, we should collect it and convert it to a string. The following steps are for Java 8 API OFF:
  - The days are converted to `java.sql.Date` by `toJavaDate()`, which is called from `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala` (lines 306 to 309 in b917a65):
    ```scala
    override def toScala(catalystValue: Any): Date =
      if (catalystValue == null) null
      else DateTimeUtils.toJavaDate(catalystValue.asInstanceOf[Int])
    override def toScalaImpl(row: InternalRow, column: Int): Date =
      DateTimeUtils.toJavaDate(row.getInt(column))
    ```
  - `toJavaDate()` has to create an instance of `java.sql.Date` from milliseconds since the epoch 1970-01-01 00:00:00Z in the UTC time zone. It converts the days 18416 to milliseconds via `18416 * 86400000` and gets 1591142400000.
  - 1591142400000 is interpreted as local milliseconds in the JVM time zone `Europe/Moscow`, which has a wall-clock offset of 10800000 millis, or 3 hours. So, 1591142400000 is shifted by 10800000 to get the "UTC timestamp". The result is 1591131600000, which is:
    - `2020-06-02T21:00:00` in UTC
    - `2020-06-03T00:00:00` in Europe/Moscow
    - `2020-06-02T14:00:00` in America/Los_Angeles
  - `new Date(1591131600000)` is collected and formatted in `toHiveString` by using the legacy date formatter. Currently, the legacy date formatter ignores the Spark session time zone `America/Los_Angeles` and uses the JVM time zone `Europe/Moscow`. In this way, it converts `new Date(1591131600000)` = `2020-06-03T00:00:00` in Europe/Moscow to `2020-06-03`. This looks fine, but after [SPARK-31901][SQL] Use the session time zone in legacy date formatters #28709, it takes `America/Los_Angeles` and performs the conversion from `2020-06-02T14:00:00 America/Los_Angeles` to `2020-06-02`.

So, the problem is in `toJavaDate()`, which still uses the default JVM time zone.
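To make the shift concrete, here is an approximation (not Spark's exact source) of the legacy days-to-`java.sql.Date` conversion described above:
```scala
import java.sql.Date
import java.util.TimeZone

// Approximate the legacy conversion: treat days-since-epoch as UTC millis,
// then shift by the JVM default zone's offset, as in the walkthrough above.
def toJavaDateLegacy(days: Int): Date = {
  val millisUtc = days.toLong * 86400000L               // 18416 -> 1591142400000
  val offset = TimeZone.getDefault.getOffset(millisUtc) // Moscow: 10800000
  new Date(millisUtc - offset)                          // -> 1591131600000
}
```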
And one more nuance: the legacy type java.sql.Date is not a local date as the Java 8 type java.time.LocalDate is. It is actually a timestamp in UTC linked to the JVM time zone. Using it as a local date is not a good idea at all, but this is Spark's legacy code.
So everything is OK now? Is enabling the Java 8 time API only for better performance and negative-year support?
BTW, I doubt we can support negative years in the Thrift server. Even if the server side can generate the datetime string correctly, the client side parses the string using Timestamp.of, which doesn't support negative years.
Everything is OK when the JVM time zone on the executors (where toJavaDate is called) is equal to the JVM time zone on the driver (where HiveResult is initialized), and both JVM time zones are equal to Spark's session time zone.
…server-java8-time-api-2
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala
Test build #123500 has finished for PR 28705 at commit
Thanks for the explanations @MaxGekk. I agree it's a good fix to make it use the Java 8 APIs all across the Thriftserver.
I agree. Both the SQL CLI and the Thrift server have their own protocols, and Spark won't leak datetime objects to end users. Turning on the Java 8 time API is only for internal consistency.
Here is the PR #28729 where I enabled the Java 8 time API as @juliuszsompolski proposed.
I am closing this PR because #28729 has already enabled the Java 8 time API in the Thrift server.
How about the SQL CLI (SQL shell)?
@cloud-fan The PR #28729 enabled the Java 8 time API in the CLI as well, see the test https://github.com/apache/spark/pull/28729/files#diff-f3b00321aca176ae6c1aa38ba034141eR555-R559
FYI, we run
There is another difference in that SQLQueryTestSuite runs with the Spark DataFrame API, so you collect the results directly using the Spark APIs, while ThriftServerQueryTestSuite runs it via a JDBC connection and the Hive JDBC driver, so dates/timestamps are printed to strings and parsed back from String into java.sql.Date / java.sql.Timestamp by the Hive JDBC driver.
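A hedged sketch of that JDBC round trip (the connection URL is hypothetical, and the Hive JDBC driver must be on the classpath):
```scala
import java.sql.DriverManager

// The server formats the date to a string; the Hive JDBC driver parses it
// back into java.sql.Date on the client side.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
try {
  val rs = conn.createStatement().executeQuery("SELECT DATE '2020-06-03'")
  while (rs.next()) {
    val d: java.sql.Date = rs.getDate(1)
    println(d)
  }
} finally conn.close()
```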
### What changes were proposed in this pull request?
Set `spark.sql.datetime.java8API.enabled` to `true` in `hiveResultString()`, and restore it back at the end of the call. This is the cherry-pick of #27552, from which I reverted the changes in `SparkExecuteStatementOperation.scala` because @juliuszsompolski's PR #28671 solves the issue of converting Java 8 types to strings.
### Why are the changes needed?
- Collected dates/timestamps of the legacy types `java.sql.Date`/`java.sql.Timestamp` didn't respect the config `spark.sql.session.timeZone`. To have a consistent view, users had to keep the JVM time zone and Spark's session time zone in sync.
- It fixes the issue in `HiveResult` (#28687) because the date formatter becomes independent from JVM time zone settings.
### Does this PR introduce any user-facing change?
Yes. Before:
```sql
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:07:02
```
After:
```sql
spark-sql> SELECT DATE_TRUNC('MILLENNIUM', DATE '1970-03-20');
1001-01-01 00:00:00
```
### How was this patch tested?
By running hive-thriftserver tests. In particular:
```
./build/sbt -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver "hive-thriftserver/test:testOnly *SparkThriftServerProtocolVersionsSuite"
```